Text generation in Python

Text is my personal favorite medium for machine learning. Here is why:

In computing, a picture is worth a (few hundred) thousand words. As a result, text models are far more space- and compute-efficient than visual models.

Text arrived on the internet first. That lead time has produced better algorithms and a bottomless supply of data.

Early ebay.com. Source: Telegraph.

Interpretability of text is also more straightforward than with images: it must follow the grammar of the language.

Today, I will survey a few methods of generating text.

Markov model

Creating a generative text model is as simple as iteratively predicting what token is going to come next. Markov chains are very good for this task. They have many applications, from Google’s PageRank algorithm, to biological sequence alignment in bioinformatics.
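Before reaching for a library, the core idea can be sketched in plain Python: a first-order (bigram) chain in which the next token depends only on the current one. This is an illustrative toy, not how markovify is implemented internally:

```python
import random
from collections import defaultdict

def build_chain(tokens):
    # Map each token to the list of tokens observed to follow it
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length):
    # Walk the chain: each step depends only on the current token
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return out
```

Tokens that appear after the same word more often are sampled proportionally more often, which is all the "training" a Markov chain needs.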

Source: Andrew Walsh

Generating color names

For this task, we can use markovify. I also used SentencePiece as my tokenizer, and os to extract the basename from the path.

import os

import markovify
import sentencepiece as spm

First I trained the tokenizer, with the code from my post on word segmentation. You’ll have to get a text file with one document entry per line. I used David Aerne’s color-names repository, with the hex color codes stripped out.

file_name = 'colornames.txt'
name = os.path.splitext(os.path.basename(file_name))[0]

Next, train the tokenizer.

if not os.path.isfile('{0}.model'.format(name)):
    spm.SentencePieceTrainer.Train('--input={0} --model_prefix={1} --vocab_size=2000 --split_by_whitespace=False'.format(file_name, name))
sp = spm.SentencePieceProcessor()
sp.Load('{0}.model'.format(name))

Markovify knows nothing about the tokenization method, so we’ll have to save the tokenized output to a text file.

if not os.path.isfile('{0}_tokenized.txt'.format(name)):
    with open('{0}_tokenized.txt'.format(name),'w+') as f:
        for line in open(file_name):
            f.write(' '.join(str(x) for x in sp.NBestEncodeAsIds(line.strip(),20)[0])+'\n')

We are now ready to generate new sentences! The NewlineText model tells markovify to count each line in the text file as a new sentence.

generated_text_count = 0
existing_texts = open(file_name).read().split('\n')
with open('{0}_tokenized.txt'.format(name)) as f:
    text_model = markovify.NewlineText(f.read())
while generated_text_count < 100:
    tokens = text_model.make_sentence()
    if tokens:
        generated_text = sp.DecodeIds([int(x) for x in tokens.split()])
        if generated_text not in existing_texts and '⁇' not in generated_text:
            print(generated_text)
            generated_text_count += 1

Here is the result:

Coffee Adept
Ebbling Brook
Snappy Violet
Nimbus Blue
Candlewick Green
Cherokee Dignity
Bucco Tan
Light Budgie Blast
Pink Quince Jelly
Pinwheki Green
Salvia Divincial Ice
Red Rooster Combed
Raging Leaf
Cortezumi Grey
Royal Hunter Blue
Arts and Sweet
Jīn Huáng Sè Black
Angel Faceland Grass
Thu Blue
Pageant Song Thrush Egg
Blue Overdue Grey
Purple Opulusa Green
Rainbow’s Glory
Amayazaki Verdant
Sparrow’s Fire
Extraord Truck
Kourning Trendy Pink
Flayer Orange
Seachange Land
Green Paws
Tin Bitcoin
Light Sprite Gold
Cloakuroku Green
Faicha Brown
Velvet Earn
Lavender Savory
Garden Pickling Spice
Beyond the Clouds
Elemental Blue
Alien Abdu Trail
Basilisk Red
Free Species
Dark Summy Green
Ryoku-Out Crimson
Brandi Bear
Dark Summitasse
Ombral Grey
Pale Shale Green
Candlelit Pea Soup Green
Crusouchsite Artef Green
Starfoxatron Purple
Red Prax Bronze
Heavy Black Tarpon Bronze
Kemp Tan
Indulvus Orange
Fuel Townhouse Taupe
Golden Hermesan
Sunday Nick’s Wine
Aquedu Green
Edened Brown
Straw Hat Jellyfish
Oitake Mushroom
Pheasant’s Rotun
Pacific Lineage
Horrellan Green
Shikon Gold
Flickering Sea
Stained Glasses
Zames to Earth
Remembing Tide
Double Jemima
Pīlā Black
Golden Glamora Yellow
Cavern Orchid
Aruba Lisel Dip
Drover Pewter
Starship Tonic
Camel Cordial
Crystalsong Orange
Yellow Urch Bole
Angel Fai
Lowerbird Blue


Generating movie reviews

Here is the result using the reviews.txt movie reviews text file from my post on word segmentation:

This movie shows, which serves in making it all the more creepy. The locations are good, as did Mr Harris, but demented man in a true departure from her days as Spock, who plays the younger son, Sam Fuller defines his first feature Stones in this way. And it should be noted, he breaks up with Cedric and tells that German Shepherd is hysterical, and when somethings lacking in venom, his last student in order to build a bond to rival Breakin 2: How the Grinch Stole and Patty Hear me?” Amazing. But the game has offered in recent years and it looks like they had a great time making this film,Clerks, I started watching.

This review follows the correct syntax, but it is not very coherent.

Generating EULAs

I also tried it on 25 MB worth of end-user license agreements:

GNU GENERAL PUBLIC LICENSE Micrium covered work so as to satisfy simultaneously your obligations under this Section. 9.7. By using the Software on as many computers (e.gnu.org # bound or license with respect to any copies of the Software in systems and software programs licensed # * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Your Extensions except as expressly permitted herein) to You and Syncro reserves the right after Symantecs 5.1 Scope and Grant You may not redistribute, convert, or otherwise obtain access to, or misappropriated. The term which is not derived from or Specific License Terms. # 15. Customer means the entity or person who is otherwise qualified to give any third party, for a price no * Redistributions of source code must retain the above copyright nost vzniku ali el cual. This guarantees under the License. Right to Use and support services that you use, copy, free of charge, to any person or similar purpose) running an Axway application; prohibited. If so, those terms apply. License URL You are not licensed to U.S. federal agency has suspended) for a specified term, or (2) your violation of the rights of any person (or expenditure, a Major Component, or Clip (see https://developer.nvidia.com). (forceable provision is entered. services which competes with the Software. For purposes of this Agreement, and the terms for supplements, EULA shall be construed, interpreted and governed by the laws of the State of Washington. If you acquired the I. In doing so, you must comply with any technical limitations in the software that audit. Licensee shall not permit others to do any of the foregoing. 9.1 You must not changed and State courts located in Zuf den Professional-Verwanym okres couvant les dommages directs uniquement hauteur de 5,00 $ US$5.00. THE FOREGOING PROVISIONS SHALL BE ENFORCEABLE AGAINST THE OTHER PROVISIONS OF THIS SECTION 9: 8. Limitation of Liability 4. FORCE AND SUPPORT SERVICES. 
Because this software is as is permitted only in cases involving personal injury) exceed the amount paid for that it any title to the Deliverables and/or granted funds)”) contained in this product/service/LICENSE “Tools or that you have already distributed, where permitted, and only if the SOFTWARE is is not required to print an announcement including an appropriate $1, and subject to, your rights under the prior license and that you will not utilize, in any other manner, without limitation, consequential or other damages.

This example is a good parody of legalese, but it still does not have any long-term dependencies. Let’s try a more sophisticated model using TensorFlow.

Recurrent Neural Network (RNN) model

Next, I adapted code from TensorFlow's Text Generation with an RNN tutorial for use with SentencePiece.

import os

import sentencepiece as spm
import tensorflow as tf
import numpy as np

name = 'pos_train'
vocab_size = 512

if not os.path.isfile('{0}.model'.format(name)):
    spm.SentencePieceTrainer.Train('--input={0}.txt --model_prefix={0} --vocab_size={1} --split_by_whitespace=False'.format(name,vocab_size))

sp = spm.SentencePieceProcessor()
sp.Load('{0}.model'.format(name))

# Flatten the corpus into one id array, wrapping each line in BOS (1) and EOS (2) markers
text_as_int = np.array([
    token
    for line in open('{0}.txt'.format(name))
    for token in [1] + sp.NBestEncodeAsIds(line.strip(), 20)[0] + [2]
])

seq_length = 100
examples_per_epoch = text_as_int.size//seq_length

char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

dataset = sequences.map(split_input_target)
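Each training example is just the sequence shifted by one token: the model sees a token and learns to predict the one that follows. The shift is easy to see on a toy list of ids:

```python
# Toy chunk of token ids, mirroring what split_input_target does per sequence
chunk = [10, 11, 12, 13, 14]
input_text, target_text = chunk[:-1], chunk[1:]
print(input_text)   # [10, 11, 12, 13]
print(target_text)  # [11, 12, 13, 14]
```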

BATCH_SIZE = 64
BUFFER_SIZE = 10000

steps_per_epoch = examples_per_epoch//BATCH_SIZE
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
embedding_dim = 256
rnn_units = 1024

import functools

if tf.test.is_gpu_available():
  rnn = tf.keras.layers.CuDNNGRU
else:
  rnn = functools.partial(tf.keras.layers.GRU, recurrent_activation='sigmoid')

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
      batch_input_shape=[batch_size, None]),
    rnn(rnn_units,
      return_sequences=True,
      recurrent_initializer='glorot_uniform',
      stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=BATCH_SIZE)

for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)

print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

model.compile(optimizer='adam', loss=loss)

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, save_weights_only=True)

EPOCHS = 10  # adjust to taste
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[checkpoint_callback])

# Rebuild with batch size 1 for generation and restore the trained weights
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

# Generate text

num_generate = 1000
temperature = 0.5

start_string=u"This was a very good "
input_eval = sp.EncodeAsIds(start_string)
input_eval = tf.expand_dims(input_eval, 0)
text_generated = []
for i in range(num_generate):
  predictions = model(input_eval)
  predictions = tf.squeeze(predictions, 0)
  predictions = predictions / temperature
  predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy().item()
  input_eval = tf.expand_dims([predicted_id], 0)
  if predicted_id == 1:      # BOS marker
      piece = ''
  elif predicted_id == 2:    # EOS marker
      piece = '\n'
  else:
      piece = sp.IdToPiece(predicted_id).replace('▁',' ')
  text_generated.append(piece)
print(start_string + ''.join(text_generated))
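The temperature division above is what trades coherence for variety: temperatures below 1 sharpen the softmax toward the most likely token, while temperatures above 1 flatten it. A small NumPy sketch with made-up logits (not the model's actual outputs):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating
    e = np.exp(logits - logits.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits for illustration
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(temperature, probs.round(3))
```

At temperature 0.5 the top token dominates; at 2.0 the three options are much closer to equally likely.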

Here is the result:

A very different story, and really is well done. Not to mention Richard Pannley (hesit’s doctor) and telling him about the worst of being when they first saw the husband. He has this specialized father, a businessman who buys the land on him. The night in California, Anne tells of a friend of Jeffrey Kent, the movie has become a better place in Hollywood. Since it’s a rush, there are some funny bits that could have been made for kids. But, if you’re looking for, you’ll enjoy it. The acting is stunning. The direction is solid, with plenty of superb performances. The editing and cinematography are marvelous, and the plot is good. The cinematography is amazing, and the story flows well on the wrap-up. It’s the kind of movie, there are few moments where you’ll cry just starting right. There’s too many laughs, but of course, but you’ve got the feeling that you’re watching for yourself. And there are absolutely no other things, I’ll say that this is one of the very best Dickens era films. Forget, I’m still trying to buy a copy for 100,000.00. The plot is actually a predictable plot, but it’s nice to see a third rate.

This output demonstrates longer-range dependencies and a much better grasp of sentence structure than the Markov model.

State of the art models

In early 2019, OpenAI released a generative text model called GPT-2. The results were so good that OpenAI, a company founded to make AI more open, declined to release the full trained model (although smaller versions have been released).

Adam King created a webapp that completes a seed phrase using GPT-2. You can try it out for yourself at talktotransformer.com.

Julien Chaumond extended this idea by using it as a tool for very convincing autocomplete.

Further reading

A lot of this post is based on the work of Janelle Shane. If you liked this post, definitely check her out on Twitter, and at AI Weirdness. The same methods shown here can also generate executable code, renderable SVGs, and even crochet patterns.

About the author

Hi, I'm Nathan. I'm an electrical engineer in the Los Angeles area. Keep an eye out for more content being posted soon.