Text is my personal favorite medium for machine learning. Here is why:
In computing, a picture is worth a (few hundred) thousand words: a raw 1024×1024 image is a few megabytes, while a thousand words of text fits in a few kilobytes. As a result, modeling text is more space- and compute-efficient than modeling images.
Text also arrived on the internet first. That head start has produced better algorithms and practically bottomless data.

Text is also more straightforward to interpret than images: generated output has to follow the grammar of the language.
Today, I will survey a few methods of generating text.
Markov model
Creating a generative text model can be as simple as iteratively predicting which token comes next. Markov chains are well suited to this task, and they have many other applications, from Google’s PageRank algorithm to biological sequence alignment in bioinformatics.
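To make the idea concrete, here is a minimal sketch (not the code used below) of a first-order Markov chain over whitespace tokens: count which token follows which, then sample the next token from those counts.
import random
from collections import defaultdict

# Minimal first-order Markov chain for illustration only.
def train_chain(corpus):
    chain = defaultdict(list)
    for sentence in corpus:
        tokens = sentence.split()
        for current_token, next_token in zip(tokens, tokens[1:]):
            chain[current_token].append(next_token)
    return chain

def generate(chain, start, length=5):
    token, output = start, [start]
    for _ in range(length):
        if token not in chain:
            break
        token = random.choice(chain[token])  # sample the next token from observed followers
        output.append(token)
    return ' '.join(output)

chain = train_chain(['deep sky blue', 'deep sea green', 'sea foam green'])
print(generate(chain, 'deep'))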

Generating color names
For this task, we can use markovify. I also used SentencePiece as my tokenizer and the os module to extract the basename from the file path.
import os
import markovify
import sentencepiece as spm
First, I trained the tokenizer using the code from my post on word segmentation. You’ll need a text file with one document per line. I used David Aerne’s color-names repository, with the hex color codes stripped out.
file_name = 'colornames.txt'
name = os.path.splitext(os.path.basename(file_name))[0]
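If you start from the repository’s CSV (here a hypothetical colornames.csv with one name,hex pair per line; adjust to however you downloaded the data), stripping the hex codes might look like this:
# Hypothetical prep step: keep the color names, drop the hex codes.
with open('colornames.csv') as src, open('colornames.txt', 'w') as dst:
    for row in src:
        name_part, _, hex_part = row.strip().rpartition(',')
        if name_part and name_part.lower() != 'name':  # skip a header row if there is one
            dst.write(name_part + '\n')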
Next, train the tokenizer.
if not os.path.isfile('{0}.model'.format(name)):
    spm.SentencePieceTrainer.Train('--input={0} --model_prefix={1} --vocab_size=2000 --split_by_whitespace=False'.format(file_name,name))
sp = spm.SentencePieceProcessor()
sp.load('{0}.model'.format(name))
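To sanity-check the tokenizer, you can encode a sample name and inspect the pieces (the exact segmentation depends on your training data):
# Example check; the pieces will vary with the trained model.
print(sp.EncodeAsPieces('Midnight Blue'))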
Markovify knows nothing about the tokenization method, so we’ll have to save the tokenized output to a text file.
if not os.path.isfile('{0}_tokenized.txt'.format(name)):
    with open('{0}_tokenized.txt'.format(name),'w+') as f:
        for line in open(file_name):
            f.write(' '.join(str(x) for x in sp.NBestEncodeAsIds(line.strip(),20)[0])+'\n')
Next, we can generate new sentences! The NewlineText model tells markovify to treat each line in the text file as a separate sentence.
existing_texts = open(file_name).read().split('\n')
# Build the Markov model once from the tokenized file.
with open('{0}_tokenized.txt'.format(name)) as f:
    text_model = markovify.NewlineText(f.read())
generated_text_count = 0
while generated_text_count < 100:
    tokens = text_model.make_sentence()
    if tokens:
        generated_text = ''.join(sp.DecodeIds([int(x) for x in tokens.split()]))
        # Skip names that already exist and names with undecodable pieces.
        if generated_text not in existing_texts and '⁇' not in generated_text:
            print(generated_text)
            generated_text_count += 1
Here is the result:
Coffee Adept
Ebbling Brook
Snappy Violet
Nimbus Blue
Candlewick Green
Stizmo
Cherokee Dignity
Bucco Tan
Light Budgie Blast
Pink Quince Jelly
Pinwheki Green
Satoire
Salvia Divincial Ice
Red Rooster Combed
Raging Leaf
Cortezumi Grey
Royal Hunter Blue
Poincial
Arts and Sweet
Aquifera
Jīn Huáng Sè Black
Angel Faceland Grass
Thu Blue
Pageant Song Thrush Egg
Blue Overdue Grey
Purple Opulusa Green
Rainbow’s Glory
Amayazaki Verdant
Sparrow’s Fire
Extraord Truck
Kourning Trendy Pink
Flayer Orange
Seachange Land
Green Paws
Astroturine
Tin Bitcoin
Light Sprite Gold
Cloakuroku Green
Faicha Brown
Velvet Earn
Yearsai
Lavender Savory
Garden Pickling Spice
Beyond the Clouds
Monoloo
Elemental Blue
Alien Abdu Trail
Basilisk Red
Amphitrino
Free Species
Dark Summy Green
Ryoku-Out Crimson
Brandi Bear
Dark Summitasse
Ombral Grey
Placidity
Pale Shale Green
Candlelit Pea Soup Green
Crusouchsite Artef Green
Verminal
Starfoxatron Purple
Red Prax Bronze
Heavy Black Tarpon Bronze
Kemp Tan
Cornsnik
Bottlebrook
Indulvus Orange
Fuel Townhouse Taupe
Golden Hermesan
Sunday Nick’s Wine
Aquedu Green
Edened Brown
Straw Hat Jellyfish
Oitake Mushroom
Pheasant’s Rotun
Pacific Lineage
Permafrompbush
Horrellan Green
Shikon Gold
Flickering Sea
Stained Glasses
Zames to Earth
Remembing Tide
Double Jemima
Pīlā Black
Golden Glamora Yellow
Sunburglove
Cavern Orchid
Aruba Lisel Dip
Drover Pewter
Starship Tonic
Camel Cordial
Crystalsong Orange
Knugger
Yellow Urch Bole
Pencia
Angel Fai
Harboretum
Lowerbird Blue
Sandpita
Generating movie reviews
Here is the result using the reviews.txt movie reviews text file from my post on word segmentation:
This movie shows, which serves in making it all the more creepy. The locations are good, as did Mr Harris, but demented man in a true departure from her days as Spock, who plays the younger son, Sam Fuller defines his first feature Stones in this way. And it should be noted, he breaks up with Cedric and tells that German Shepherd is hysterical, and when somethings lacking in venom, his last student in order to build a bond to rival Breakin 2: How the Grinch Stole and Patty Hear me?” Amazing. But the game has offered in recent years and it looks like they had a great time making this film,Clerks, I started watching.
This review follows the correct syntax, but it is not very coherent.
Generating EULAs
I also tried it on 25 MB worth of end-user license agreements:
GNU GENERAL PUBLIC LICENSE Micrium covered work so as to satisfy simultaneously your obligations under this Section. 9.7. By using the Software on as many computers (e.gnu.org # bound or license with respect to any copies of the Software in systems and software programs licensed # * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Your Extensions except as expressly permitted herein) to You and Syncro reserves the right after Symantecs
5.1 Scope and Grant You may not redistribute, convert, or otherwise obtain access to, or misappropriated. The term which is not derived from or Specific License Terms. # 15. Customer means the entity or person who is otherwise qualified to give any third party, for a price no * Redistributions of source code must retain the above copyright nost vzniku ali el cual. This guarantees under the License.
Right to Use and support services that you use, copy, free of charge, to any person or similar purpose) running an Axway application; prohibited. If so, those terms apply. License URL You are not licensed to U.S. federal agency has suspended) for a specified term, or (2) your violation of the rights of any person (or expenditure, a Major Component, or Clip (see https://developer.nvidia.com). (forceable provision is entered. services which competes with the Software. For purposes of this Agreement, and the terms for supplements, EULA shall be construed, interpreted and governed by the laws of the State of Washington. If you acquired the I. In doing so, you must comply with any technical limitations in the software that audit.
Licensee shall not permit others to do any of the foregoing. 9.1 You must not changed and State courts located in Zuf den Professional-Verwanym okres couvant les dommages directs uniquement hauteur de 5,00 $ US$5.00. THE FOREGOING PROVISIONS SHALL BE ENFORCEABLE AGAINST THE OTHER PROVISIONS OF THIS SECTION 9: 8. Limitation of Liability 4. FORCE AND SUPPORT SERVICES. Because this software is as is permitted only in cases involving personal injury) exceed the amount paid for that it any title to the Deliverables and/or granted funds)”) contained in this product/service/LICENSE
“Tools or that you have already distributed, where permitted, and only if the SOFTWARE is is not required to print an announcement including an appropriate $1, and subject to, your rights under the prior license and that you will not utilize, in any other manner, without limitation, consequential or other damages.
Next, let’s try a more sophisticated model using TensorFlow.
Recurrent Neural Network (RNN) model
Next, I adapted code from TensorFlow’s Text Generation with an RNN tutorial for use with SentencePiece.
import os
import sentencepiece as spm
import tensorflow as tf
import numpy as np
name = 'pos_train'
vocab_size = 512
if not os.path.isfile('{0}.model'.format(name)):
    spm.SentencePieceTrainer.Train('--input={0}.txt --model_prefix={0} --vocab_size={1} --split_by_whitespace=False'.format(name,vocab_size))
sp = spm.SentencePieceProcessor()
sp.load('{0}.model'.format(name))
# Encode every line, wrapping each document in the BOS (1) and EOS (2) ids, then flatten.
text_as_int = np.array([item for sublist in [[1]+sp.NBestEncodeAsIds(line.strip(),20)[0]+[2] for line in open('{0}.txt'.format(name))] for item in sublist])
seq_length = 100
examples_per_epoch = text_as_int.size//seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)
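For example, with made-up token ids, a chunk of five tokens becomes an input/target pair shifted by one position, so the model learns to predict each token from the ones before it:
example_chunk = tf.constant([5, 12, 7, 9, 3])  # hypothetical token ids
example_input, example_target = split_input_target(example_chunk)
print(example_input.numpy())   # [ 5 12  7  9]
print(example_target.numpy())  # [12  7  9  3]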
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
embedding_dim = 256
rnn_units = 1024
if tf.test.is_gpu_available():
    rnn = tf.keras.layers.CuDNNGRU
else:
    import functools
    rnn = functools.partial(tf.keras.layers.GRU, recurrent_activation='sigmoid')
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        rnn(rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
model = build_model(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=BATCH_SIZE)
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss: ", example_batch_loss.numpy().mean())
model.compile(optimizer='adam', loss=loss)
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,save_weights_only=True)
EPOCHS=3
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[checkpoint_callback])
tf.train.latest_checkpoint(checkpoint_dir)
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()
# Generate text
num_generate = 1000
temperature = 0.5
start_string=u"This was a very good "
input_eval = sp.EncodeAsIds(start_string)
input_eval = tf.expand_dims(input_eval, 0)
text_generated = []
model.reset_states()
for i in range(num_generate):
    predictions = model(input_eval)
    predictions = tf.squeeze(predictions, 0)
    # Lower temperature makes sampling more conservative; higher makes it more surprising.
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy().item()
    input_eval = tf.expand_dims([predicted_id], 0)
    if predicted_id == 1:
        piece = ''    # BOS: skip it
    elif predicted_id == 2:
        piece = '\n'  # EOS: start a new review on a new line
    else:
        piece = sp.IdToPiece(predicted_id).replace('▁',' ')
    text_generated.append(piece)
print(start_string + ''.join(text_generated))
Here is the result:
A very different story, and really is well done. Not to mention Richard Pannley (hesit’s doctor) and telling him about the worst of being when they first saw the husband. He has this specialized father, a businessman who buys the land on him. The night in California, Anne tells of a friend of Jeffrey Kent, the movie has become a better place in Hollywood. Since it’s a rush, there are some funny bits that could have been made for kids. But, if you’re looking for, you’ll enjoy it.
The acting is stunning. The direction is solid, with plenty of superb performances. The editing and cinematography are marvelous, and the plot is good. The cinematography is amazing, and the story flows well on the wrap-up. It’s the kind of movie, there are few moments where you’ll cry just starting right.
There’s too many laughs, but of course, but you’ve got the feeling that you’re watching for yourself. And there are absolutely no other things, I’ll say that this is one of the very best Dickens era films. Forget, I’m still trying to buy a copy for 100,000.00. The plot is actually a predictable plot, but it’s nice to see a third rate.
This output demonstrates longer-range dependencies and a much better grasp of sentence structure than the Markov model.
State of the art models
In early 2019, OpenAI released a generative text model called GPT-2. The results were so good that OpenAI, a company founded to make AI more open, initially withheld the full-sized model (although smaller versions have been released).
Adam King created a webapp that completes a seed phrase using GPT-2. You can try it out for yourself at talktotransformer.com.
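If you want to run the released smaller model yourself, here is a minimal sketch assuming Hugging Face’s transformers library (not something this post otherwise uses):
# Minimal sketch, assuming the transformers library and the released small GPT-2 weights.
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')
result = generator('This was a very good ', max_length=100, do_sample=True, temperature=0.7)
print(result[0]['generated_text'])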

Julien Chaumond extended this idea by using it as a tool for very convincing autocomplete.
Further reading
A lot of this post is based on the work of Janelle Shane. If you liked this post, definitely check her out on Twitter and at AI Weirdness. The same methods shown here can also generate executable code, renderable SVGs, and even crochet patterns.