What is word segmentation?
Word segmentation (also called tokenization) is the process of splitting text into a list of words. Humans do this effortlessly, but computers often need help. At a high level, you can think of segmentation as a way of boosting character-level models while also making them more human-interpretable.
Setup
First, download the IMDB Movie Reviews Dataset from here. Consolidate the reviews into a reviews.txt file, with one review per line.
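The dataset ships as one small text file per review. As a rough sketch (assuming the standard aclImdb folder layout, which is my assumption rather than something specified above), the consolidation step might look like this:

# Sketch: flatten the aclImdb layout (an assumption) into reviews.txt,
# one review per line; newlines inside a review are replaced with spaces.
import glob

paths = glob.glob('aclImdb/*/pos/*.txt') + glob.glob('aclImdb/*/neg/*.txt')

with open('reviews.txt', 'w', encoding='utf-8') as out:
    for path in paths:
        with open(path, encoding='utf-8') as f:
            out.write(f.read().replace('\n', ' ').strip() + '\n')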
Basic segmentation methods
The Python standard library comes with many useful string methods. The split method can be used for very basic segmentation tasks.
Let’s pick a random movie review.
import re, random

reviews = open('reviews.txt').readlines()
text = random.choice(reviews)
Split method
Now let’s try it out.
words = text.split()
print(words)
Here is the result:
[‘This’, ‘film’, ‘is’, ‘undoubtedly’, ‘one’, ‘of’, ‘the’, ‘greatest’, ‘landmarks’, ‘in’, ‘history’, ‘of’, ‘cinema.’, ‘By’, ‘seeing’, ‘this’, ‘film,we’, ‘can’, ‘only’, ‘retrospectively’, ‘notice’, ‘that’, ‘world’, ‘cinema’, ‘in’, ‘1950s’, ‘had’, ‘such’, ‘a’, ‘purely’, ‘humanistic’, ‘dramaturgy,such’, ‘a’, ‘strong’, ‘and’, ‘adequate’, ‘use’, ‘of’, ‘sound-image’, ‘montage,and’, ‘almost’, ‘religious’, ‘admiration’, ‘of’, ‘ethical’, ‘choices’, ‘in’, ‘human’, ‘life.’, ‘Cinema’, ‘was’, ‘then’, ‘not’, ‘only’, ‘one’, ‘form’, ‘of’, ‘arts.’, ‘It’, ‘was’, ‘much’, ‘higher’, ‘than’, ‘ordinary’, ‘life’, ‘and’, ‘it’, ‘gave’, ‘many’, ‘people’, ‘hope’, ‘to’, ‘live’, ‘after’, ‘the’, ‘tragic’, ‘war.’, ‘It’, ‘is’, ‘said,’, ‘that’, ‘even’, ‘Picasso’, ‘was’, ‘moved’, ‘and’, ‘cried’, ‘that’, ‘such’, ‘a’, ‘work’, ‘of’, ‘art’, ‘can’, ‘appear’, ‘only’, ‘once’, ‘in’, ‘100’, ‘years!’, ‘Audience’, ‘that’, ‘time’, ‘was’, ‘also’, ‘different.’, ‘I’, ‘read’, ‘that’, ‘after’, ‘seeing’, “Kurosawa’s”, ‘”Ikiru(Live)”‘, ‘in’, ‘its’, ‘first’, ‘release,’, ‘young’, ‘couple’, ‘quietly’, ‘told’, ‘each’, ‘other,”It’, ‘is’, ‘a’, ‘good’, ‘film,’, “isn’t”, ‘it?”.’, ‘I’, ‘think,contemporary’, ‘cinema,’, ‘though’, ‘technically’, ‘developed’, ‘and’, ‘opened’, ‘some’, ‘new’, ‘narrative’, ‘perspective,’, ‘has’, ‘lost’, ‘the’, ‘most’, ‘important—reliance’, ‘of’, ‘audience.Cienma’, ‘was’, ‘once’, ‘really’, ‘the’, ‘most’, ‘popular’, ‘art’, ‘from’, ‘and,’, ‘unlike’, ‘modern’, ‘fine’, ‘arts’, ‘and’, ‘contemporary’, ‘music,gave’, ‘millions’, ‘of’, ‘people’, ‘hope’, ‘and’, ‘ideals.’, ‘In’, ‘this’, ‘point’, ‘of’, ‘view,”Letyat’, ‘zhuravli”‘, ‘must’, ‘be’, ‘in’, ‘the’, ‘pantheon’, ‘of’, ‘classics’, ‘of’, ‘all’, ‘the’, ‘time,’, ‘as’, ‘”City’, ‘light”,”Ikiru”‘, ‘and’, ‘”La’, ‘Strada”.’]
This result isn’t very good, because punctuation is mixed in with regular words.
Regular expressions
Let’s try the re module instead, pulling out runs of word characters.
words = re.findall(r'\w+', text)
print(words)
Here is the result:
[‘This’, ‘film’, ‘is’, ‘undoubtedly’, ‘one’, ‘of’, ‘the’, ‘greatest’, ‘landmarks’, ‘in’, ‘history’, ‘of’, ‘cinema’, ‘By’, ‘seeing’, ‘this’, ‘film’, ‘we’, ‘can’, ‘only’, ‘retrospectively’, ‘notice’, ‘that’, ‘world’, ‘cinema’, ‘in’, ‘1950s’, ‘had’, ‘such’, ‘a’, ‘purely’, ‘humanistic’, ‘dramaturgy’, ‘such’, ‘a’, ‘strong’, ‘and’, ‘adequate’, ‘use’, ‘of’, ‘sound’, ‘image’, ‘montage’, ‘and’, ‘almost’, ‘religious’, ‘admiration’, ‘of’, ‘ethical’, ‘choices’, ‘in’, ‘human’, ‘life’, ‘Cinema’, ‘was’, ‘then’, ‘not’, ‘only’, ‘one’, ‘form’, ‘of’, ‘arts’, ‘It’, ‘was’, ‘much’, ‘higher’, ‘than’, ‘ordinary’, ‘life’, ‘and’, ‘it’, ‘gave’, ‘many’, ‘people’, ‘hope’, ‘to’, ‘live’, ‘after’, ‘the’, ‘tragic’, ‘war’, ‘It’, ‘is’, ‘said’, ‘that’, ‘even’, ‘Picasso’, ‘was’, ‘moved’, ‘and’, ‘cried’, ‘that’, ‘such’, ‘a’, ‘work’, ‘of’, ‘art’, ‘can’, ‘appear’, ‘only’, ‘once’, ‘in’, ‘100’, ‘years’, ‘Audience’, ‘that’, ‘time’, ‘was’, ‘also’, ‘different’, ‘I’, ‘read’, ‘that’, ‘after’, ‘seeing’, ‘Kurosawa’, ‘s’, ‘Ikiru’, ‘Live’, ‘in’, ‘its’, ‘first’, ‘release’, ‘young’, ‘couple’, ‘quietly’, ‘told’, ‘each’, ‘other’, ‘It’, ‘is’, ‘a’, ‘good’, ‘film’, ‘isn’, ‘t’, ‘it’, ‘I’, ‘think’, ‘contemporary’, ‘cinema’, ‘though’, ‘technically’, ‘developed’, ‘and’, ‘opened’, ‘some’, ‘new’, ‘narrative’, ‘perspective’, ‘has’, ‘lost’, ‘the’, ‘most’, ‘important’, ‘reliance’, ‘of’, ‘audience’, ‘Cienma’, ‘was’, ‘once’, ‘really’, ‘the’, ‘most’, ‘popular’, ‘art’, ‘from’, ‘and’, ‘unlike’, ‘modern’, ‘fine’, ‘arts’, ‘and’, ‘contemporary’, ‘music’, ‘gave’, ‘millions’, ‘of’, ‘people’, ‘hope’, ‘and’, ‘ideals’, ‘In’, ‘this’, ‘point’, ‘of’, ‘view’, ‘Letyat’, ‘zhuravli’, ‘must’, ‘be’, ‘in’, ‘the’, ‘pantheon’, ‘of’, ‘classics’, ‘of’, ‘all’, ‘the’, ‘time’, ‘as’, ‘City’, ‘light’, ‘Ikiru’, ‘and’, ‘La’, ‘Strada’]
This result isn’t too bad, but there are still some problems. First, discarding punctuation and other non-alphanumeric characters loses information, especially for text rich in non-ASCII punctuation and symbols. Second, this method cannot produce words that contain spaces: “New York City” could really be considered one word, even though it contains spaces. A small tweak to the regular expression, shown below, mitigates the first problem; for the rest, let’s see if third-party libraries can do any better.
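As a rough sketch of that tweak (the pattern is my own, not taken from any particular library), we can keep each punctuation mark as its own token instead of throwing it away:

# Match runs of word characters, or any single character that is
# neither a word character nor whitespace (punctuation, symbols).
words = re.findall(r'\w+|[^\w\s]', text)
print(words)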
Popular word segmentation libraries
NLTK
NLTK stands for Natural Language Toolkit, and was initially released in 2001. It was designed for research and is not typically considered “production-ready”, but it has stood the test of time. It is probably the best-known natural language processing library in Python, and it is very good at some tasks, such as building parse trees.
Required packages:
import nltk

nltk.download('punkt')
Tokenization method:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
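A quick example of my own (not from the original benchmark) shows how word_tokenize differs from the regular expression above: punctuation survives as separate tokens, and contractions are split into meaningful pieces.

# Punctuation becomes its own token, and "isn't" is split into "is" + "n't".
print(word_tokenize("It is a good film, isn't it?"))
# ['It', 'is', 'a', 'good', 'film', ',', 'is', "n't", 'it', '?']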
SpaCy
SpaCy is an industrial-strength NLP library with a beautiful API. It is extensible, and it includes built-in methods for common tasks such as named entity recognition.
Required packages:
python -m spacy download en_core_web_sm
Tokenization method:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
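spaCy can also help with the multi-word problem from earlier. As a sketch of my own (using spaCy's retokenizer, not something covered in the original text), recognized entities such as "New York City" can be merged into single tokens:

# Merge each named-entity span into a single token, so multi-word names
# like "New York City" come out as one token (output depends on the model).
doc = nlp("I flew to New York City with Marie.")
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)
print([token.text for token in doc])
# e.g. ['I', 'flew', 'to', 'New York City', 'with', 'Marie', '.']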
stanfordnlp
Stanford has been a trailblazer in NLP since the beginning. stanfordnlp is a Python library from the group behind the Java-based CoreNLP toolkit, and it can also interface with CoreNLP. Its neural pipeline currently uses bidirectional LSTM networks and supports GPU acceleration.
Required packages:
import stanfordnlp

stanfordnlp.download('en')
Tokenization method:
import stanfordnlp

nlp = stanfordnlp.Pipeline()
doc = nlp(text)
print([word.text for sentence in doc.sentences for word in sentence.tokens])
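The default pipeline runs every processor (tagging, lemmatization, dependency parsing), which is overkill if all you want is segmentation. As an assumption on my part rather than something stated above, the pipeline can be limited to the tokenizer:

# Load only the tokenize processor; skips tagging, lemmatization and parsing.
nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(text)
tokens = [token.text for sentence in doc.sentences for token in sentence.tokens]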
Other libraries
There are other popular libraries, such as TextBlob, flair, Intel’s NLP Architect, and AllenNLP, but they don’t implement their own segmentation methods. Gensim provides a tokenization function in its utils module as a convenience, but the results are alphanumeric only and lowercase, similar to the regular expression mentioned previously. Most algorithm-specific libraries use other methods; for example, BERT imports a previously built vocab file.
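For completeness, here is a minimal sketch of the gensim convenience function mentioned above (the example sentence and the lowercase flag are my own choices):

from gensim.utils import tokenize

# tokenize() yields tokens lazily; lowercase=True folds everything to lower case.
print(list(tokenize("The quick brown Fox, isn't it?", lowercase=True)))
# ['the', 'quick', 'brown', 'fox', 'isn', 't', 'it']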
Unsupervised word segmentation using SentencePiece
SentencePiece’s segmentation is fully reversible: given the learned vocabulary and a token ID sequence, we can reconstruct the original text. This implies there is no information loss in the tokenized version.
Instead of using a large, language-specific package, we can use a grammar-agnostic model that learns segmentation directly from raw text. This means the same architecture can be used for many different kinds of text, including logographic languages like Chinese. We can also skip downloading large language models, and adding new terms is as simple as retraining.
Now let’s do the necessary imports. If you do not have sentencepiece installed, use pip install sentencepiece.
import os
import sentencepiece as spm
Time to train our model.
if not os.path.isfile('reviews.model'):
    spm.SentencePieceTrainer.Train('--input=reviews.txt --model_prefix=reviews --vocab_size=20000 --split_by_whitespace=False')

sp = spm.SentencePieceProcessor()
sp.load('reviews.model')
Let’s inspect our vocabulary in the newly created reviews.vocab file.
Co -10.4889
▁Old -10.4895
▁bigger -10.4899
▁or▁two -10.4903
▁give▁it▁a -10.4903
▁grace -10.4904
▁carried -10.4904
▁love▁and -10.4912
▁on▁film -10.4915
oe -10.4918
▁gags -10.492
uring -10.4922
▁legendary -10.4924
▁department -10.4936
▁creates -10.4939
▁as▁a▁child -10.4943
ress -10.4948
▁VHS -10.4951
▁vi -10.4955
▁profound -10.4955
▁broke -10.4964
▁up▁for -10.4964
▁years▁of -10.4967
▁but▁I▁think -10.4968
t▁do -10.4969
▁heavily -10.497
▁tender -10.497
▁contest -10.4972
▁news -10.4974
▁fly -10.4975
▁Britain -10.4975
Now let’s use the SentencePiece tokenizer model to tokenize an unseen sentence.
sentence = "The quick brown fox jumps over the lazy dog"

sp.EncodeAsPieces(sentence)
>> ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over▁the', '▁lazy', '▁dog']

sp.EncodeAsIds(sentence)
>> [25, 3411, 11826, 5786, 8022, 2190, 11302, 2048]
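To verify the earlier claim that nothing is lost in tokenization, we can decode the pieces or IDs back into the original string. This round-trip check is my own addition rather than part of the original write-up.

# Decoding the encoded pieces/IDs should reproduce the input exactly.
sp.DecodePieces(sp.EncodeAsPieces(sentence)) == sentence
>> True
sp.DecodeIds(sp.EncodeAsIds(sentence)) == sentence
>> True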
Benchmark comparison
