Word segmentation in Python using SentencePiece

What is word segmentation?

Word segmentation (also called tokenization) is the process of splitting text into a list of words. Humans can do this pretty easily, but computers often need help. At a higher level, you can think of segmentation as a way of boosting character-level models while also making them more human-interpretable.

Setup

First, download the IMDB Movie Reviews Dataset from here. Consolidate the reviews into a reviews.txt file, with one review per line.
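
If you grabbed the raw dataset, a short script like the one below can build reviews.txt. This is just a sketch that assumes the archive unpacks into the usual aclImdb/train and aclImdb/test folders with one review per .txt file; adjust the paths if your copy is laid out differently.

import glob

# Assumed layout: aclImdb/{train,test}/{pos,neg}/*.txt, one review per file.
paths = glob.glob('aclImdb/*/pos/*.txt') + glob.glob('aclImdb/*/neg/*.txt')

with open('reviews.txt', 'w', encoding='utf-8') as out:
    for path in paths:
        with open(path, encoding='utf-8') as f:
            # The reviews use <br /> for line breaks; flatten each review to one line.
            out.write(f.read().replace('<br />', ' ').replace('\n', ' ').strip() + '\n')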

Basic segmentation methods

The Python standard library comes with many useful methods for strings. The split method is one that can be used for very basic segmentation tasks.

Let’s pick a random movie review.

import re, random

reviews = open('reviews.txt').readlines()
text = random.choice(reviews)
print(text)

Split method

Now let’s try it out.

words = text.split()
print(words)

Here is the result:

['This', 'film', 'is', 'undoubtedly', 'one', 'of', 'the', 'greatest', 'landmarks', 'in', 'history', 'of', 'cinema.', 'By', 'seeing', 'this', 'film,we', 'can', 'only', 'retrospectively', 'notice', 'that', 'world', 'cinema', 'in', '1950s', 'had', 'such', 'a', 'purely', 'humanistic', 'dramaturgy,such', 'a', 'strong', 'and', 'adequate', 'use', 'of', 'sound-image', 'montage,and', 'almost', 'religious', 'admiration', 'of', 'ethical', 'choices', 'in', 'human', 'life.', 'Cinema', 'was', 'then', 'not', 'only', 'one', 'form', 'of', 'arts.', 'It', 'was', 'much', 'higher', 'than', 'ordinary', 'life', 'and', 'it', 'gave', 'many', 'people', 'hope', 'to', 'live', 'after', 'the', 'tragic', 'war.', 'It', 'is', 'said,', 'that', 'even', 'Picasso', 'was', 'moved', 'and', 'cried', 'that', 'such', 'a', 'work', 'of', 'art', 'can', 'appear', 'only', 'once', 'in', '100', 'years!', 'Audience', 'that', 'time', 'was', 'also', 'different.', 'I', 'read', 'that', 'after', 'seeing', "Kurosawa's", '"Ikiru(Live)"', 'in', 'its', 'first', 'release,', 'young', 'couple', 'quietly', 'told', 'each', 'other,"It', 'is', 'a', 'good', 'film,', "isn't", 'it?".', 'I', 'think,contemporary', 'cinema,', 'though', 'technically', 'developed', 'and', 'opened', 'some', 'new', 'narrative', 'perspective,', 'has', 'lost', 'the', 'most', 'important—reliance', 'of', 'audience.Cienma', 'was', 'once', 'really', 'the', 'most', 'popular', 'art', 'from', 'and,', 'unlike', 'modern', 'fine', 'arts', 'and', 'contemporary', 'music,gave', 'millions', 'of', 'people', 'hope', 'and', 'ideals.', 'In', 'this', 'point', 'of', 'view,"Letyat', 'zhuravli"', 'must', 'be', 'in', 'the', 'pantheon', 'of', 'classics', 'of', 'all', 'the', 'time,', 'as', '"City', 'light","Ikiru"', 'and', '"La', 'Strada".']

This result isn’t very good, because punctuation is mixed in with the regular words.

Regular expressions

Let’s use the re module to pull out runs of word characters instead.

words = re.findall(r'\w+', text)
print(words)

Here is the result:

['This', 'film', 'is', 'undoubtedly', 'one', 'of', 'the', 'greatest', 'landmarks', 'in', 'history', 'of', 'cinema', 'By', 'seeing', 'this', 'film', 'we', 'can', 'only', 'retrospectively', 'notice', 'that', 'world', 'cinema', 'in', '1950s', 'had', 'such', 'a', 'purely', 'humanistic', 'dramaturgy', 'such', 'a', 'strong', 'and', 'adequate', 'use', 'of', 'sound', 'image', 'montage', 'and', 'almost', 'religious', 'admiration', 'of', 'ethical', 'choices', 'in', 'human', 'life', 'Cinema', 'was', 'then', 'not', 'only', 'one', 'form', 'of', 'arts', 'It', 'was', 'much', 'higher', 'than', 'ordinary', 'life', 'and', 'it', 'gave', 'many', 'people', 'hope', 'to', 'live', 'after', 'the', 'tragic', 'war', 'It', 'is', 'said', 'that', 'even', 'Picasso', 'was', 'moved', 'and', 'cried', 'that', 'such', 'a', 'work', 'of', 'art', 'can', 'appear', 'only', 'once', 'in', '100', 'years', 'Audience', 'that', 'time', 'was', 'also', 'different', 'I', 'read', 'that', 'after', 'seeing', 'Kurosawa', 's', 'Ikiru', 'Live', 'in', 'its', 'first', 'release', 'young', 'couple', 'quietly', 'told', 'each', 'other', 'It', 'is', 'a', 'good', 'film', 'isn', 't', 'it', 'I', 'think', 'contemporary', 'cinema', 'though', 'technically', 'developed', 'and', 'opened', 'some', 'new', 'narrative', 'perspective', 'has', 'lost', 'the', 'most', 'important', 'reliance', 'of', 'audience', 'Cienma', 'was', 'once', 'really', 'the', 'most', 'popular', 'art', 'from', 'and', 'unlike', 'modern', 'fine', 'arts', 'and', 'contemporary', 'music', 'gave', 'millions', 'of', 'people', 'hope', 'and', 'ideals', 'In', 'this', 'point', 'of', 'view', 'Letyat', 'zhuravli', 'must', 'be', 'in', 'the', 'pantheon', 'of', 'classics', 'of', 'all', 'the', 'time', 'as', 'City', 'light', 'Ikiru', 'and', 'La', 'Strada']

This result isn’t too bad, but there are still some problems. First, there is information loss: punctuation and other non-word characters (including many non-ASCII symbols) are dropped entirely, so the original text cannot be recovered from the tokens. Second, this method cannot produce tokens that contain spaces; “New York City” could reasonably be treated as one word, even though it contains spaces. Let’s see if third-party libraries can do any better.
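
Before moving on, here is a quick way to make the information-loss point concrete: try to rebuild the review from the tokens we just produced.

# Rejoining the regex tokens does not reproduce the original review:
# punctuation, quotes, and the exact spacing are gone for good.
print(' '.join(words) == text)   # False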

Popular word segmentation libraries

NLTK

NLTK stands for Natural Language Toolkit and was initially released in 2001. It was designed for research, and is not typically considered “production-ready”, but it has stood the test of time. It is probably the best-known natural language processing library in Python, and it is very good at certain tasks, such as building parse trees.

Required packages:

import nltk
nltk.download('punkt')

Tokenization method:

from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
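
As a quick check on how NLTK handles punctuation and contractions, tokenize a small sentence; the Treebank-style rules split punctuation into separate tokens and break contractions like “isn’t” apart, so you should see something like this:

print(word_tokenize("It is a good film, isn't it?"))
# ['It', 'is', 'a', 'good', 'film', ',', 'is', "n't", 'it', '?']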

SpaCy

spaCy is an industrial-strength NLP library with a beautiful API. It is extensible, and includes built-in components for common tasks such as named entity recognition.

Required packages:

python -m spacy download en_core_web_sm

Tokenization method:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
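
A nice side effect of spaCy’s approach is that tokens are objects rather than plain strings, so filtering is easy. For example, to drop punctuation and whitespace tokens:

# Token objects expose attributes such as is_punct, is_space, and lemma_.
words_only = [token.text for token in doc if not token.is_punct and not token.is_space]
print(words_only[:20])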

stanfordnlp

Stanford has been a trailblazer in NLP since the beginning. stanfordnlp is the Stanford NLP Group’s Python library; it ships its own neural pipeline (built on PyTorch, using bidirectional LSTM models with optional GPU acceleration) and can also act as an interface to their Java-based CoreNLP toolkit.

Required packages:

import stanfordnlp
stanfordnlp.download('en')

Tokenization method:

import stanfordnlp

nlp = stanfordnlp.Pipeline()
doc = nlp(text)
print([word.text for sentence in doc.sentences for word in sentence.tokens])
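
The default pipeline also runs part-of-speech tagging, lemmatization, and dependency parsing, which is overkill if you only want tokens. Assuming the English models downloaded above, you can restrict it to the tokenizer:

# Run only the tokenize processor for faster segmentation.
nlp = stanfordnlp.Pipeline(lang='en', processors='tokenize')
doc = nlp(text)
tokens = [token.text for sentence in doc.sentences for token in sentence.tokens]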

Other libraries

There are other popular libraries, such as TextBlob, flair, Intel’s NLP Architect, and AllenNLP, but they don’t implement their own segmentation methods. Gensim provides a tokenization function in its utils module as a convenience, but the results are alphanumeric only and lowercase, similar to the regular expression shown earlier. Most algorithm-specific libraries do something else entirely; BERT, for example, loads a previously built vocabulary file.
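
For reference, here is what the Gensim helper looks like (assuming Gensim is installed); it returns a generator, and lowercasing is opt-in:

from gensim.utils import tokenize

# Alphanumeric-only tokens, similar to the regex approach above.
print(list(tokenize("The quick brown fox jumps over the lazy dog.", lowercase=True)))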

Unsupervised word segmentation using SentencePiece

SentencePiece segmentation also works in the reverse direction: given the trained vocabulary and a token ID sequence, we can reconstruct the original text exactly. In other words, there is no information loss in the tokenized version (we will verify this round trip below).

Instead of using a large, language-specific package, we can use a grammar-agnostic model that learns a segmentation model directly from raw text. This means the same architecture can be used for many different types of text, including logographic languages like Chinese. We can also skip downloading large language models, and updating the vocabulary with new terms is a breeze: just retrain on the new text.

Now let’s do the necessary imports. If you do not have sentencepiece installed, use pip install sentencepiece.

import os
import sentencepiece as spm

Time to train our model.

# Training writes reviews.model and reviews.vocab next to the input file.
# vocab_size caps the number of pieces; split_by_whitespace=False allows
# pieces that span spaces (e.g. '▁or▁two' below).
if not os.path.isfile('reviews.model'):
    spm.SentencePieceTrainer.Train('--input=reviews.txt --model_prefix=reviews --vocab_size=20000 --split_by_whitespace=False')

sp = spm.SentencePieceProcessor()
sp.load('reviews.model')

Let’s inspect our vocabulary in the newly created reviews.vocab file.

Co	-10.4889
▁Old	-10.4895
▁bigger	-10.4899
▁or▁two	-10.4903
▁give▁it▁a	-10.4903
▁grace	-10.4904
▁carried	-10.4904
▁love▁and	-10.4912
▁on▁film	-10.4915
oe	-10.4918
▁gags	-10.492
uring	-10.4922
▁legendary	-10.4924
▁department	-10.4936
▁creates	-10.4939
▁as▁a▁child	-10.4943
ress	-10.4948
▁VHS	-10.4951
▁vi	-10.4955
▁profound	-10.4955
▁broke	-10.4964
▁up▁for	-10.4964
▁years▁of	-10.4967
▁but▁I▁think	-10.4968
t▁do	-10.4969
▁heavily	-10.497
▁tender	-10.497
▁contest	-10.4972
▁news	-10.4974
▁fly	-10.4975
▁Britain	-10.4975
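
You can also query the loaded model directly instead of opening the .vocab file:

print(sp.GetPieceSize())       # total number of pieces (20,000 here)
print(sp.IdToPiece(25))        # look up the piece for a given ID
print(sp.PieceToId('▁The'))    # look up the ID for a given piece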

Now let’s use the SentencePiece tokenizer model to tokenize an unseen sentence.

sentence = "The quick brown fox jumps over the lazy dog"

sp.EncodeAsPieces(sentence)
>> ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over▁the', '▁lazy', '▁dog']

sp.EncodeAsIds(sentence)
>> [25, 3411, 11826, 5786, 8022, 2190, 11302, 2048]
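
Because SentencePiece is lossless, decoding the pieces or the IDs returns the original sentence, which is the reversibility property mentioned earlier.

sp.DecodePieces(sp.EncodeAsPieces(sentence))
>> 'The quick brown fox jumps over the lazy dog'

sp.DecodeIds(sp.EncodeAsIds(sentence))
>> 'The quick brown fox jumps over the lazy dog'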

Benchmark comparison

Tokenization time for 1,000 movie reviews (AMD Ryzen 7 CPU, 16 GB DDR4, Xubuntu)

About the author



Hi, I'm Nathan. I'm an electrical engineer in the Los Angeles area. Keep an eye out for more content being posted soon.

