What is tokenization?
Tokenization is the process of breaking text into smaller units called tokens, typically words or subwords, so that computers can analyze and model it. It is a foundational step in many Natural Language Processing (NLP) applications such as language translation, text summarization, and sentiment analysis.
In this post, we will explore using SentencePiece, a widely used open-source library for tokenization in Python.
Setup
Before diving into the implementation, we have to download the IMDB Movie Reviews Dataset, which can be found here.
Once the dataset is downloaded, consolidate the reviews into a single file named reviews.txt, with each review on a separate line.
Unsupervised segmentation
SentencePiece
SentencePiece is an open-source library that allows for unsupervised tokenization.
Unlike traditional tokenization methods, SentencePiece is reversible: given its vocabulary and a sequence of token IDs, it can reconstruct the original text exactly, so there is no information loss.
SentencePiece is also grammar-agnostic: it learns a segmentation model directly from raw text, which makes it suitable for many different text types, including logographic languages like Chinese. It eliminates the need for large pre-built language models, and adding new terms is as simple as retraining on updated text.
To use SentencePiece for tokenization in Python, you must first import the necessary modules. If you do not have sentencepiece installed, run pip install sentencepiece.
import os
import sentencepiece as spm
Once you have the necessary modules imported, you can use SentencePiece to train a model on your text data. The following code will train a model on the “reviews.txt” file and save it as “reviews.model”:
# Train a model only if one does not already exist
if not os.path.isfile('reviews.model'):
    spm.SentencePieceTrainer.Train('--input=reviews.txt --model_prefix=reviews --vocab_size=20000 --split_by_whitespace=False')

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load('reviews.model')
Once you have trained your SentencePiece model, you can use it to tokenize new text. Here’s an example of how to use the SentencePiece tokenizer model for tokenizing an unseen sentence:
sentence = "The quick brown fox jumps over the lazy dog"
sp.EncodeAsPieces(sentence)
>> ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over▁the', '▁lazy', '▁dog']
sp.EncodeAsIds(sentence)
>> [25, 3411, 11826, 5786, 8022, 2190, 11302, 2048]
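Because SentencePiece is lossless, the token IDs can be decoded straight back into the original text. As a quick round-trip check with the model trained above (DecodeIds is part of the standard SentencePiece Python API):
sp.DecodeIds(sp.EncodeAsIds(sentence))
>> 'The quick brown fox jumps over the lazy dog'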
Traditional segmentation
NLTK
One of the most well-known and widely used libraries is the Natural Language Toolkit (NLTK). Initially released in 2001, NLTK was designed for research and remains a powerful tool for many NLP tasks, such as building parse trees. However, it is not typically considered “production-ready”.
To use NLTK for tokenization, you will first need to download the necessary packages, such as the punkt package:
import nltk
nltk.download('punkt')
Once you have the required packages installed, you can use the word_tokenize function to tokenize the text into individual words:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
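Unlike a plain split(), word_tokenize separates punctuation and contractions into their own tokens. As a quick illustration (the output below is what NLTK’s default Treebank-based tokenizer typically produces; exact pieces may vary slightly between versions):
word_tokenize("Don't judge a book by its cover.")
>> ['Do', "n't", 'judge', 'a', 'book', 'by', 'its', 'cover', '.']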
SpaCy
Another popular library for tokenization in natural language processing and machine learning tasks is SpaCy. This library is considered industrial-strength and has a user-friendly API, making it easy to perform everyday NLP tasks such as entity recognition.
To use SpaCy for tokenization, you will first need to download the necessary packages by running the following command:
python -m spacy download en_core_web_sm
Once the packages are installed, you can use the following code to tokenize the text into individual words:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
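As a quick sanity check (a minimal sketch reusing the nlp pipeline loaded above), spaCy also splits punctuation into separate tokens instead of leaving it attached to words:
doc = nlp("The quick brown fox jumps over the lazy dog.")
print([token.text for token in doc])
>> ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']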
stanfordnlp
Another popular library for tokenization in natural language processing and machine learning tasks is stanfordnlp.
This library comes from the Stanford NLP Group behind the Java-based CoreNLP library (and can also act as an interface to it), and it offers advanced features such as a bi-directional LSTM network with GPU acceleration.
To use stanfordnlp for tokenization, you will first need to download the necessary packages by running the following commands:
import stanfordnlp
stanfordnlp.download('en')
Once the packages are installed, you can use the following code to tokenize the text into individual words:
import stanfordnlp
nlp = stanfordnlp.Pipeline()
doc = nlp(text)
print([word.text for sentence in doc.sentences for word in sentence.tokens])
Other libraries
Other popular libraries, such as TextBlob, flair, Intel’s NLP Architect, and AllenNLP, don’t use their own segmentation methods. Gensim provides a tokenization function in its utils module as a convenience, but the results are alphanumeric only and lowercase, similar to the regular-expression approach shown below.
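For reference, here is a minimal sketch of Gensim’s convenience tokenizer (assuming a recent Gensim version; the lowercase flag is passed explicitly to match the behavior described above):
from gensim.utils import tokenize
list(tokenize("The quick brown fox jumps over the lazy dog.", lowercase=True))
>> ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']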
Most algorithm-specific libraries use other methods; for example, BERT loads a pre-built vocabulary file rather than learning a segmentation on the fly.
Basic methods
When it comes to basic segmentation methods in Python, we can start with the built-in split() function.
Split method
First, we’ll pick a random movie review and then use the split() function to separate the text into individual words:
import re, random
reviews = open('reviews.txt').readlines()
text = random.choice(reviews)
words = text.split()
print(words)
The resulting output will be a list of words, separated by the default delimiter, which is whitespace:
['This', 'film', 'is', 'undoubtedly', 'one', 'of', 'the', 'greatest', 'landmarks', 'in', 'history', 'of', 'cinema.', ...]
The split() method is a simple approach, but it has its limitations: punctuation stays attached to the surrounding words (e.g. 'cinema.'), which makes the result less accurate.
Regular expressions
Another method for tokenization is using regular expressions, which allows for more control and specificity.
Here’s an example:
words = re.findall(r'\w+', text)
print(words)
The resulting output will be a list of words, with punctuation and other non-alphanumeric characters removed:
['This', 'film', 'is', 'undoubtedly', 'one', 'of', 'the', 'greatest', 'landmarks', 'in', 'history', 'of', 'cinema', ...]
This result isn’t too bad, but there are still some problems. Because punctuation and other non-alphanumeric characters are discarded, some information is lost, and the behavior on non-ASCII (UTF-8) text can be surprising.
This method also cannot keep multi-word expressions together: a phrase such as “New York City,” which arguably should be treated as a single token, is split into three.
Benchmark comparison
To demonstrate the efficiency of SentencePiece, we can compare its performance to other popular tokenization methods.
The following plot shows the tokenization time of 1,000 movie reviews using an AMD Ryzen 7 CPU and 16GB DDR4 on Xubuntu.

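If you want to reproduce a benchmark along these lines, a simple timing loop is enough. The sketch below is only an outline: it assumes the reviews list, the sp SentencePiece model, word_tokenize, and the spaCy nlp pipeline from the previous sections are already loaded, and benchmark is just a hypothetical helper name; exact numbers will vary with hardware and library versions.
import time

def benchmark(name, tokenize_fn, texts):
    # Time how long the given function takes to tokenize every text
    start = time.perf_counter()
    for t in texts:
        tokenize_fn(t)
    print(f'{name}: {time.perf_counter() - start:.2f}s')

sample = reviews[:1000]  # the first 1,000 reviews loaded earlier
benchmark('split', str.split, sample)
benchmark('regex', lambda t: re.findall(r'\w+', t), sample)
benchmark('SentencePiece', sp.EncodeAsPieces, sample)
benchmark('NLTK', word_tokenize, sample)
benchmark('spaCy', lambda t: [tok.text for tok in nlp(t)], sample)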