Word Segmentation in Python Using SentencePiece

What is word segmentation?

Word segmentation, also known as tokenization, is the process of breaking text into individual words so that computers can analyze its meaning. It underpins many Natural Language Processing (NLP) applications, such as language translation, text summarization, and sentiment analysis.

In this post, we will explore using SentencePiece, a widely used open-source library for word segmentation, to perform word segmentation in Python.

Setup

Before diving into the implementation, we need to download the IMDB Movie Reviews Dataset.

Once the dataset is downloaded, consolidate the reviews into a single file, named reviews.txt, with each review on a separate line.
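One way to do this consolidation is with a short script. This is a minimal sketch, assuming the dataset unpacks into per-review .txt files under a single directory (the usual layout); the function name and paths are ours, so adjust them to your download:

```python
from pathlib import Path

def consolidate_reviews(dataset_dir, output_file):
    """Write every review on its own line of output_file.

    Keep output_file outside dataset_dir so it is not swept
    up by the glob. Returns the number of reviews written.
    """
    count = 0
    with open(output_file, 'w', encoding='utf-8') as out:
        for path in sorted(Path(dataset_dir).rglob('*.txt')):
            # Reviews can contain line breaks; flatten each to one line.
            text = path.read_text(encoding='utf-8').replace('\n', ' ').strip()
            out.write(text + '\n')
            count += 1
    return count
```

Calling `consolidate_reviews('aclImdb', 'reviews.txt')` (or whatever your extracted folder is called) produces the one-review-per-line file used below.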

Unsupervised segmentation

SentencePiece

SentencePiece is an open-source library that allows for unsupervised word segmentation.

Unlike traditional word segmentation methods, SentencePiece is reversible: given the learned vocabulary and a sequence of token IDs, it can reconstruct the original text exactly, so no information is lost.

SentencePiece is also grammar-agnostic, meaning it learns a segmentation model directly from raw text without language-specific rules. This makes it suitable for many different text types, including logographic languages like Chinese. SentencePiece eliminates the need for hand-built lexicons, and picking up new terms is as simple as retraining on updated text.

To use SentencePiece for word segmentation in Python, you must first import the necessary modules. If you do not have sentencepiece installed, use pip install sentencepiece.

import os
import sentencepiece as spm

Once you have the necessary modules imported, you can use SentencePiece to train a model on your text data. The following code will train a model on the “reviews.txt” file and save it as “reviews.model”:

# Train once; skip if a model file already exists on disk.
# split_by_whitespace=False lets pieces span word boundaries.
if not os.path.isfile('reviews.model'):
    spm.SentencePieceTrainer.Train(
        '--input=reviews.txt --model_prefix=reviews '
        '--vocab_size=20000 --split_by_whitespace=False'
    )

sp = spm.SentencePieceProcessor()
sp.load('reviews.model')

Once you have trained your SentencePiece model, you can use it to tokenize new text. Here’s an example of how to use the SentencePiece tokenizer model for tokenizing an unseen sentence:

sentence = "The quick brown fox jumps over the lazy dog"

sp.EncodeAsPieces(sentence)
>> ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over▁the', '▁lazy', '▁dog']

sp.EncodeAsIds(sentence)
>> [25, 3411, 11826, 5786, 8022, 2190, 11302, 2048]
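The reversibility claimed earlier follows directly from this encoding: spaces are preserved as the visible '▁' (U+2581) marker, so detokenizing is just concatenating the pieces and mapping the marker back to a space. In practice you would call the library's decode methods (e.g. sp.DecodePieces), but the core idea can be sketched in plain Python:

```python
def decode_pieces(pieces):
    """Undo SentencePiece's piece encoding: concatenate the pieces and
    turn the '▁' (U+2581) space marker back into ordinary spaces."""
    return ''.join(pieces).replace('\u2581', ' ').lstrip()

pieces = ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps',
          '▁over▁the', '▁lazy', '▁dog']
print(decode_pieces(pieces))  # The quick brown fox jumps over the lazy dog
```

Note how even the merged piece '▁over▁the' round-trips cleanly back into "over the".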

Traditional segmentation

NLTK

One of the most well-known and widely used libraries is the Natural Language Toolkit (NLTK). Initially released in 2001, NLTK was designed for research and is considered a powerful tool for many NLP tasks, such as building parse trees. However, it is not typically considered "production-ready".

To use NLTK for word segmentation, you will first need to install it (pip install nltk) and download the punkt tokenizer models:

import nltk

nltk.download('punkt')

Once you have the required packages installed, you can use the word_tokenize function to tokenize the text into individual words:

from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)

spaCy

Another popular library for word segmentation in natural language processing and machine learning tasks is spaCy. This library is billed as industrial-strength and has a user-friendly API, making it easy to perform everyday NLP tasks such as named-entity recognition.

To use spaCy for word segmentation, you will first need to download a trained pipeline (the bare en shortcut is deprecated in recent versions):

python -m spacy download en_core_web_sm

Once the packages are installed, you can use the following code to tokenize the text into individual words:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]

stanfordnlp

Another option for word segmentation is stanfordnlp, from the Stanford NLP Group.

This library can interface with the Java-based CoreNLP toolkit and also ships its own neural pipeline, built on bidirectional LSTM networks with GPU acceleration.

To use stanfordnlp for word segmentation, you will need to download the necessary packages by running the following command:

import stanfordnlp

stanfordnlp.download('en')

Once the packages are installed, you can use the following code to tokenize the text into individual words:

import stanfordnlp

nlp = stanfordnlp.Pipeline()
doc = nlp(text)
print([token.text for sentence in doc.sentences for token in sentence.tokens])

Other libraries

Other popular libraries, such as TextBlob, flair, Intel's NLP Architect, and AllenNLP, do not implement their own segmentation methods. Gensim provides a word segmentation function in its utils module as a convenience, but the results are lowercase and alphanumeric only, similar to the regular-expression method covered in the Basic methods section below.

Most algorithm-specific libraries take other approaches; BERT, for example, loads a pre-built WordPiece vocabulary file.

Basic methods

For simple segmentation tasks in Python, the built-in split() method is often enough, with no third-party libraries required.

Split method

First, we’ll pick a random movie review and then use the split() function to separate the text into individual words:

import re, random

reviews = open('reviews.txt').readlines()
text = random.choice(reviews)
words = text.split()
print(words)

The resulting output will be a list of words, separated by the default delimiter, which is whitespace:

['This', 'film', 'is', 'undoubtedly', 'one', 'of', 'the', 'greatest', 'landmarks', 'in', 'history', 'of', 'cinema.', …]

The split() method is a simple approach, but it has its limitations. Punctuation stays attached to neighboring words ('cinema.' above), which makes counting and matching words less accurate.
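One lightweight mitigation, shown here purely as an illustration rather than part of the pipeline above, is to strip leading and trailing punctuation from each token using the standard library:

```python
import string

text = "This film is undoubtedly one of the greatest landmarks in history of cinema."
# strip() with string.punctuation removes punctuation only at token edges,
# so contractions like "don't" are left intact.
words = [w.strip(string.punctuation) for w in text.split()]
print(words[-1])  # 'cinema' — the trailing period is gone
```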

Regular expressions

Another method for word segmentation is using regular expressions, which allows for more control and specificity.

Here’s an example:

words = re.findall(r'\w+', text)
print(words)

The resulting output will be a list of words, with punctuation and other non-alphanumeric characters removed:

['This', 'film', 'is', 'undoubtedly', 'one', 'of', 'the', 'greatest', 'landmarks', 'in', 'history', 'of', 'cinema', …]

This result isn't too bad, but there are still some problems. Because punctuation and other non-word characters are discarded outright, this method loses information, and it can mishandle text containing emoji or other non-ASCII symbols.

This method also splits multi-word expressions, such as "New York City," which should arguably be treated as a single token.
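Both limitations are easy to demonstrate with a made-up sentence: `\w+` splits contractions apart at the apostrophe and breaks multi-word names into separate tokens:

```python
import re

# "don't" becomes two tokens, and "New York City" becomes three.
tokens = re.findall(r'\w+', "I don't live in New York City")
print(tokens)
# ['I', 'don', 't', 'live', 'in', 'New', 'York', 'City']
```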

Benchmark comparison

To gauge the speed of SentencePiece, we can compare it against the other popular word segmentation methods covered above.

The following plot shows the word segmentation time of 1,000 movie reviews using an AMD Ryzen 7 CPU and 16GB DDR4 on Xubuntu.

Segmentation time of 1,000 movie reviews using AMD Ryzen 7 CPU, 16GB DDR4 on Xubuntu

About the author



Hi, I'm Nathan. Thanks for reading! Keep an eye out for more content being posted soon.

