Exploring Latent Word Vectors using Path Finding

Welcome to my latest post where we dive into the exciting world of natural language processing!

One of the key tasks in NLP is understanding the relationships between words. In my previous post, I created a product recommendation system using word embeddings. Today, we’ll take it a step further and explore how we can use these vectors to find the shortest path between pairs of words.

First, we will discuss the advantages of using A* over other path-finding algorithms. Then, we will implement the A* path-finding algorithm to a dataset of word vectors. Finally, we will review some output examples. Let’s get started!

Dijkstra’s algorithm

Source: wikimedia.org

Dijkstra’s algorithm is a method for finding the shortest path between any two vertices of a graph. This algorithm operates by initially overestimating the distance of each vertex from the starting vertex. Then, it visits each node and its neighbors to find the shortest subpath using a greedy approach. The algorithm continues this process until it reaches the final vertex.

A* search algorithm

Next, we have the A* algorithm. A* is a best-first search algorithm. It takes into consideration three parameters: the cost of moving from the initial state to the current state (g), the estimated cost of moving from the current state to the final state (h), and the sum of g and h (f). By utilizing these parameters, A* is able to efficiently find the shortest path between two vertices.

Implementing path finding between latent vectors

To begin, I trained a word2vec model using the IMDB Movie Reviews Dataset. The IMDB dataset is a collection of 50,000 movie reviews, designed for sentiment classification. The dataset is divided into 25,000 reviews for training and 25,000 for testing.

Next, I implemented the A* pathfinding algorithm, which you can find in my project here. I adapted the algorithm from WikiClassify, a project that later became the popular Chrome extension WikiPopularity.

Results

Now that we’ve trained a word2vec model and implemented a path-finding algorithm, it’s time to see the results.

Our first example is between the words “animated” and “carrying.” The path between these two words demonstrates a progression from the idea of animation, often associated with movement and performance, to the concept of carrying, associated with movement and action.

animated -> carrying
Explored: 68, Frontier: 169, Cost: 3.356
know -> tasty
Explored: 11166, Frontier: 2618, Cost: 7.393

Next, we have the path between “know” and “tasty.” This demonstrates a progression from the idea of knowledge and understanding to the concept of taste and enjoyment.

tasty -> boxing
Explored: 461, Frontier: 1305, Cost: 4.298

In this case, we have the shortest path between the words “animated” and “carrying.” The path taken includes various words related to the arts, such as “aesthetic” and “artistic,” as well as technical and visual words like “cinematic” and “technical.” The final word in the path, “boxing,” relates to chess as a form of sport.

In our final example, we have “animated” and “carrying”. This path highlights a clear transition from words associated with negative and violent actions, such as “threaten” and “killed”, to words associated with technical manipulations, such as “drawing” and “painting”. The final word is “restore”, which bridges the form used in the art sphere.

threaten -> restore
Explored: 4820, Frontier: 2992, Cost: 6.216


About the author



Hi, I'm Nathan. Thanks for reading! Keep an eye out for more content being posted soon.


Leave a Reply

Your email address will not be published. Required fields are marked *