Welcome to my latest post where we dive into the exciting world of natural language processing!
One of the key tasks in NLP is understanding the relationships between words. In my previous post, I created a product recommendation system using word embeddings. Today, we’ll take it a step further and explore how we can use these vectors to find the shortest path between pairs of words.
First, we will discuss the advantages of using A* over other path-finding algorithms. Then, we will implement the A* path-finding algorithm to a dataset of word vectors. Finally, we will review some output examples. Let’s get started!
Dijkstra’s algorithm is a method for finding the shortest path between any two vertices of a graph. This algorithm operates by initially overestimating the distance of each vertex from the starting vertex. Then, it visits each node and its neighbors to find the shortest subpath using a greedy approach. The algorithm continues this process until it reaches the final vertex.
A* search algorithm
Next, we have the A* algorithm. A* is a best-first search algorithm. It takes into consideration three parameters: the cost of moving from the initial state to the current state (g), the estimated cost of moving from the current state to the final state (h), and the sum of g and h (f). By utilizing these parameters, A* is able to efficiently find the shortest path between two vertices.
Implementing path finding between latent vectors
To begin, I trained a word2vec model using the IMDB Movie Reviews Dataset. The IMDB dataset is a collection of 50,000 movie reviews, designed for sentiment classification. The dataset is divided into 25,000 reviews for training and 25,000 for testing.
Next, I implemented the A* pathfinding algorithm, which you can find in my project here. I adapted the algorithm from WikiClassify, a project that later became the popular Chrome extension WikiPopularity.
Now that we’ve trained a word2vec model and implemented a path-finding algorithm, it’s time to see the results.
Our first example is between the words “animated” and “carrying.” The path between these two words demonstrates a progression from the idea of animation, often associated with movement and performance, to the concept of carrying, associated with movement and action.
animated -> carrying Explored: 68, Frontier: 169, Cost: 3.356
know -> tasty Explored: 11166, Frontier: 2618, Cost: 7.393
Next, we have the path between “know” and “tasty.” This demonstrates a progression from the idea of knowledge and understanding to the concept of taste and enjoyment.
tasty -> boxing Explored: 461, Frontier: 1305, Cost: 4.298
In this case, we have the shortest path between the words “animated” and “carrying.” The path taken includes various words related to the arts, such as “aesthetic” and “artistic,” as well as technical and visual words like “cinematic” and “technical.” The final word in the path, “boxing,” relates to chess as a form of sport.
In our final example, we have “animated” and “carrying”. This path highlights a clear transition from words associated with negative and violent actions, such as “threaten” and “killed”, to words associated with technical manipulations, such as “drawing” and “painting”. The final word is “restore”, which bridges the form used in the art sphere.
threaten -> restore Explored: 4820, Frontier: 2992, Cost: 6.216