 # Bridging concepts by applying pathfinding to word vectors

Society has unlocked vast amounts of value by harnessing the power of graphs. First, we used graphs as infrastructure for running water, electricity, and the internet. Then, we used graphs as platforms for further applications: the refrigerator and washing machine are second-layer graph technologies, as are Google and Facebook.

The networks people are looking at next are built for knowledge discovery. In this post, we’ll look at how traditional pathfinding algorithms can be used to explore latent space (as an alternative to interpolation).

## Pathfinding algorithms

There are many algorithms for finding the shortest path between two points, but few are better known than Dijkstra’s algorithm and the A* algorithm.

### Dijkstra’s algorithm

Dijkstra’s algorithm is often described as “shortest-path first”: it greedily expands whichever unvisited node is currently closest to the start. Because the search fans out in every direction rather than toward a goal, it isn’t suitable for large graphs.

[Figure: Dijkstra’s algorithm. Source: wikimedia.org]
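
For reference, here is a minimal sketch of Dijkstra’s algorithm over an explicit adjacency list (not from the original project; `graph` is assumed to map each node to a list of `(neighbor, weight)` pairs):

```python
import heapq

def dijkstra(graph, start):
    # Shortest-path distances from start to every reachable node.
    dist = {start: 0.0}
    queue = [(0.0, start)]         # (distance so far, node), smallest first
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue               # skip stale queue entries
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            new_d = d + weight
            if new_d < dist.get(neighbor, float('inf')):
                dist[neighbor] = new_d
                heapq.heappush(queue, (new_d, neighbor))
    return dist
```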

### A* search algorithm

[Figure: A* vs. Weighted A*. Source: wikimedia.org]

A* is a best-first search algorithm. When a heuristic estimate of the remaining distance to the goal is available, A* uses it to decide which node to expand next. The resulting speed-up has made the A* search algorithm very popular.
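
Concretely, A* always expands the frontier node n that minimizes f(n) = g(n) + h(n), where g(n) is the cost accumulated so far and h(n) is the heuristic estimate of the cost still to go. Weighted A* scales the heuristic term, f(n) = g(n) + w·h(n) with w > 1, trading optimality guarantees for fewer node expansions; the implementation later in this post uses w = 4.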

## Applying A* to word vectors

In my last post, I discussed how word2vec can be used to qualitatively fingerprint sets of items. Since word vectors can be thought of as spatial coordinates, a shortest path through word-vector space can be found between any two words.
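
The graph never has to be built explicitly: each word is a node, its nearest neighbors under cosine similarity are its edges, and each edge is weighted by cosine distance. A minimal sketch of those two primitives, assuming a trained gensim model (the full implementation below wraps exactly this):

```python
def neighbors(model, word, branching_factor=60):
    # Outgoing edges: the word's top-k nearest neighbors by cosine similarity.
    return [w for w, _ in model.wv.most_similar(word, topn=branching_factor)]

def edge_cost(model, word1, word2):
    # Edge weight: cosine distance, so closely related words are cheap to traverse.
    return 1.0 - model.wv.similarity(word1, word2)
```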

After training on the IMDB Movie Reviews Dataset, here are some results:

```
animated -> carrying
Explored: 68, Frontier: 169, Cost: 3.356
```

animated
…scripted
…written
…played
…plays
…playing
…working
carrying

```
know -> tasty
Explored: 11166, Frontier: 2618, Cost: 7.393
```

know
…think
…thought
…felt
…feeling
…idea
…explanation
…food
…coffee
…dinner
…breakfast
tasty

```
tasty -> boxing
Explored: 461, Frontier: 1305, Cost: 4.298
```

tasty
…aesthetic
…artistic
…cinematic
…visual
…technical
…chess
boxing

```
threaten -> restore
Explored: 4820, Frontier: 2992, Cost: 6.216
```

threaten
…manages
…fails
…failed
…died
…killed
…kidnapped
…kidnapping
…manipulation
…drawing
…painting
restore

## Code

### Setup

The following is a snippet from *Word segmentation in Python* and *Building a recommendation system using product embeddings*.

```python
import os

import sentencepiece as spm
from gensim.models.doc2vec import Doc2Vec

name = 'reviews'
vocab_size = 12000
epochs = 100

# Train a SentencePiece tokenizer on the raw corpus if one doesn't exist yet.
if not os.path.isfile('{0}.model'.format(name)):
    spm.SentencePieceTrainer.Train('--input={0}.txt --model_prefix={0} --vocab_size={1} --split_by_whitespace=True'.format(name, vocab_size))

sp = spm.SentencePieceProcessor()
sp.Load('{0}.model'.format(name))  # load the trained tokenizer

if not os.path.isfile('./checkpoints/{0}-{1}.w2v'.format(name, epochs - 1)):
    # Tokenize the corpus once, writing one tab-separated line per document.
    if not os.path.isfile('{0}_tokenized.tsv'.format(name)):
        with open('{0}_tokenized.tsv'.format(name), 'w+') as f:
            for i, line in enumerate(open('{0}.txt'.format(name))):
                ids = sp.EncodeAsIds(line)
                if len(ids) > 10:
                    f.write('{0}\t{1}\n'.format(i, ' '.join(str(x) for x in ids)))

    model = Doc2Vec(vector_size=300, window=8, min_count=3, negative=5, workers=16, epochs=1)

    # get_documents and train_model are defined in the posts linked above.
    documents = get_documents('{0}_tokenized.tsv'.format(name), name)
    model.build_vocab(documents)
    model = train_model(model, documents, name, epochs)
```
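
On subsequent runs, the trained model can be reloaded rather than retrained. A sketch, assuming the checkpoint path follows the naming scheme used above:

```python
from gensim.models.doc2vec import Doc2Vec

# Reload the final checkpoint written during training (path assumed
# to follow the naming scheme above).
model = Doc2Vec.load('./checkpoints/{0}-{1}.w2v'.format(name, epochs - 1))
```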

### A* search algorithm

Adapted from WikiClassify, a project that would later become the Chrome extension WikiPopularity.

```python
import heapq

class elem_t:
    def __init__(self, value, parent=None, cost=None):
        self.value = value
        self.cost = cost
        self.column_offset = 0
        self.parent = parent

class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0  # tie-breaker so equal-cost items pop in insertion order

    def push(self, item):
        heapq.heappush(self._queue, (item.cost, self._index, item))
        self._index += 1

    def pop(self):
        index, item = heapq.heappop(self._queue)[1:]
        return item

    def length(self):
        return len(self._queue)

def get_transition_cost(word1, word2, doc2vec_model):
    # Edge weight: cosine distance between the two word vectors.
    # (For gensim >= 4.0, use doc2vec_model.wv.similarity instead.)
    return 1.0 - float(doc2vec_model.similarity(word1, word2))

def a_star_search(start_word, end_word, doc2vec_model, branching_factor=60, weight=4.):

    cost_list = {start_word: 0}

    frontier = PriorityQueue()
    start_elem = elem_t(start_word, parent=None, cost=get_transition_cost(start_word, end_word, doc2vec_model))
    frontier.push(start_elem)

    path_end = start_elem
    explored = []
    while frontier.length() > 0:

        current_node = frontier.pop()
        current_word = current_node.value
        explored.append(current_word)

        if current_word == end_word:
            path_end = current_node
            break

        # most_similar returns (word, similarity) pairs; keep only the words.
        neighbors = [x[0] for x in doc2vec_model.most_similar(current_word, topn=branching_factor)]
        if not neighbors:
            continue

        base_cost = cost_list[current_word]

        for neighbor_word in neighbors:
            if neighbor_word == current_word:
                continue
            cost = base_cost + get_transition_cost(current_word, neighbor_word, doc2vec_model)
            new_elem = elem_t(neighbor_word, parent=current_node, cost=cost)
            new_elem.column_offset = neighbors.index(neighbor_word)
            if (neighbor_word not in cost_list or cost < cost_list[neighbor_word]) and neighbor_word not in explored:
                cost_list[neighbor_word] = cost
                # f = g + w*h: weighted A* inflates the heuristic for speed.
                new_elem.cost = cost + weight * get_transition_cost(neighbor_word, end_word, doc2vec_model)
                frontier.push(new_elem)
        print("Explored: " + str(len(explored)) + ", Frontier: " + str(frontier.length()) + ", Cost: " + str(base_cost)[:5], end='\r')
    print('')

    # Walk parent pointers back from the goal to recover the path.
    path = [path_end.value]
    cur = path_end
    while cur.parent:
        cur = cur.parent
        path.append(cur.value)
    return path[::-1]
```
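
The first result above then comes from a call like the following (illustrative; it assumes both endpoints are in the trained model’s vocabulary):

```python
path = a_star_search('animated', 'carrying', model)
print(' -> '.join(path))
```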