 # Applying A* path finding to latent word vectors

## Path finding algorithms

There are many algorithms for finding the shortest path between two points, but few are better known than Dijkstra’s algorithm and the A* algorithm.

### Dijkstra’s algorithm

Dijkstra’s algorithm is often described as “shortest-path first”: it repeatedly expands the unexplored node closest to the start, which guarantees an optimal path but explores outward uniformly in every direction. Because nothing steers the search toward the goal, it can be expensive on large graphs.

*Animation of Dijkstra’s algorithm. Source: wikimedia.org*
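As a minimal sketch of the idea, here is Dijkstra’s algorithm over a small hand-built adjacency list (the graph and weights are purely illustrative):

```python
import heapq

def dijkstra(graph, start):
    """Return the shortest distance from start to every reachable node.

    graph: dict mapping node -> list of (neighbor, edge_weight) pairs.
    """
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float('inf')):
            continue  # stale heap entry; a shorter route was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float('inf')):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

graph = {'a': [('b', 1.0), ('c', 4.0)],
         'b': [('c', 2.0)],
         'c': []}
print(dijkstra(graph, 'a'))  # {'a': 0.0, 'b': 1.0, 'c': 3.0}
```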

### A* search algorithm

*Animations of A\* and Weighted A\* search. Source: wikimedia.org*

A* is a best-first search algorithm: when a heuristic estimate of the remaining distance to the goal is available, it uses that estimate to prioritize which nodes to explore first. The resulting speed-up over uninformed search has made the A* algorithm very popular.
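The only difference from Dijkstra’s algorithm is the priority used to order the frontier. Dijkstra orders nodes by the cost so far; A* adds a heuristic estimate of the cost remaining; weighted A* multiplies that estimate by a factor greater than one, trading optimality for speed (the word-vector search later in this post uses a weight of 4):

```python
def priority(cost_so_far, heuristic_to_goal, weight=1.0):
    # weight = 0  -> Dijkstra's algorithm (heuristic ignored)
    # weight = 1  -> standard A* (optimal if the heuristic is admissible)
    # weight > 1  -> weighted A* (greedier, faster, possibly suboptimal)
    return cost_so_far + weight * heuristic_to_goal
```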

## Applying A* to word vectors

In my last post, I discussed how word2vec can be used to qualitatively fingerprint sets of items. Since word vectors can be thought of as spatial coordinates, a shortest path can be found between any two words: each word’s nearest neighbors act as graph edges, weighted by the cosine distance between the two vectors.
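Concretely, the edge weight mirrors the `get_transition_cost` helper in the code section below: one minus the cosine similarity of the two vectors. A sketch in plain NumPy (the vectors here are stand-ins for trained embeddings):

```python
import numpy as np

def transition_cost(v1, v2):
    """Cosine distance: 0 for identical directions, up to 2 for opposite ones."""
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return 1.0 - float(cos_sim)

v_cat = np.array([0.9, 0.1, 0.2])
v_dog = np.array([0.8, 0.2, 0.3])
print(transition_cost(v_cat, v_dog))   # small: similar directions, cheap edge
print(transition_cost(v_cat, -v_cat))  # maximal: opposite directions
```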

After training on the IMDB Movie Reviews Dataset, here are some example paths:

```
animated -> carrying
Explored: 68, Frontier: 169, Cost: 3.356

animated
…scripted
…written
…played
…plays
…playing
…working
carrying
```

```
know -> tasty
Explored: 11166, Frontier: 2618, Cost: 7.393

know
…think
…thought
…felt
…feeling
…idea
…explanation
…food
…coffee
…dinner
…breakfast
tasty
```

```
tasty -> boxing
Explored: 461, Frontier: 1305, Cost: 4.298

tasty
…aesthetic
…artistic
…cinematic
…visual
…technical
…chess
boxing
```

```
threaten -> restore
Explored: 4820, Frontier: 2992, Cost: 6.216

threaten
…manages
…fails
…failed
…died
…killed
…kidnapped
…kidnapping
…manipulation
…drawing
…painting
restore
```

## Code

### Setup

The following is adapted from snippets in *Word segmentation in Python* and *Building a recommendation system using product embeddings*.

```python
import os

import sentencepiece as spm
from gensim.models.doc2vec import Doc2Vec

name = 'reviews'
vocab_size = 12000
epochs = 100

# Train a SentencePiece tokenizer if one doesn't already exist.
if not os.path.isfile('{0}.model'.format(name)):
    spm.SentencePieceTrainer.Train('--input={0}.txt --model_prefix={0} --vocab_size={1} --split_by_whitespace=True'.format(name, vocab_size))

sp = spm.SentencePieceProcessor()
sp.Load('{0}.model'.format(name))

if not os.path.isfile('./checkpoints/{0}-{1}.w2v'.format(name, epochs - 1)):
    # Tokenize the corpus once, keeping only lines longer than 10 tokens.
    if not os.path.isfile('{0}_tokenized.tsv'.format(name)):
        with open('{0}_tokenized.tsv'.format(name), 'w+') as f:
            for i, line in enumerate(open('{0}.txt'.format(name))):
                ids = sp.EncodeAsIds(line)
                if len(ids) > 10:
                    f.write('{0}\t{1}\n'.format(i, ' '.join(str(x) for x in ids)))

    model = Doc2Vec(vector_size=300, window=8, min_count=3, negative=5, workers=16, epochs=1)

    # get_documents and train_model are helpers from the posts linked above.
    documents = get_documents('{0}_tokenized.tsv'.format(name), name)
    model.build_vocab(documents)
    model = train_model(model, documents, name, epochs)
```

### A* search algorithm

Adapted from WikiClassify, a project that would later become the Chrome extension WikiPopularity.

```python
import heapq

class elem_t:
    """A search-tree node: a word, a link to its parent, and its priority."""
    def __init__(self, value, parent=None, cost=None):
        self.value = value
        self.cost = cost
        self.column_offset = 0
        self.parent = parent

class PriorityQueue:
    """Min-heap keyed on cost; the running index breaks ties between equal costs."""
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item):
        heapq.heappush(self._queue, (item.cost, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[2]

    def length(self):
        return len(self._queue)

def get_transition_cost(word1, word2, doc2vec_model):
    # Cosine distance between the two word vectors.
    return 1.0 - float(doc2vec_model.similarity(word1, word2))

def a_star_search(start_word, end_word, doc2vec_model, branching_factor=60, weight=4.):

    cost_list = {start_word: 0}

    frontier = PriorityQueue()
    start_elem = elem_t(start_word, parent=None, cost=get_transition_cost(start_word, end_word, doc2vec_model))
    frontier.push(start_elem)

    path_end = start_elem
    explored = set()
    while frontier.length() > 0:

        current_node = frontier.pop()
        current_word = current_node.value
        explored.add(current_word)

        if current_word == end_word:
            path_end = current_node
            break

        # most_similar returns (word, similarity) pairs; keep just the words.
        neighbors = [w for w, _ in doc2vec_model.most_similar(current_word, topn=branching_factor)]
        if not neighbors:
            continue

        base_cost = cost_list[current_word]

        for column_offset, neighbor_word in enumerate(neighbors):
            if neighbor_word == current_word:
                continue
            cost = base_cost + get_transition_cost(current_word, neighbor_word, doc2vec_model)
            if (neighbor_word not in cost_list or cost < cost_list[neighbor_word]) and neighbor_word not in explored:
                cost_list[neighbor_word] = cost
                # Weighted A* priority: cost so far plus weighted heuristic to the goal.
                new_elem = elem_t(neighbor_word, parent=current_node,
                                  cost=cost + weight * get_transition_cost(neighbor_word, end_word, doc2vec_model))
                new_elem.column_offset = column_offset
                frontier.push(new_elem)
        print("Explored: " + str(len(explored)) + ", Frontier: " + str(frontier.length()) + ", Cost: " + str(base_cost)[:5], end='\r')
    print('')

    # Walk parent pointers back from the goal, then reverse.
    path = [path_end.value]
    cur = path_end.parent
    while cur:
        path.append(cur.value)
        cur = cur.parent
    return path[::-1]
```
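The final lines above reconstruct the path by walking parent pointers from the goal node back to the start and reversing. That pattern in isolation, with a hand-built three-node chain standing in for a real search result:

```python
class Node:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent

def reconstruct(path_end):
    """Follow parent links from the goal to the start, then reverse."""
    path = []
    cur = path_end
    while cur:
        path.append(cur.value)
        cur = cur.parent
    return path[::-1]

# Build a chain start -> mid -> goal via parent pointers.
start = Node('know')
mid = Node('think', parent=start)
goal = Node('tasty', parent=mid)

print(reconstruct(goal))  # ['know', 'think', 'tasty']
```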