 # Applying A* path finding to latent word vectors

## Path finding algorithms

There are many algorithms for finding the shortest path between two points, but few are better known than Dijkstra’s algorithm and the A* algorithm.

### Dijkstra’s algorithm

Dijkstra’s algorithm is often described as “shortest-path first”: it repeatedly expands the unexplored node closest to the start, which guarantees an optimal path but explores outward uniformly in every direction. Because nothing steers the search toward the goal, it can be expensive on large graphs.

*Animation of Dijkstra’s algorithm. Source: wikimedia.org*
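As a minimal sketch of the idea, here is Dijkstra’s algorithm over a small hand-built adjacency list (the graph and weights are purely illustrative):

```python
import heapq

def dijkstra(graph, start):
    """Return the shortest distance from start to every reachable node.

    graph: dict mapping node -> list of (neighbor, edge_weight) pairs.
    """
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float('inf')):
            continue  # stale heap entry; a shorter route was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float('inf')):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

graph = {'a': [('b', 1.0), ('c', 4.0)],
         'b': [('c', 2.0)],
         'c': []}
print(dijkstra(graph, 'a'))  # {'a': 0.0, 'b': 1.0, 'c': 3.0}
```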

### A* search algorithm

*Animations of A\* and Weighted A\* search. Source: wikimedia.org*

A* is a best-first search algorithm: when a heuristic estimate of the remaining distance to the goal is available, it uses that estimate to prioritize which nodes to explore first. The resulting speed-up over uninformed search has made the A* algorithm very popular.
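The only difference from Dijkstra’s algorithm is the priority used to order the frontier. Dijkstra orders nodes by the cost so far; A* adds a heuristic estimate of the cost remaining; weighted A* multiplies that estimate by a factor greater than one, trading optimality for speed (the word-vector search later in this post uses a weight of 4):

```python
def priority(cost_so_far, heuristic_to_goal, weight=1.0):
    # weight = 0  -> Dijkstra's algorithm (heuristic ignored)
    # weight = 1  -> standard A* (optimal if the heuristic is admissible)
    # weight > 1  -> weighted A* (greedier, faster, possibly suboptimal)
    return cost_so_far + weight * heuristic_to_goal
```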

## Applying A* to word vectors

In my last post, I discussed how word2vec can be used to qualitatively fingerprint sets of items. Since word vectors can be thought of as spatial coordinates, a shortest path can be found between any two words: each word’s nearest neighbors act as graph edges, weighted by the cosine distance between the two vectors.
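Concretely, the edge weight mirrors the `get_transition_cost` helper in the code section below: one minus the cosine similarity of the two vectors. A sketch in plain NumPy (the vectors here are stand-ins for trained embeddings):

```python
import numpy as np

def transition_cost(v1, v2):
    """Cosine distance: 0 for identical directions, up to 2 for opposite ones."""
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return 1.0 - float(cos_sim)

v_cat = np.array([0.9, 0.1, 0.2])
v_dog = np.array([0.8, 0.2, 0.3])
print(transition_cost(v_cat, v_dog))   # small: similar directions, cheap edge
print(transition_cost(v_cat, -v_cat))  # maximal: opposite directions
```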

After training on the IMDB Movie Reviews Dataset, here are some example paths:

```
animated -> carrying
Explored: 68, Frontier: 169, Cost: 3.356

animated
…scripted
…written
…played
…plays
…playing
…working
carrying
```

```
know -> tasty
Explored: 11166, Frontier: 2618, Cost: 7.393

know
…think
…thought
…felt
…feeling
…idea
…explanation
…food
…coffee
…dinner
…breakfast
tasty
```

```
tasty -> boxing
Explored: 461, Frontier: 1305, Cost: 4.298

tasty
…aesthetic
…artistic
…cinematic
…visual
…technical
…chess
boxing
```

```
threaten -> restore
Explored: 4820, Frontier: 2992, Cost: 6.216

threaten
…manages
…fails
…failed
…died
…killed
…kidnapped
…kidnapping
…manipulation
…drawing
…painting
restore
```

## Code

### Setup

The following is adapted from snippets in *Word segmentation in Python* and *Building a recommendation system using product embeddings*.

```python
import os

import sentencepiece as spm
from gensim.models.doc2vec import Doc2Vec

name = 'reviews'
vocab_size = 12000
epochs = 100

# Train a SentencePiece tokenizer if one doesn't already exist.
if not os.path.isfile('{0}.model'.format(name)):
    spm.SentencePieceTrainer.Train('--input={0}.txt --model_prefix={0} --vocab_size={1} --split_by_whitespace=True'.format(name, vocab_size))

sp = spm.SentencePieceProcessor()
sp.Load('{0}.model'.format(name))

if not os.path.isfile('./checkpoints/{0}-{1}.w2v'.format(name, epochs - 1)):
    # Tokenize the corpus once, keeping only lines longer than 10 tokens.
    if not os.path.isfile('{0}_tokenized.tsv'.format(name)):
        with open('{0}_tokenized.tsv'.format(name), 'w+') as f:
            for i, line in enumerate(open('{0}.txt'.format(name))):
                ids = sp.EncodeAsIds(line)
                if len(ids) > 10:
                    f.write('{0}\t{1}\n'.format(i, ' '.join(str(x) for x in ids)))

    model = Doc2Vec(vector_size=300, window=8, min_count=3, negative=5, workers=16, epochs=1)

    # get_documents and train_model are helpers from the posts linked above.
    documents = get_documents('{0}_tokenized.tsv'.format(name), name)
    model.build_vocab(documents)
    model = train_model(model, documents, name, epochs)
```

### A* search algorithm

Adapted from WikiClassify, a project that would later become the Chrome extension WikiPopularity.

```python
import heapq

class elem_t:
    """A search-tree node: a word, a link to its parent, and its priority."""
    def __init__(self, value, parent=None, cost=None):
        self.value = value
        self.cost = cost
        self.column_offset = 0
        self.parent = parent

class PriorityQueue:
    """Min-heap keyed on cost; the running index breaks ties between equal costs."""
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item):
        heapq.heappush(self._queue, (item.cost, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[2]

    def length(self):
        return len(self._queue)

def get_transition_cost(word1, word2, doc2vec_model):
    # Cosine distance between the two word vectors.
    return 1.0 - float(doc2vec_model.similarity(word1, word2))

def a_star_search(start_word, end_word, doc2vec_model, branching_factor=60, weight=4.):

    cost_list = {start_word: 0}

    frontier = PriorityQueue()
    start_elem = elem_t(start_word, parent=None, cost=get_transition_cost(start_word, end_word, doc2vec_model))
    frontier.push(start_elem)

    path_end = start_elem
    explored = set()
    while frontier.length() > 0:

        current_node = frontier.pop()
        current_word = current_node.value
        explored.add(current_word)

        if current_word == end_word:
            path_end = current_node
            break

        # most_similar returns (word, similarity) pairs; keep just the words.
        neighbors = [w for w, _ in doc2vec_model.most_similar(current_word, topn=branching_factor)]
        if not neighbors:
            continue

        base_cost = cost_list[current_word]

        for column_offset, neighbor_word in enumerate(neighbors):
            if neighbor_word == current_word:
                continue
            cost = base_cost + get_transition_cost(current_word, neighbor_word, doc2vec_model)
            if (neighbor_word not in cost_list or cost < cost_list[neighbor_word]) and neighbor_word not in explored:
                cost_list[neighbor_word] = cost
                # Weighted A* priority: cost so far plus weighted heuristic to the goal.
                new_elem = elem_t(neighbor_word, parent=current_node,
                                  cost=cost + weight * get_transition_cost(neighbor_word, end_word, doc2vec_model))
                new_elem.column_offset = column_offset
                frontier.push(new_elem)
        print("Explored: " + str(len(explored)) + ", Frontier: " + str(frontier.length()) + ", Cost: " + str(base_cost)[:5], end='\r')
    print('')

    # Walk parent pointers back from the goal, then reverse.
    path = [path_end.value]
    cur = path_end.parent
    while cur:
        path.append(cur.value)
        cur = cur.parent
    return path[::-1]
```
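The final lines above reconstruct the path by walking parent pointers from the goal node back to the start and reversing. That pattern in isolation, with a hand-built three-node chain standing in for a real search result:

```python
class Node:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent

def reconstruct(path_end):
    """Follow parent links from the goal to the start, then reverse."""
    path = []
    cur = path_end
    while cur:
        path.append(cur.value)
        cur = cur.parent
    return path[::-1]

# Build a chain start -> mid -> goal via parent pointers.
start = Node('know')
mid = Node('think', parent=start)
goal = Node('tasty', parent=mid)

print(reconstruct(goal))  # ['know', 'think', 'tasty']
```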