Want to boost sales and stand out from the competition? Implementing a recommendation system can be the key. In this blog post, we’ll show you how to build a product recommendation system using the Python programming language.
We’ll be using doc2vec, an extension of the popular word2vec algorithm, to measure the similarity between products, helping you recommend items that are often purchased together.
Gathering the data
To start building our recommendation system, we need to gather the data. We’ll be using the Instacart Market Basket Analysis dataset, which we’ll divide into three files: departments.tsv, aisles.tsv, and orders.tsv.
Each line of these files starts with the corresponding department, aisle, or order ID, followed by a tab and a space-separated list of the associated product IDs.
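As an illustration of that layout, here is a minimal sketch of how a file like orders.tsv could be produced. It assumes rows of (order_id, product_id, add_to_cart_order), the columns found in the Instacart order-products CSVs. The function name rows_to_tsv_lines and the sample rows are hypothetical, and this is an illustrative sketch rather than the exact preprocessing used for this post.

```python
from collections import defaultdict

def rows_to_tsv_lines(rows):
    """Group (order_id, product_id, add_to_cart_order) rows into TSV lines.

    Each output line is "<order_id><TAB><product IDs separated by spaces>",
    with products sorted by the position they were added to the cart.
    """
    carts = defaultdict(list)
    for order_id, product_id, position in rows:
        carts[order_id].append((int(position), str(product_id)))
    lines = []
    for order_id in sorted(carts):
        # Sorting the (position, product_id) pairs restores the cart order
        products = [pid for _, pid in sorted(carts[order_id])]
        lines.append('{0}\t{1}'.format(order_id, ' '.join(products)))
    return lines

# Hypothetical sample rows, deliberately out of cart order
sample_rows = [
    (2, 33120, 1),
    (2, 9327, 3),
    (2, 28985, 2),
    (3, 33754, 1),
]
for line in rows_to_tsv_lines(sample_rows):
    print(line)
```

The same grouping idea would apply to departments.tsv and aisles.tsv, keyed by department or aisle ID instead of order ID.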
To read these files into our system, we’ll be using the gensim library. Gensim has a great implementation of the doc2vec algorithm, which we’ll be using to measure the similarity between products. Additionally, it has a well-integrated way of keeping track of tagged documents.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
We’ll create a function that reads these files into a list of Gensim’s TaggedDocument objects. It also takes a doc_type argument that allows us to set contextual document IDs.
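def get_documents(file_path, doc_type='order'):
    # Each line is "<doc_id><TAB><space-separated product IDs>"
    documents = []
    with open(file_path) as f:
        for line in f:
            doc_id, cart = line.strip().split('\t')
            products = cart.split()
            # Tag each document as e.g. "order_42" so IDs stay unique across file types
            documents.append(TaggedDocument(products, ['{0}_{1}'.format(doc_type, doc_id)]))
    return documents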
We also created a product_ids.tsv file to easily convert between a product ID and the corresponding description.
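def decode_product_id(product_id):
    # Build the ID -> name lookup from product_ids.tsv, then return the requested name
    product_ids = {}
    with open('product_ids.tsv') as f:
        for line in f:
            # Use a separate name here so the product_id argument is not overwritten
            pid, product_name = line.strip().split('\t')
            product_ids[pid] = product_name
    return product_ids[product_id]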
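Note that decode_product_id re-reads product_ids.tsv on every call, which adds up when decoding many products. A minimal sketch of a cached variant follows. The names load_product_names and decode_product_id_cached, the path parameter, and the demo file are all hypothetical additions for illustration.

```python
import functools
import os
import tempfile

@functools.lru_cache(maxsize=None)
def load_product_names(path):
    """Read a "<product_id><TAB><product_name>" file once and cache the dict."""
    names = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            pid, name = line.split('\t', 1)
            names[pid] = name
    return names

def decode_product_id_cached(product_id, path='product_ids.tsv'):
    # The file is parsed only on the first call for a given path
    return load_product_names(path)[str(product_id)]

# Tiny demo file with hypothetical IDs (names borrowed from the examples in this post)
demo_path = os.path.join(tempfile.mkdtemp(), 'product_ids.tsv')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('1\tChicken Tenders\n2\tOld Fashioned Rolled Oats\n')

print(decode_product_id_cached('2', demo_path))
```

The trade-off is that the whole lookup table stays in memory, which should be acceptable for a product catalog of this size.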
Training our model
Now that we have gathered our data, we can begin training our recommendation system model.
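We’ll start by initializing the model. I used a vector size of 300 and a window size of 6563. The window size is set to 6563 because that is the number of products in the largest department, so every product in a document can serve as context for every other. I preserved the add_to_cart_order in my orders.tsv file, so reducing this number, which would focus on products added to the cart close together, may be beneficial.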
model = Doc2Vec(vector_size=300, window=6563, min_count=1, workers=16, epochs=1)
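If you want to derive the window size from your own data rather than hard-coding it, a minimal sketch is shown here. It assumes the tab-separated layout described earlier. The helper name largest_cart_size and the sample lines are hypothetical.

```python
def largest_cart_size(lines):
    """Return the largest number of products on any "<id><TAB><products>" line."""
    largest = 0
    for line in lines:
        # partition tolerates lines that have an ID but no product list
        _, _, cart = line.strip().partition('\t')
        largest = max(largest, len(cart.split()))
    return largest

# Hypothetical sample in the same layout as departments.tsv
sample_lines = [
    '1\t101 102 103',
    '2\t201 202 203 204 205',
    '3\t301',
]
print(largest_cart_size(sample_lines))
```

With a real file, you could pass an open file handle instead of a list, since the function only needs to iterate over lines.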
Next, I created a training method that allows us to retrain the model with different TaggedDocuments.
def train_model(model, documents, model_name, epochs=1):
    print('Training...')
    for epoch in range(epochs):
        print('\tEpoch {0}'.format(epoch))
        # One pass over the documents per loop iteration
        model.train(documents, total_examples=len(documents), epochs=1)
        print('\t\tSaving...')
        # Checkpoint after every epoch (the ./checkpoints folder must already exist)
        fname = './checkpoints/{0}-{1}.w2v'.format(model_name, epoch)
        model.save(fname)
    return model
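Because train_model writes a checkpoint after every epoch, saving will fail if the ./checkpoints folder is missing. A small sketch of a helper that builds the same file name and creates the folder first is shown here. The name checkpoint_path and its root parameter are hypothetical.

```python
import os
import tempfile

def checkpoint_path(model_name, epoch, root='./checkpoints'):
    """Build a "<model_name>-<epoch>.w2v" path, creating the folder if needed."""
    os.makedirs(root, exist_ok=True)
    return os.path.join(root, '{0}-{1}.w2v'.format(model_name, epoch))

# Demo in a temporary folder so nothing is written to the working directory
demo_root = os.path.join(tempfile.mkdtemp(), 'checkpoints')
path = checkpoint_path('department', 0, root=demo_root)
print(os.path.basename(path))
```

The returned path could then be passed to model.save() in place of the hand-built fname.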
I found that light pre-training on departments and aisles, followed by full training on orders, led to better results.
print('Pre-training on departments...')
documents = get_documents('departments.tsv','department')
print('\tBuilding vocab...')
model.build_vocab(documents)
model = train_model(model,documents,'department',8)
print('Pre-training on aisles...')
documents = get_documents('aisles.tsv','aisle')
model = train_model(model,documents,'aisle',11)
print('Training on orders...')
documents = get_documents('orders.tsv','order')
model = train_model(model,documents,'order',50)
Normally the vocab would be built after gathering all the documents, so that every TaggedDocument ID (department, aisle, and order) can be queried afterwards. Due to memory constraints, we built it from the department documents only.
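One possible way around the memory constraint is to stream documents from disk instead of holding them in a list, since Gensim can work with iterables that are re-read on each pass. The sketch below is a hypothetical illustration and untested on the full dataset. TaggedDoc is a stand-in namedtuple with the same fields as Gensim’s TaggedDocument (words, tags) so the example runs without Gensim, and TaggedLineCorpus is a made-up class name.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (words, tags)
TaggedDoc = namedtuple('TaggedDoc', 'words tags')

class TaggedLineCorpus:
    """Restartable iterable over "<id><TAB><products>" lines from several files."""

    def __init__(self, sources):
        # sources: list of (file_path, doc_type) pairs
        self.sources = sources

    def __iter__(self):
        # Files are reopened on every pass, so only one line is in memory at a time
        for file_path, doc_type in self.sources:
            with open(file_path) as f:
                for line in f:
                    doc_id, _, cart = line.strip().partition('\t')
                    if not cart:
                        continue
                    yield TaggedDoc(cart.split(), ['{0}_{1}'.format(doc_type, doc_id)])

# Demo with two tiny hypothetical files
tmp = tempfile.mkdtemp()
paths = {}
for name, text in [('aisle', '7\t11 12\n'), ('order', '42\t12 11 13\n')]:
    paths[name] = os.path.join(tmp, name + 's.tsv')
    with open(paths[name], 'w') as f:
        f.write(text)

corpus = TaggedLineCorpus([(paths['aisle'], 'aisle'), (paths['order'], 'order')])
for doc in corpus:
    print(doc.tags, doc.words)
```

In principle, an object like this (yielding real TaggedDocument instances) could be handed to build_vocab so that all tags are registered up front. Note that train_model calls len(documents), so the document count would need to be computed separately.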
Product recommendation
Now that our model is trained, we can use it to recommend products.
Since each product maps to a latent vector, we can query the nearest neighbors for any product. Because of the pre-training on departments and aisles, products that are close together in vector space also tend to be close together in the store layout.
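# product_id is the ID (as a string) of the product we want recommendations for
print('\nTop 10 closest to {0}:'.format(decode_product_id(product_id)))
# model.wv holds the product (word) vectors; most_similar ranks them by cosine similarity
for neighbor_id, similarity in model.wv.most_similar(product_id, topn=10):
    product_description = decode_product_id(neighbor_id)
    print('{0:0.2f}\t{1}'.format(similarity, product_description))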
Chicken Tenders
0.34 | Boneless Skinless Chicken Breast Fillets |
0.28 | The Ultimate Fish Stick |
0.28 | Freshly Shredded Parmesan Cheese |
0.28 | Smoked Turkey Bacon |
0.28 | Pork Loin Tenderloin |
0.27 | Mini-Snacks Organic Raisins |
0.27 | Shredded Iceberg Lettuce |
0.27 | Fresh Cut Cauliflower & Broccoli Steam in Bag |
0.27 | Low Salt Ham |
0.25 | Sourdough Bachelor Loaf |
Hand Picked Pomegranate Seeds/Arils
0.48 | Organic Baby Spinach Salad |
0.47 | Mini Seedless Cucumbers |
0.44 | Butternut Squash Noodles |
0.42 | Potted Basil |
0.41 | Organic Apple Rings |
0.41 | Herbs Chives |
0.40 | Organic Mint Bunch |
0.40 | Organics Spinach |
0.39 | Organic Cilantro Bunch |
0.38 | Shishito Peppers |
Gingerbread Spice Herbal Tea
0.33 | Scarlet Citrus Rooibos Herbal Tea Filterbags |
0.29 | Lemon Zinger Herbal Tea |
0.29 | Cinnamon Apple Spice Herb Tea |
0.27 | Lemon Ginger Tea Bags |
0.27 | Caffeine Free Jammin’ Lemon Ginger Herbal Tea |
0.26 | Ginger Peach Decaf Longevity Tea Bags |
0.26 | Caffeine Free Peppermint Herbal Tea |
0.26 | Caramel Apple Dream Herbal Tea |
0.26 | Earl Greyer Black Tea Bags |
0.26 | Fanta Zero Sugar Free Orange Soda Fridge Pack |
Old Fashioned Rolled Oats
0.37 | Quick 1 Minute Oatmeal |
0.30 | Quick 1 Minute Whole Grain Oats |
0.28 | Old Fashioned Oats |
0.28 | Golden Raisins |
0.27 | Whole Ground Flaxseed Meal |
0.27 | Whole Grain Oat Cereal |
0.27 | Natural California Raisins |
0.27 | Slivered Almonds |
0.27 | Oat Bran |
0.27 | Quaker Life Cinnamon Cereal |
White Chocolate Baking Chips
0.37 | Organic Powdered Sugar |
0.36 | Organic Vanilla Extract |
0.32 | Cane Sugar |
0.31 | Organic Ranch Dressing |
0.30 | Organic Baking Cocoa |
0.30 | Unsalted Butter |
0.30 | Semi-Sweet Chocolate Baking Bars |
0.29 | Organic Heavy Whipping Cream |
0.29 | Natural Food Color Packets |
0.29 | Light Brown Sugar |
I’m not sure why the similarity is closer to 0 than to 1 on these. Using model.most_similar_cosmul() may provide better results.
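For context, the scores returned by most_similar are cosine similarities, which range from -1 to 1, so a score around 0.3 indicates a weak but positive alignment between two vectors. A minimal pure-Python sketch of that ranking, using made-up three-dimensional vectors rather than anything from the trained model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, topn=2):
    """Rank every other item by cosine similarity to the query item."""
    scores = [(name, cosine_similarity(vectors[query], vec))
              for name, vec in vectors.items() if name != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors; the product vectors in this post have 300 dimensions
toy_vectors = {
    'oats': [1.0, 0.0, 0.0],
    'raisins': [0.8, 0.6, 0.0],
    'toothpaste': [0.0, 0.0, 1.0],
}
for name, score in nearest('oats', toy_vectors):
    print('{0:0.2f}\t{1}'.format(score, name))
```

The ordering of the neighbors matters more for recommendations than the absolute scores, which may be why the lists above look sensible despite the low values.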
Order similarity
One of the advantages of using the doc2vec algorithm is that it allows us to not only query the nearest words, but also the nearest documents. This means that we can also query the similarity between orders.
order_1 = ['24697','27787','29165','36381']
print('Order 1:')
for product_id in order_1:
    print(decode_product_id(product_id))
order_2 = ['976','2307','2617','4588']
print('\nOrder 2:')
for product_id in order_2:
    print(decode_product_id(product_id))
# Infers a vector for each product list, then compares the two by cosine similarity
# (gensim 3.x call; in gensim 4.x use model.similarity_unseen_docs(order_1, order_2))
similarity = model.docvecs.similarity_unseen_docs(model,order_1,order_2)
print('\nSimilarity:\t{0:0.2f}'.format(similarity))
First example:
Order 1:
Ultrabrite All-in-One Advanced Whitening Clean Mint Toothpaste
Complete Deep Clean Soft Toothbrush
Original Antiseptic Adult Mouthwash
Cool Mint Floss
Order 2:
Cavity Protection Regular Flavor Toothpaste
Pro-Health All-In-One Medium Toothbrushes
Total Care Mouthwash Icy Clean Mint
Oral B Glide Cool Mint Deep Clean Floss
Similarity: 0.03
Now let’s try something less similar:
Order 1:
Ultrabrite All-in-One Advanced Whitening Clean Mint Toothpaste
Complete Deep Clean Soft Toothbrush
Original Antiseptic Adult Mouthwash
Cool Mint Floss
Order 2:
Light Tuna Chunk In Water
Organic Capellini Whole Wheat Pasta
Rice Sides Rice Medley
Angus Beef Smoked Sausage
Similarity: -0.11
Conclusion
Building a product recommendation system is a great way to increase sales and stand out from the competition. By tracking which items are purchased together, you can learn a vector for each product and recommend its nearest neighbors. With the right data, you can quickly create effective recommendations based on the similarity between products. We hope you found this tutorial useful. Thank you for reading!