Building a recommendation system using product embeddings

For competitive, low-margin businesses, a store’s layout can be the difference between surviving and getting wiped out. To drive more sales, businesses are using recommendation systems in online stores and data-driven nudging at brick-and-mortar locations.

How can purchasing data be turned into sales? We can borrow the word2vec algorithm from natural language processing essentially unaltered: treat each product as a word and each shopping cart as a document, and the model learns which products are similar in the context of being purchased together.

Gathering the data

First, download the Instacart Market Basket Analysis dataset and create three derivative files: departments.tsv, aisles.tsv, and orders.tsv. Each line starts with the corresponding department, aisle, or order ID, followed by a tab and then a space-delimited list of the associated product IDs.
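The exact derivation is up to you; here is a minimal sketch for orders.tsv using pandas, assuming the standard Instacart file layout (order_products__prior.csv with order_id, product_id, and add_to_cart_order columns):

import pandas as pd

# Each row of order_products__prior.csv is one product in one cart.
order_products = pd.read_csv('order_products__prior.csv')

# Keep the add_to_cart_order sequence, then collapse each order into a
# space-delimited string of product IDs.
order_products = order_products.sort_values(['order_id', 'add_to_cart_order'])
carts = order_products.groupby('order_id')['product_id'].apply(
    lambda ids: ' '.join(str(i) for i in ids))

with open('orders.tsv', 'w') as out:
    for order_id, cart in carts.items():
        out.write('{0}\t{1}\n'.format(order_id, cart))

departments.tsv and aisles.tsv follow the same pattern, grouping products.csv by department_id and aisle_id instead of order_id.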

The gensim library has a great implementation of doc2vec, and a well-integrated way of keeping track of tagged documents.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

Let’s create a function that reads these files into lists of TaggedDocument objects for the Doc2Vec class to consume. I added a doc_type argument so the document tags stay unique across the three file types.

def get_documents(file_path, doc_type='order'):
    documents = []
    for line in open(file_path):
        # Each line is "<id>\t<space-delimited product IDs>".
        doc_id, cart = line.strip().split('\t')
        products = cart.split()
        # Tag the document as e.g. "order_123" so IDs stay unique per type.
        documents.append(TaggedDocument(products, ['{0}_{1}'.format(doc_type, doc_id)]))
    return documents

I also created a product_ids.tsv file to easily convert between a product ID and the corresponding description.

def decode_product_id(product_id):
    # Build the ID -> name lookup once and cache it on the function,
    # instead of re-reading the file (and shadowing product_id) on every call.
    if not hasattr(decode_product_id, 'lookup'):
        decode_product_id.lookup = {}
        for line in open('product_ids.tsv'):
            pid, product_name = line.strip().split('\t')
            decode_product_id.lookup[pid] = product_name
    return decode_product_id.lookup[product_id]
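For completeness, product_ids.tsv can be derived from the dataset’s products.csv along these lines (a sketch, assuming the standard product_id and product_name columns):

import csv

# products.csv quotes product names containing commas, so use csv.DictReader.
with open('products.csv') as src, open('product_ids.tsv', 'w') as dst:
    for row in csv.DictReader(src):
        dst.write('{0}\t{1}\n'.format(row['product_id'], row['product_name']))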

Training our model

Now let’s initialize our model. I used a vector size of 300, but you can set this to whatever you want. The window size is 6563, the number of products in the largest department, so that during pre-training every product in a department document falls within every other product’s context. Since I preserved the add_to_cart_order sequence in my orders.tsv file, a smaller window (nearby items in the cart) may actually work better.

model = Doc2Vec(vector_size=300, window=6563, min_count=1, workers=16, epochs=1)

Next, I created a training function so the model could be retrained multiple times with different TaggedDocuments.

def train_model(model, documents, model_name, epochs=1):
    print('Training...')
    for epoch in range(epochs):
        print('\tEpoch {0}'.format(epoch))
        # One pass over the documents per outer loop iteration.
        model.train(documents, total_examples=len(documents), epochs=1)
        # Save a checkpoint after every epoch (assumes ./checkpoints exists).
        print('\t\tSaving...')
        fname = './checkpoints/{0}-{1}.w2v'.format(model_name, epoch)
        model.save(fname)
    return model

I found that light pre-training on departments and aisles, followed by full training on orders, led to better results.

print('Pre-training on departments...')
documents = get_documents('departments.tsv', 'department')

print('\tBuilding vocab...')
model.build_vocab(documents)

model = train_model(model, documents, 'department', 8)

print('Pre-training on aisles...')
documents = get_documents('aisles.tsv', 'aisle')
model = train_model(model, documents, 'aisle', 11)

print('Training on orders...')
documents = get_documents('orders.tsv', 'order')
model = train_model(model, documents, 'order', 50)

Normally you would build the vocabulary after gathering all of the documents, so that every TaggedDocument tag can be queried later, but I did not have enough RAM 🤷
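If memory were not an issue, that would look something like this (a sketch of the approach I skipped, not what I actually ran):

all_documents = (get_documents('departments.tsv', 'department')
                 + get_documents('aisles.tsv', 'aisle')
                 + get_documents('orders.tsv', 'order'))
model.build_vocab(all_documents)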

Product recommendation

Since each product maps to a latent vector, we can query the nearest neighbors for any product. Eventually, we want products close together in vector space to be close together in the store layout.

# product_id is the ID of the product we want neighbors for
print('\nTop 10 closest to {0}:'.format(decode_product_id(product_id)))
for similar_id, similarity in model.most_similar(product_id, topn=10):
    product_description = decode_product_id(similar_id)
    print('{0:0.2f}\t{1}'.format(similarity, product_description))

Chicken Tenders

0.34 Boneless Skinless Chicken Breast Fillets
0.28 The Ultimate Fish Stick
0.28 Freshly Shredded Parmesan Cheese
0.28 Smoked Turkey Bacon
0.28 Pork Loin Tenderloin
0.27 Mini-Snacks Organic Raisins
0.27 Shredded Iceberg Lettuce
0.27 Fresh Cut Cauliflower & Broccoli Steam in Bag
0.27 Low Salt Ham
0.25 Sourdough Bachelor Loaf

Hand Picked Pomegranate Seeds/Arils

0.48 Organic Baby Spinach Salad
0.47 Mini Seedless Cucumbers
0.44 Butternut Squash Noodles
0.42 Potted Basil
0.41 Organic Apple Rings
0.41 Herbs Chives
0.40 Organic Mint Bunch
0.40 Organics Spinach
0.39 Organic Cilantro Bunch
0.38 Shishito Peppers

Gingerbread Spice Herbal Tea

0.33 Scarlet Citrus Rooibos Herbal Tea Filterbags
0.29 Lemon Zinger Herbal Tea
0.29 Cinnamon Apple Spice Herb Tea
0.27 Lemon Ginger Tea Bags
0.27 Caffeine Free Jammin’ Lemon Ginger Herbal Tea
0.26 Ginger Peach Decaf Longevity Tea Bags
0.26 Caffeine Free Peppermint Herbal Tea
0.26 Caramel Apple Dream Herbal Tea
0.26 Earl Greyer Black Tea Bags
0.26 Fanta Zero Sugar Free Orange Soda Fridge Pack

Old Fashioned Rolled Oats

0.37 Quick 1 Minute Oatmeal
0.30 Quick 1 Minute Whole Grain Oats
0.28 Old Fashioned Oats
0.28 Golden Raisins
0.27 Whole Ground Flaxseed Meal
0.27 Whole Grain Oat Cereal
0.27 Natural California Raisins
0.27 Slivered Almonds
0.27 Oat Bran
0.27 Quaker Life Cinnamon Cereal

White Chocolate Baking Chips

0.37 Organic Powdered Sugar
0.36 Organic Vanilla Extract
0.32 Cane Sugar
0.31 Organic Ranch Dressing
0.30 Organic Baking Cocoa
0.30 Unsalted Butter
0.30 Semi-Sweet Chocolate Baking Bars
0.29 Organic Heavy Whipping Cream
0.29 Natural Food Color Packets
0.29 Light Brown Sugar

I’m not sure why the similarity scores sit closer to 0 than to 1 here. Using model.most_similar_cosmul() may work better.
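For reference, that would be a one-line swap in the query loop above (cosmul combines similarities multiplicatively instead of as plain cosine sums):

for similar_id, similarity in model.most_similar_cosmul(product_id, topn=10):
    print('{0:0.2f}\t{1}'.format(similarity, decode_product_id(similar_id)))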

Order similarity

The cool thing about doc2vec is that you can query not only the nearest words, but also the nearest documents. In our case, that means we can compare whole orders to each other.

order_1 = ['24697', '27787', '29165', '36381']
print('Order 1:')
for product_id in order_1:
    print(decode_product_id(product_id))

order_2 = ['976', '2307', '2617', '4588']
print('\nOrder 2:')
for product_id in order_2:
    print(decode_product_id(product_id))

# Infer a vector for each unseen cart and compare them with cosine similarity.
similarity = model.docvecs.similarity_unseen_docs(model, order_1, order_2)
print('\nSimilarity:\t{0:0.2f}'.format(similarity))

First example:

Order 1:
Ultrabrite All-in-One Advanced Whitening Clean Mint Toothpaste
Complete Deep Clean Soft Toothbrush
Original Antiseptic Adult Mouthwash
Cool Mint Floss

Order 2:
Cavity Protection Regular Flavor Toothpaste
Pro-Health All-In-One Medium Toothbrushes
Total Care Mouthwash Icy Clean Mint
Oral B Glide Cool Mint Deep Clean Floss

Similarity:	0.03

Now let’s try something less similar:

Order 1:
Ultrabrite All-in-One Advanced Whitening Clean Mint Toothpaste
Complete Deep Clean Soft Toothbrush
Original Antiseptic Adult Mouthwash
Cool Mint Floss

Order 2:
Light Tuna Chunk In Water
Organic Capellini Whole Wheat Pasta
Rice Sides Rice Medley
Angus Beef Smoked Sausage

Similarity:	-0.11
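Since the model also keeps a vector for every tag it saw during build_vocab (only the department tags, in this setup), you can infer a vector for a new cart and ask which tagged documents it lands nearest to. A sketch:

# Infer a vector for a new cart and find the nearest tagged documents.
# With vocab built only on departments.tsv, the hits will be department tags.
new_cart = ['24697', '27787', '29165', '36381']
cart_vector = model.infer_vector(new_cart)
for tag, similarity in model.docvecs.most_similar([cart_vector], topn=5):
    print('{0:0.2f}\t{1}'.format(similarity, tag))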

About the author



Hi, I'm Nathan. I'm an electrical engineer in the Los Angeles area. Keep an eye out for more content being posted soon.

