Make better choices with AI-assisted A/B testing

In this post, I will show how to use A/B testing to rank thousands of qualitative items, with large language models (LLMs) automating the comparisons.

The ELO algorithm


The ELO algorithm, originally designed for ranking chess players, is a well-known method for determining rankings through a series of head-to-head matches.

The algorithm assigns ratings based on performance and adjusts them after each match depending on the outcome and the opponents’ ratings.

We will use the ELO algorithm to score the items we want to rank.
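As a minimal sketch of a single rating update, assuming the same constants as the code at the bottom of this post (a K-factor of 32 and a scale of 400): two items that start level at 1000 move 16 points apart in each direction after one match. The function name `elo_update` is introduced here for illustration only.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(winner_rating, loser_rating, k=32):
    """Return the new (winner, loser) ratings after one match."""
    expected_winner = expected_score(winner_rating, loser_rating)
    expected_loser = expected_score(loser_rating, winner_rating)
    new_winner = winner_rating + k * (1 - expected_winner)
    new_loser = loser_rating + k * (0 - expected_loser)
    return new_winner, new_loser


# Two evenly rated items: each has an expected score of 0.5.
print(elo_update(1000, 1000))  # (1016.0, 984.0)
```

An upset moves the ratings more: a 1000-rated item that beats a 1400-rated one gains about 29 points, because its expected score was only about 0.09.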

Ranking colors using AI

My first attempt was to rank these 11 basic colors: “red”, “orange”, “yellow”, “green”, “blue”, “purple”, “pink”, “brown”, “white”, “gray”, and “black”.

Here is an example of a matchup between two colors, as written by an LLM. In my prompt, I asked it to weigh the pros and cons of each color before coming to a decision, so that the final answer was more thoughtful.


After performing a match for every color combination, the final scores are as follows:

The code is at the bottom of this post!
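The matchup schedule in that code is two nested loops over the same list, so each pair of colors appears to meet twice, once in each presentation order. A roughly equivalent sketch using `itertools.permutations` (a substitution for the nested loops, not the original code) shows how many matches that is:

```python
from itertools import permutations

colors = ["red", "orange", "yellow", "green", "blue", "purple",
          "pink", "brown", "white", "gray", "black"]

# Ordered pairs: ("red", "blue") and ("blue", "red") are separate matches,
# so each pair of colors is judged once in each presentation order.
matchups = list(permutations(colors, 2))

print(len(matchups))  # 110
```

Playing both orders is one way to dampen any bias the model might have toward the first or second option in the prompt.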

Ranking stocks using AI

In order to make stocks ingestible by an LLM, the first step is to fetch a description of each stock.

I compiled descriptions from 5 different sources from the internet, and used AI to summarize all of the sources into a single paragraph about each stock.

Here is an example description:

I have 15,509 descriptions in total:

Matching stocks together

First, I created a pairing score for each stock. It is used only to pair stocks together and is never updated once set.

To get this score, I ran 10 million rounds of the ELO algorithm, where the stock with the higher market cap was declared the winner.
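A minimal sketch of that seeding step follows. The tickers and market caps are hypothetical, the round count is reduced from 10 million so the sketch runs quickly, and the K-factor of 32 is an assumption carried over from the color code (the post does not state the value used for stocks).

```python
import random


def expected_score(rating_a, rating_b):
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def pairing_scores(market_caps, rounds, k=32, initial=1000, seed=0):
    """Run Elo rounds where the larger market cap always wins."""
    rng = random.Random(seed)
    tickers = list(market_caps)
    scores = {t: float(initial) for t in tickers}
    for _ in range(rounds):
        a, b = rng.sample(tickers, 2)
        if market_caps[a] == market_caps[b]:
            continue  # no winner to declare; skip this round
        winner, loser = (a, b) if market_caps[a] > market_caps[b] else (b, a)
        # The winner's gain equals the loser's loss, so the total is conserved.
        gain = k * (1 - expected_score(scores[winner], scores[loser]))
        scores[winner] += gain
        scores[loser] -= gain
    return scores


# Hypothetical tickers and market caps (in billions), for illustration only.
caps = {"AAA": 500.0, "BBB": 120.0, "CCC": 40.0, "DDD": 3.0}
scores = pairing_scores(caps, rounds=2000)
print(sorted(scores, key=scores.get, reverse=True))
```

Because the outcome depends only on market cap, the resulting score works as a smoothed stand-in for company size, which is what makes it usable for matching comparable companies.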

For pairing, I used random.sample to select two random stocks from the list of all stocks. I only continued with the round if their pairing scores indicated that it was a close match.
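The pairing step could look something like the sketch below. The post does not say how "close" was defined, so the `max_gap` threshold on the absolute score difference is an assumption, as are the function name and the sample scores.

```python
import random


def pick_close_pair(pairing_scores, max_gap, rng=random, max_tries=10_000):
    """Sample random pairs until two items have similar pairing scores."""
    names = list(pairing_scores)
    for _ in range(max_tries):
        a, b = rng.sample(names, 2)
        if abs(pairing_scores[a] - pairing_scores[b]) <= max_gap:
            return a, b
    return None  # no close pair found within max_tries


# Hypothetical pairing scores, for illustration only.
scores = {"AAA": 1400.0, "BBB": 1380.0, "CCC": 1010.0, "DDD": 990.0}
print(pick_close_pair(scores, max_gap=50))
```

Rejection sampling like this is simple, but it wastes draws when close pairs are rare; sorting by pairing score and sampling neighbors would be an alternative.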


The prompt I used for the A/B testing task is below.

Match results

The 10,000 matches produced a 34 MB results file. This includes precise reasoning as to why each winner was selected!

The file can be repackaged into a financial report that is over 20,000 pages long.

Final ranking

I updated the initial scores with the result of each match, and then computed each company's change in rank relative to its initial rank. The companies shown have a market cap of at least $10 billion.
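The rank-change calculation can be sketched as follows. The scores are hypothetical and the helper names are introduced here for illustration; a positive value means the company moved up the ranking.

```python
def rank_positions(scores):
    """Map each name to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_changes(initial_scores, final_scores):
    """Positive values mean the item moved up the ranking."""
    before = rank_positions(initial_scores)
    after = rank_positions(final_scores)
    return {name: before[name] - after[name] for name in initial_scores}


# Hypothetical scores, for illustration only.
initial = {"AAA": 1400.0, "BBB": 1300.0, "CCC": 1200.0}
final = {"AAA": 1390.0, "BBB": 1250.0, "CCC": 1320.0}
print(rank_changes(initial, final))  # {'AAA': 0, 'BBB': -1, 'CCC': 1}
```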




These 10,000 matches cost $150 (plus $10 in development costs), meaning that each match cost about 1.5 cents.

Code for ranking colors

import openai
import os
import random
import json

openai.api_key = "..."

items = ["black", "blue", "brown", "gray", "green", "orange", "pink", "purple", "red", "white", "yellow"]

# The post does not name the model; set this to the chat model you use.
MODEL = "..."

# Only query the LLM if there are no saved results yet.
if not os.path.isfile("matches.json"):
    matches = {}
    match_index = 0
    for item_1 in items:
        for item_2 in items:
            # A color never plays against itself.
            if item_1 == item_2:
                continue
            print(f"{item_1} vs {item_2}...")
            # Retry until the model returns parseable JSON with a valid choice.
            while True:
                try:
                    prompt = f'You are a human who has been asked to choose between the color {item_1} and the color {item_2}.\n\nRespond with your choice in JSON format, providing the strengths and weaknesses of each choice (without yet choosing one!), followed by a brief explanation for your choice, and then the choice on its own without an explanation.\n\nFormat your response as follows:\n\n```json\n{{\n  "{item_1}_pros": "...",\n  "{item_1}_cons": "...",\n  "{item_2}_pros": "...",\n  "{item_2}_cons": "...",\n  "choice_explanation": "...",\n  "choice": "..."\n}}\n```'
                    completion = openai.ChatCompletion.create(
                        model=MODEL,
                        messages=[{"role": "system", "content": "You are a helpful assistant."},
                                  {"role": "user", "content": prompt}]
                    )
                    # Legacy (pre-1.0) openai SDK: the reply text is in the first choice.
                    response = completion["choices"][0]["message"]["content"]
                    # Keep only the JSON object, dropping any text around it.
                    start_index = response.find("{")
                    end_index = response.rfind("}") + 1
                    match_data = json.loads(response[start_index:end_index])
                    if match_data["choice"] in items:
                        print(f"\tWinner: {match_data['choice']}!")
                        matches[f"{match_index}_{item_1},{item_2}"] = match_data
                        match_index += 1
                        break
                except KeyboardInterrupt:
                    raise
                except Exception as e:
                    print(f"\tError, retrying: {e}")
    with open("matches.json", "w+") as f:
        json.dump(matches, f, indent=2)

with open("matches.json", "r") as f:
    match_results = json.load(f)

# Initialize ELO ratings
initial_rating = 1000
ratings = {item: initial_rating for item in items}

# ELO rating algorithm
def elo_rating(winner, loser, ratings, k=32):
    def expected_score(rating_a, rating_b):
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    expected_winner = expected_score(ratings[winner], ratings[loser])
    expected_loser = expected_score(ratings[loser], ratings[winner])

    ratings[winner] += k * (1 - expected_winner)
    ratings[loser] += k * (0 - expected_loser)

# Update ELO ratings based on match results
for match, match_data in match_results.items():
    _, items_pair = match.split("_")
    item1, item2 = items_pair.split(",")
    winner = match_data["choice"]
    if winner == item1:
        elo_rating(item1, item2, ratings)
    elif winner == item2:
        elo_rating(item2, item1, ratings)
    else:
        print(f"Invalid match result: {match} -> {winner}")

# Print final ELO ratings
for item, rating in sorted(ratings.items(), key=lambda item: item[1], reverse=True):
    print(f"{item}: {rating:.2f}")

About the author

Hi, I'm Nathan. Thanks for reading! Keep an eye out for more content being posted soon.
