Yi Zhu

Chatbot Arena and the Elo rating system - Part 2

2024-08-16T00:00:00-07:00

In our previous blog post, we introduced the basics of the Elo rating system and its online linear update algorithm, hereafter referred to as “online Elo”. However, we identified a significant concern with online Elo: it is unstable and biased toward recent results. For example, a demonstration from Chatbot Arena showed substantial shifts in model ratings when Elo was recalculated using the reverse order of matches.

Online Elo ranking is not stable. Image source: notebook from Chatbot Arena

This instability is, of course, undesirable. While it’s true that players can learn from feedback and improve their skills (which means their ratings change), the assumption is that such improvements occur slowly. Therefore, in the short term, their ranks should stay relatively stable. In other words, however we compute the scores, we hope to get the same ranking regardless of the order of matches. So, how can we generate a stable and unique Elo ranking?

Why does this happen?

We touched on this issue briefly in our previous post. The online Elo algorithm adjusts a player’s rating incrementally after each match, considering the match outcome and the current ratings of the opponents. Each update subsequently influences future ratings. Therefore, the sequence in which matches are processed significantly impacts the final ratings. When we reverse the order of matches, the matches that were originally processed early are now processed late, starting from different intermediate ratings, which can lead to significantly altered final ratings.
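To see the order dependence on a small scale, here is a minimal sketch using a stripped-down online Elo update. The two-player match list is made-up toy data, and the helper name `online_elo` is only for illustration.

```python
from collections import defaultdict


def online_elo(matches, k=4, scale=400, base=10, init=1000):
    """Sequential Elo update over (model_a, model_b, score_a) tuples."""
    rating = defaultdict(lambda: init)
    for a, b, score_a in matches:
        ra, rb = rating[a], rating[b]
        expected_a = 1 / (1 + base ** ((rb - ra) / scale))
        # Each update starts from the ratings left behind by earlier matches.
        rating[a] = ra + k * (score_a - expected_a)
        rating[b] = rb + k * ((1 - score_a) - (1 - expected_a))
    return dict(rating)


# Toy data: "x" wins its first 20 games against "y", then loses the next 10.
matches = [("x", "y", 1.0)] * 20 + [("x", "y", 0.0)] * 10

forward = online_elo(matches)
backward = online_elo(list(reversed(matches)))

print(forward)   # losses are the most recent results
print(backward)  # wins are the most recent results
```

Both runs see exactly the same 30 results, yet "x" ends up with a higher rating when its wins are processed last.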

So if the problem lies in the sequential update, why not update the ratings all at once?

Maximum Likelihood Estimation with Bradley-Terry model

Basics

If we think about SGD again, we know that a good practice for stable optimization is to use mini-batches instead of feeding a single data sample during each model update. In the context of LLM evaluation, we often assume that the model’s capability remains static over the period being analyzed. This assumption allows us to go one step further and use Maximum Likelihood Estimation (MLE) to fit the ratings globally on all matches at once.

But before we dive deeper, let’s clarify some fundamental concepts.

MLE might be familiar: it is a statistical method used to estimate the parameters of a model. In simpler terms, it identifies the set of parameters under which the observed data is most probable.
The Bradley-Terry model is a probability model used for predicting the outcomes of pairwise comparisons. In rating systems, it’s used to estimate the relative strengths of players based on the outcomes of their matches against each other. You can find its formulas in this wiki; they are not hard to understand and are very similar to the online Elo formula we introduced in the previous post.

The connection between MLE and the Bradley-Terry model in the context of rating systems is quite direct. MLE is used to estimate the strength parameters of the Bradley-Terry model from the match results. By applying MLE, we find the ratings under which the Bradley-Terry model assigns the highest probability to the observed match outcomes. But how exactly is this done, and how is it different from online Elo?
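One way to make this concrete is to score a few candidate rating sets by the log-likelihood that MLE maximizes. The sketch below assumes the Bradley-Terry win probability is written on the Elo scale (base 10, scale 400), ignores ties, and uses made-up toy data.

```python
import math


def bt_win_prob(r_a, r_b, scale=400, base=10):
    """Bradley-Terry probability that A beats B, on the Elo rating scale."""
    return 1 / (1 + base ** ((r_b - r_a) / scale))


def log_likelihood(ratings, matches):
    """Sum of log-probabilities of the observed winners.

    `matches` is a list of (winner, loser) pairs; ties are ignored here.
    """
    return sum(
        math.log(bt_win_prob(ratings[winner], ratings[loser]))
        for winner, loser in matches
    )


# Toy data: "strong" beats "weak" in 8 of 10 games.
matches = [("strong", "weak")] * 8 + [("weak", "strong")] * 2

ll_equal = log_likelihood({"strong": 1000, "weak": 1000}, matches)
ll_apart = log_likelihood({"strong": 1120, "weak": 880}, matches)
ll_wrong = log_likelihood({"strong": 880, "weak": 1120}, matches)

print(round(ll_equal, 3), round(ll_apart, 3), round(ll_wrong, 3))
```

The rating set that puts "strong" about 240 points ahead explains the data best. MLE searches over all rating sets for the one with the highest log-likelihood instead of checking a handful of guesses.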

Dive deep into MLE Elo

Let’s look at a python implementation below, which is borrowed from this notebook from Chatbot Arena.

import math

import numpy as np
import pandas as pd


def compute_mle_elo(
    df, SCALE=400, BASE=10, INIT_RATING=1000, sample_weight=None
):
    from sklearn.linear_model import LogisticRegression
    ptbl_a_win = pd.pivot_table(
        df[df["winner"] == "model_a"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    # if no tie, create a zero matrix
    if sum(df["winner"].isin(["tie", "tie (bothbad)"])) == 0:
        ptbl_tie = pd.DataFrame(0, index=ptbl_a_win.index, columns=ptbl_a_win.columns)
    else:
        ptbl_tie = pd.pivot_table(
            df[df["winner"].isin(["tie", "tie (bothbad)"])],
            index="model_a",
            columns="model_b",
            aggfunc="size",
            fill_value=0,
        )
        ptbl_tie = ptbl_tie + ptbl_tie.T

    ptbl_b_win = pd.pivot_table(
        df[df["winner"] == "model_b"],
        index="model_a",
        columns="model_b",
        aggfunc="size",
        fill_value=0,
    )
    ptbl_win = ptbl_a_win * 2 + ptbl_b_win.T * 2 + ptbl_tie

    models = pd.Series(np.arange(len(ptbl_win.index)), index=ptbl_win.index)

    p = len(models)
    X = np.zeros([p * (p - 1) * 2, p])
    Y = np.zeros(p * (p - 1) * 2)

    cur_row = 0
    sample_weights = []
    for m_a in ptbl_win.index:
        for m_b in ptbl_win.columns:
            if m_a == m_b:
                continue
            # if nan skip
            if math.isnan(ptbl_win.loc[m_a, m_b]) or math.isnan(ptbl_win.loc[m_b, m_a]):
                continue
            X[cur_row, models[m_a]] = +math.log(BASE)
            X[cur_row, models[m_b]] = -math.log(BASE)
            Y[cur_row] = 1.0
            sample_weights.append(ptbl_win.loc[m_a, m_b])

            X[cur_row + 1, models[m_a]] = math.log(BASE)
            X[cur_row + 1, models[m_b]] = -math.log(BASE)
            Y[cur_row + 1] = 0.0
            sample_weights.append(ptbl_win.loc[m_b, m_a])
            cur_row += 2
    X = X[:cur_row]
    Y = Y[:cur_row]

    lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-6)
    lr.fit(X, Y, sample_weight=sample_weights)
    elo_scores = SCALE * lr.coef_[0] + INIT_RATING
    return pd.Series(elo_scores, index=models.index).sort_values(ascending=False)

This function has three stages. In the first stage, up to ptbl_win = ptbl_a_win * 2 + ptbl_b_win.T * 2 + ptbl_tie, we prepare count tables from the match outcomes. ptbl_win combines wins and ties into a single table whose entry (m_a, m_b) is the weighted number of wins of m_a over m_b. Wins are doubled because ties are folded in as half wins: ptbl_tie + ptbl_tie.T credits each tie once to both sides, so with every win counted twice, a tie is worth exactly half a win for each model.
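The weighting can be checked by hand on a tiny example. The sketch below reproduces the same counting rule with a plain `Counter` instead of pivot tables; the battle log is hypothetical and only reuses the `winner` labels from the function above.

```python
from collections import Counter

# Hypothetical battle log: (model_a, model_b, winner).
battles = [
    ("m1", "m2", "model_a"),
    ("m1", "m2", "model_a"),
    ("m1", "m2", "tie"),
    ("m2", "m1", "model_a"),
]

weighted_wins = Counter()
for model_a, model_b, winner in battles:
    if winner == "model_a":
        weighted_wins[(model_a, model_b)] += 2  # a full win counts twice
    elif winner == "model_b":
        weighted_wins[(model_b, model_a)] += 2
    else:
        # A tie counts once for each side, i.e. half of a win each.
        weighted_wins[(model_a, model_b)] += 1
        weighted_wins[(model_b, model_a)] += 1

print(weighted_wins[("m1", "m2")])  # 2 + 2 + 1 = 5
print(weighted_wins[("m2", "m1")])  # 2 + 1 = 3
```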

In the second stage, up to Y = Y[:cur_row], we set up the logistic regression inputs. Each ordered pair of models contributes two rows with the same features: X holds +log(BASE) in the column of m_a and -log(BASE) in the column of m_b, encoding who is competing against whom. Y is 1 for the first row (a win by m_a) and 0 for the second (a win by m_b). sample_weights collects the weight of each row from ptbl_win: the weighted number of wins of m_a over m_b for the Y = 1 row, and of m_b over m_a for the Y = 0 row.

In the last stage, we simply fit a logistic regression, without an intercept or regularization, to this data. The fitted lr.coef_ contains the estimated skill levels of the players. Because the features are ±log(BASE), a difference of one unit between two coefficients corresponds to BASE-to-1 odds. These coefficients are the result of the optimization process that maximizes the likelihood of the observed match outcomes given the skill levels of the players. SCALE * lr.coef_[0] + INIT_RATING then maps them onto the familiar Elo scale. The final output is a series of Elo scores indexed by model names, sorted in descending order to show the strongest model at the top.

Unlike the sequential update in the online Elo system, MLE with the Bradley-Terry model considers all match outcomes simultaneously. This global optimization removes the dependency on the order of matches and leads to more stable and consistent estimates of player strengths. The likelihood is a sum over matches, so if we reverse or shuffle the order of matches, the objective stays the same and the MLE Elo ratings do not change beyond the solver's numerical tolerance.
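The order invariance can be checked with a small stand-in for the fit. The sketch below uses plain batch gradient ascent on the Bradley-Terry log-likelihood rather than scikit-learn's LogisticRegression, ignores ties, and runs on made-up toy data, so treat it as an illustration of the idea and not as the notebook's implementation.

```python
import math
import random


def fit_bt(matches, models, steps=2000, lr=0.5):
    """Fit Bradley-Terry strengths (natural-log scale) by gradient ascent.

    `matches` is a list of (winner, loser) pairs.
    """
    theta = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for winner, loser in matches:
            p_win = 1 / (1 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1 - p_win
            grad[loser] -= 1 - p_win
        # One update per pass, using the gradient summed over ALL matches.
        for m in models:
            theta[m] += lr * grad[m] / len(matches)
    return theta


def to_elo(theta, scale=400, init=1000):
    # Convert natural-log strengths to the base-10, 400-point Elo scale.
    return {m: init + scale * t / math.log(10) for m, t in theta.items()}


models = ["a", "b", "c"]
matches = (
    [("a", "b")] * 7 + [("b", "a")] * 3
    + [("b", "c")] * 6 + [("c", "b")] * 4
    + [("a", "c")] * 8 + [("c", "a")] * 2
)

shuffled = list(matches)
random.Random(0).shuffle(shuffled)

elo_original = to_elo(fit_bt(matches, models))
elo_shuffled = to_elo(fit_bt(shuffled, models))

print(elo_original)
print(elo_shuffled)
```

The two results agree up to floating-point rounding, because every update sees the whole dataset and the order of matches only changes the order of terms in a sum.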

Compute bootstrap confidence intervals

Although MLE Elo is simple and stable, people often use bootstrapping to compute confidence intervals, which makes the analysis more robust and helps to judge statistical significance. But hold on, what is a confidence interval and what does bootstrap mean?

A confidence interval (CI) is a range of values, derived from the data, that is likely to contain the value of an unknown population parameter. For instance, in the context of Elo scores, a 95% confidence interval around a score indicates that, if the procedure were repeated many times, 95% of the confidence intervals calculated would contain the true Elo score. It gives an idea of the uncertainty around the estimated value.

Bootstrap is a powerful statistical technique used to estimate the uncertainty of a statistic (like a mean, median, or, in this case, Elo scores) by resampling the data. In essence, it involves repeatedly drawing samples, with replacement, from the observed data set and recalculating the statistic for each sample. This generates an empirical distribution of the statistic which can then be used to compute confidence intervals or test hypotheses. Bootstrapping lets us assess how stable the Elo scores are: if the matches (data points) had come out differently, would we get a similar ranking and score?
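As a warm-up before applying this to Elo scores, here is a minimal sketch of a percentile bootstrap for a simpler statistic, a win rate. The helper name `bootstrap_ci`, the toy outcomes and the simple index-based percentiles are all illustrative assumptions.

```python
import random


def bootstrap_ci(outcomes, num_rounds=1000, alpha=0.05, seed=42):
    """Approximate percentile bootstrap interval for the mean of `outcomes`."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = []
    for _ in range(num_rounds):
        resample = rng.choices(outcomes, k=n)  # draw n items with replacement
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int(num_rounds * alpha / 2)]
    upper = stats[int(num_rounds * (1 - alpha / 2)) - 1]
    return lower, upper


# Toy data: 30 wins (1) and 20 losses (0), so the observed win rate is 0.6.
outcomes = [1] * 30 + [0] * 20
lower, upper = bootstrap_ci(outcomes)
win_rate = sum(outcomes) / len(outcomes)
print(f"win rate = {win_rate:.2f}, 95% CI ~ [{lower:.2f}, {upper:.2f}]")
```

The same recipe carries over to Elo: replace the mean with a full rating computation on each resampled set of battles.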

Let’s look at a python implementation below, which is borrowed from this notebook from Chatbot Arena.

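import numpy as np
import pandas as pd
from tqdm import tqdm


def get_bootstrap_result(battles, func_compute_elo, num_round):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        # Resample the battles with replacement, keeping the original size
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True)))
    df = pd.DataFrame(rows)
    # Order the columns (models) by median bootstrap rating, strongest first
    return df[df.median().sort_values(ascending=False).index]


BOOTSTRAP_ROUNDS = 100
np.random.seed(42)
bootstrap_elo_lu = get_bootstrap_result(battles, compute_mle_elo, BOOTSTRAP_ROUNDS)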

The implementation above is straightforward. It loops num_round times, as defined by BOOTSTRAP_ROUNDS. In each iteration, it resamples the dataset with replacement via battles.sample(frac=1.0, replace=True), which creates a new dataset of the same size as the original in which some battles may be repeated and others omitted. It then computes Elo scores for the resampled dataset using the provided function func_compute_elo and stores the result.

It’s true that some battles may be duplicated in a bootstrap sample, which could introduce bias if certain players are overrepresented in these duplicates. But the purpose of bootstrap resampling is to create many such samples and then analyze the distribution of the computed Elo ratings. The bias introduced by any single bootstrap sample is mitigated by averaging over many samples.

If we visualize the confidence intervals of all models, we can see that the statistical range using MLE Elo is small: each model's interval stays tightly centered around its median.

MLE Elo ranking is more stable than online Elo. Image source: notebook from Chatbot Arena

However, if we change back to online Elo, we can see that the statistical range increases a lot. In particular, the models' confidence intervals overlap significantly, which suggests that the difference in their Elo scores might not be statistically significant. In other words, the observed difference could be due to random variations in the data (like specific matchups, outliers, or other noise) rather than a true difference in the models’ performance.

Online Elo ranking is not stable, the difference in models' Elo scores might not be statistically significant. Image source: notebook from Chatbot Arena

Some notes

Can I use MLE with Bradley-Terry model for data beyond pair-wise comparison?

Not really. If the comparisons are not pairwise, the Bradley-Terry model may not be applicable as it stands because it cannot process multiple comparisons simultaneously or scenarios where the structure of competition is not strictly one-on-one. For instance, in events or competitions where outcomes involve more than two participants at a time (like races or multi-player games), alternative models are needed.

Any other methods to compute Elo?

There are many algorithms (at least extensions) to compute Elo score. One popular method, termed whole history rating (WHR), is specifically designed to more accurately reflect changes in a player’s performance over time.

As mentioned, MLE Elo computes ratings under the assumption that a player’s skill level is static or changes very slowly. It weights all matches equally regardless of when they occurred. However, in situations where player performance may change significantly over time, such as long sports careers or academic and professional development tracking, it is better to model changes in player strength over time explicitly. This is where whole history rating comes in.

For a detailed implementation, we refer you to this great GitHub repo, which you can pip install and then import whole_history_rating from. Simply put, WHR calculates ratings based on all past results, taking into account the time when each match was played, providing a more nuanced and responsive rating system that adjusts as more data becomes available.

from whr import whole_history_rating

whr = whole_history_rating.Base()

whr.create_game("shusaku", "shusai", "B", 1, 0)
whr.create_game("shusaku", "shusai", "W", 2, 0)
whr.create_game("shusaku", "shusai", "W", 3, 0)

whr.auto_iterate(time_limit=10, precision=1e-3, batch_size=10)

ratings = whr.get_ordered_ratings(current=True, compact=False)

This code snippet starts by importing the library and initializing the base WHR object. It then adds games to the system using the create_game() method, which takes the names of the black and white players, the winner (‘B’ for black, ‘W’ for white), the day number, and an optional handicap (generally less than 500 Elo). The day number is what gives the algorithm its temporal awareness. Finally, auto_iterate() refines the ratings iteratively until they are accurate and stable, and get_ordered_ratings() retrieves all ratings in order.

However, for LLMs, capabilities are relatively stable once training is done. This fits the MLE Elo assumption that a player’s skill level is static or changes very slowly. That is why, in most LLM evaluation settings, people just use MLE Elo for its simplicity and stability.

Why do people compute win rate?

Win rate is a statistical measure that represents the percentage of games, matches, or competitions that a player wins over a certain period or across a series of events. It is calculated as the number of wins divided by the total number of matches played, often expressed as a percentage. Thus, it offers a straightforward, easily understood metric of success, which complements Elo rating to make the evaluation more comprehensive.

In addition, win rates can also highlight anomalies or interesting trends that Elo might not capture. For instance, a player might have a high Elo rating but a surprisingly low win rate, as in the following scenario where the player has played 50 matches:

Wins: 15 (Against top 20 players mostly)
Losses: 35 (Loses more frequently to lower-ranked players)
Win Rate: 30%
Elo Rating: High (Due to wins against top players)

This could indicate issues such as inconsistent performance, or psychological factors at play, such as greater motivation or focus against better-known players combined with underestimating, or lacking the same drive against, lower-ranked players.

On the other hand, we can use win rate to gain insight into the accuracy and quality of the Elo ratings themselves, because Elo ratings let us predict win probabilities. If the predicted win rates match the actual win rates closely, the Elo ratings are of high quality.
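A minimal sketch of this check for a single pair of models is shown below. The ratings and the head-to-head results are made-up numbers, and the helper names are illustrative.

```python
def predicted_win_rate(r_a, r_b, scale=400, base=10):
    """Win probability of A over B implied by their Elo ratings."""
    return 1 / (1 + base ** ((r_b - r_a) / scale))


def actual_win_rate(results):
    """`results` holds 1 (win), 0.5 (tie) or 0 (loss) from A's perspective."""
    return sum(results) / len(results)


# Hypothetical ratings and 20 head-to-head results for two models.
ratings = {"model_x": 1100, "model_y": 1000}
results = [1] * 13 + [0] * 7

predicted = predicted_win_rate(ratings["model_x"], ratings["model_y"])
actual = actual_win_rate(results)
gap = abs(predicted - actual)

print(f"predicted={predicted:.3f} actual={actual:.3f} gap={gap:.3f}")
```

A small gap across many model pairs suggests the ratings are well calibrated, while large gaps point to pairs the ratings explain poorly.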

Summary

At this point, we have discussed MLE Elo, win rate, and bootstrap confidence intervals. Together, these should cover most evaluation scenarios, not just those for LLMs.

Chatbot Arena and the Elo rating system - Part 1

2024-06-20T00:00:00-07:00

Chatbot Arena, developed by members from LMSYS and UC Berkeley SkyLab, is a benchmark platform designed to evaluate large language models (LLMs) through anonymous, randomized battles in a crowdsourced environment. Launched in May 2023, it has been continuously updated to reflect the latest advancements in the field. The platform’s leaderboard is widely regarded as one of the most credible sources for ranking LLMs. The screenshot below highlights the competitive landscape featuring major players in the LLM space.

A screenshot of the leaderboard of Chatbot Arena as of 06/20/2024

But how do we get this leaderboard? What exactly is the Arena Elo score? Is this ranking manually decided by a panel of experts? How can a newly released model get so many votes and climb the ranks so quickly? And why do people trust this leaderboard so much? The answer to all these questions lies in the Elo rating system, a fascinating method used to rank competitors in a variety of games and sports.

In this blog post, we’ll dive into the Elo rating system and explore how it works. This is the first part of a series where we’ll break down the basics and show you why it’s such a popular way to rank players – and chatbots!

Why Elo rating system?

In the quest to identify the best models or to determine which ones outperform others, an impartial and reliable leaderboard becomes essential. One way to build such a leaderboard is to compute a metric like accuracy and simply rank the models by their scores from high to low. Benchmarking LLMs, however, poses significant challenges due to the open-ended nature of user queries. Traditional metrics fall short in evaluating these models automatically, as they must account for numerous perspectives and the complexities of nuanced responses.

A screenshot of ChatGPT.com which shows the open-ended nature of user queries

While some literature suggests using AI models as judges, like the popular MT-Bench [1], which utilizes GPT-4 as the evaluator, this approach has limitations. AI judges often struggle to grasp the subtleties in long and complex responses, especially in real-world use cases. This is because they don’t have feelings, motives and values like humans do. Simply put, they are not perfectly aligned with us yet. Thus, human evaluation remains indispensable. Platforms like Chatbot Arena leverage crowdsourcing to facilitate pair-wise comparisons, where models are pitted against each other in “battles” to determine which one performs better.

To convert these pair-wise comparisons into a meaningful ranking, we turn to the Elo rating system. The Elo rating system is particularly well-suited for this purpose due to its advantageous properties in benchmarking based on pairwise comparisons:

Scalability: The Elo rating system can handle a large number of models efficiently. It doesn’t require extensive data for every possible model pair, making it feasible to benchmark numerous models.
Incrementality: New models can be evaluated with relatively few trials. This feature allows for quick integration and assessment of new entries into the ranking system.
Unique Order: The Elo system provides a clear, unique ranking for all models. For any two models, it can determine which one ranks higher or if they are tied, ensuring a straightforward and comprehensible leaderboard.

By leveraging the Elo rating system, we can maintain a dynamic and accurate leaderboard that reflects the performance of various models based on comprehensive pair-wise comparisons.

What is Elo rating system?

The Elo rating system is a widely recognized method for calculating the relative skill levels of players in zero-sum games, including chess, e-sports, and now, LLM evaluations. A zero-sum game, in game theory, is a situation where the total amount of resources available is fixed. Any gain by one player results in a loss by another player, meaning the sum of the gains and losses among all players is zero. For one player to succeed, others must fail.

The Queen's Gambit. Image source: tumblr

In the context of LLM competitions, the Elo rating system can be used to evaluate and rank models based on their performance in head-to-head comparisons. Intuitively, there are three steps:

Initial Scores: Every model starts with an initial score, commonly set at 1000.
Competitions: When two models compete, and model A’s response is preferred over model B’s response, model A “wins” and takes some points from model B.
Score Adjustments: After numerous matches, models that consistently perform well and align better with human preferences (e.g., GPT-4) will have higher scores than their initial rating. Conversely, models that perform poorly will have lower scores, as they lose points to stronger models. This naturally leads to the leaderboard ranking.

But how exactly does this work? How many points should model A take from model B? How does the system work for multiple models and update their rankings in a continuous and stable manner?

Dive deep into the Elo rating system

To answer the above questions, let’s look at the simplest online linear update algorithm to compute Elo ratings. The python implementation can be seen below, which is borrowed from this notebook from Chatbot Arena.

from collections import defaultdict


def compute_online_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)

    for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        # Expected score of each model given the current ratings
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        # Actual score from model_a's perspective
        if winner == "model_a":
            sa = 1
        elif winner == "model_b":
            sa = 0
        elif winner == "tie" or winner == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {winner}")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)

    return rating

Given a collection of battle results battles, we loop through them to update models’ rankings. For each battle, we first compute the expected outcome for each model, denoted as ea and eb. Then we compare these expected outcomes with the actual competition results sa, and update the models’ ratings rating[model_a] and rating[model_b] respectively. There are two key parts we need to elaborate: (1) computing the expected outcome for each model, and (2) updating the models’ ratings.

Computing the expected outcome for each model

First of all, why do we want to compute the expected outcome? The expected outcome is crucial because it allows us to quantify the probability of each model winning based on their current ratings. This probabilistic approach ensures that the rating adjustments are fair and proportional to the models’ performance expectations. If a highly-rated model wins against a lower-rated model, the rating change should be smaller because the outcome is expected. Conversely, if an underdog wins, the rating change should be more significant, reflecting the surprising result. This will help to stabilize the ranking system. For example, GPT4 can win over most models in most cases, but given its expected win rate is high, the actual rating change is minimal. Otherwise, slightly stronger models would quickly achieve extremely high scores, while slightly weaker models would be quickly eliminated.

Second, why do we use this formula to compute the expected outcome, i.e., ea = 1 / (1 + BASE ** ((rb - ra) / SCALE)) for model A? This formula is derived from the logistic distribution and provides a smooth, continuous function that maps rating differences to probabilities of winning. Since these are probabilities, ea and eb always lie between 0 and 1:

When rb is much higher than ra, the denominator will become very large, which leads to a very small ea towards 0.
When rb is much lower than ra, the term BASE ** ((rb - ra) / SCALE) becomes very small, almost 0, so the denominator converges to 1. Then ea approaches 1, which is the upper bound.
When ra = rb, ea would be 1 / (1 + 1) = 0.5, indicating an equal chance of winning.

To summarize, as the rating difference increases, the expected outcome skews towards the higher-rated model. The choice of BASE = 10 and SCALE = 400 is conventional and ensures that a rating difference of 400 points corresponds to a 10-to-1 expected win ratio. This scale factor makes the system intuitive and easy to interpret.
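The limiting cases above can be checked numerically. This is a small sketch that evaluates the same formula for a few rating gaps; the function name `expected_score` is only for illustration.

```python
def expected_score(r_a, r_b, scale=400, base=10):
    """Expected score of A against B under the Elo formula."""
    return 1 / (1 + base ** ((r_b - r_a) / scale))


for diff in (-400, -100, 0, 100, 400):
    e_a = expected_score(1000 + diff, 1000)
    e_b = expected_score(1000, 1000 + diff)
    print(f"ra - rb = {diff:+5d} -> ea = {e_a:.3f}, eb = {e_b:.3f}")
```

With a 400-point lead, ea is 10/11 (about 0.909) against 1/11 for the opponent, which is exactly the 10-to-1 ratio mentioned above, and ea and eb always sum to 1.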

Updating the models’ ratings

Once we have the expected outcomes and the actual results from the battles, we are able to update the models’ ratings. The rating update formula is:

rating[model_a] = ra + K * (sa - ea)

where:

ra is the original rating of model A before the battle.
K indicates the maximum change in rating (commonly set to 32 for chess but can vary). The default value from Chatbot Arena sets K=4 because they want to make the Elo ratings more stable and less biased towards recent games. We will talk about this bias problem in just one minute.
sa represents the actual result after the game (1 for a win, 0.5 for a draw, 0 for a loss).

One interesting thing is if you look at the formula closely, you will find that the lower-rated player will also gain a few points from the higher-rated player in the event of a draw. This means that this rating system is self-correcting. Players whose ratings are too low or too high should, in the long run, do better or worse correspondingly than the rating system predicts and thus gain or lose rating points until the ratings reflect their true playing strength.
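The draw case is easy to verify with a single update. The sketch below applies the update rule once with K=4 to two illustrative ratings; `elo_update` is a hypothetical helper name.

```python
def elo_update(r_a, r_b, s_a, k=4, scale=400, base=10):
    """Return the new ratings of A and B after one game with score s_a for A."""
    e_a = 1 / (1 + base ** ((r_b - r_a) / scale))
    e_b = 1 - e_a
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1 - s_a) - e_b)
    return new_a, new_b


# A 1000-rated player draws against a 1200-rated player.
low, high = elo_update(1000, 1200, s_a=0.5)
print(round(low, 2), round(high, 2))
```

The lower-rated player was expected to score only about 0.24, so a draw (0.5) beats expectations and earns roughly one point, which the higher-rated player loses.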

Another interesting point is, this formula looks very similar to the update rule used in Stochastic Gradient Descent (SGD). In SGD, the update rule is:

w' = w − η * ∇L(w)

where:

w represents the old model weights and w' represents the updated model weights
η is the learning rate
∇L(w) is the gradient of the loss function with respect to the model parameters.

Comparatively, the Elo rating update rule can be seen as:

The rating of the model ra is analogous to the model parameters w in SGD
The scaling factor K is similar to the learning rate η, controlling the step size
The score difference sa - ea is akin to the gradient, representing the error between the actual outcome (ground truth) and the expected outcome (model prediction).

This similarity illustrates how the Elo rating system can be viewed as an iterative optimization process. Just like in machine learning where models improve through training, the Elo rating system allows models to “learn” from each match, progressively refining their ratings.
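The analogy can be made exact if we assume the loss is the log loss of the predicted win probability ea against the actual result sa. Under that assumption, the gradient with respect to ra is -(ln(BASE) / SCALE) * (sa - ea), so the Elo update is an SGD step with K = η * ln(BASE) / SCALE. The sketch below checks the gradient numerically; the function names are illustrative.

```python
import math


def expected_score(r_a, r_b, scale=400, base=10):
    return 1 / (1 + base ** ((r_b - r_a) / scale))


def log_loss(r_a, r_b, s_a, scale=400, base=10):
    """Negative log-likelihood of one game result under the Elo model."""
    e_a = expected_score(r_a, r_b, scale, base)
    return -(s_a * math.log(e_a) + (1 - s_a) * math.log(1 - e_a))


def analytic_grad(r_a, r_b, s_a, scale=400, base=10):
    """Closed-form derivative of log_loss with respect to r_a."""
    e_a = expected_score(r_a, r_b, scale, base)
    return -(math.log(base) / scale) * (s_a - e_a)


r_a, r_b, s_a = 1050.0, 1000.0, 1.0
eps = 1e-4
# Central finite difference as an independent check of the formula.
numeric = (log_loss(r_a + eps, r_b, s_a) - log_loss(r_a - eps, r_b, s_a)) / (2 * eps)
analytic = analytic_grad(r_a, r_b, s_a)

print(numeric, analytic)
```

Stepping against this gradient moves ra in the direction of sa - ea, which is the Elo update up to a constant factor absorbed into K.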

Some notes

While the Elo rating system is widely used and effective in many contexts, there are some interesting notes worth discussing.

Is Elo ranking an absolute metric?

Elo ratings are comparative only, and are valid only within the rating pool in which they were calculated, rather than being an absolute measure of a player’s strength. Your rating this year may not be the same as your rating next year, even if your ability stays the same. A high Elo rating just means a player is good within this pool; it does not necessarily mean they are good universally.
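One simple way to see that ratings carry only relative information: the expected outcome depends on the rating difference alone, so shifting every rating in a pool by the same constant changes nothing. A minimal sketch with made-up ratings:

```python
def expected_score(r_a, r_b, scale=400, base=10):
    return 1 / (1 + base ** ((r_b - r_a) / scale))


pool = {"a": 1200, "b": 1000}
# Same pool with every rating shifted up by 500 points.
shifted = {name: r + 500 for name, r in pool.items()}

p_original = expected_score(pool["a"], pool["b"])
p_shifted = expected_score(shifted["a"], shifted["b"])

print(p_original, p_shifted)
```

Both pools predict the same win probability, so the absolute numbers (1200 versus 1700) say nothing on their own.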

The Queen's Gambit. Image source: yarn clips

Bias towards recent battles

In terms of the online linear update algorithm to compute Elo ratings, it is sensitive to recent results, because the ratings are updated sequentially, meaning each new result builds on the last updated rating. This sequential dependency can amplify the impact of recent results, especially if they deviate significantly from expected outcomes. This can lead to significant fluctuations in ratings if the rating update order is changed.

To demonstrate this, the notebook from Chatbot Arena recomputes the Elo ratings using the reversed game order and observes a significant difference, because the online update biases ratings toward recent games. When the order is reversed, the top model changes from gemini-1.5-pro-api-0409-preview to gpt-4o-2024-05-13, and the ratings of all other models also change significantly.

Online Elo ranking is not stable. Image source: notebook from Chatbot Arena

Sensitivity to matchmaking

The rating updates depend heavily on the matchmaking process. If matches are not balanced (e.g., pairing high-rated models against very low-rated models frequently), the ratings can become distorted. Hence, the matchmaking process should also be handled carefully.

Chatbot Arena mentioned that they had adopted several different matching and sampling algorithms. They employed uniform sampling as well as weighted sampling methods, which assign greater weights to better models. That is probably why new models like GPT-4o get to the top of the leaderboard soon after release.

Summary

So, next time you need to rank something and find yourself without a clear metric, remember the Elo rating system. It’s a proven approach that can turn a series of individual comparisons into a meaningful and dynamic leaderboard.

References

[1] Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS 2023 Datasets and Benchmarks Track.