Multiple Choice Normalization in LM Evaluation
There are multiple ways of evaluating multiple choice tasks on autoregressive LMs like GPT-3/Neo/J. This post lays out the normalization methods currently in common use.
Let $x_{0:m}$ be the prompt, and $x_{m:n_i}$ be the $i$th possible continuation, with a token length of $n_i - m$. Since the language model only gives (log) probabilities for the next token given its context (i.e., $\log \mathbb P(x_i|x_{0:i})$), there is ambiguity in how to score an arbitrary multi-token continuation. The following are several possible ways to resolve this problem:
- Unnormalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j})$. Intuitively, this is the probability of a generation sampled from the prompt containing the continuation in question. While this is the simplest method, problems arise when there are significant differences in length between continuations: longer continuations tend to have lower total log probability, biasing the model toward picking shorter continuations. This approach is used by eval harness in all multiple choice tasks and reported as `acc`.
- Token-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / (n_i - m)$. This approach attempts to normalize for length by computing the average log probability per token; however, it is not tokenization agnostic: two models with different tokenizers that assign the same log likelihood to every input string can still receive different token-length normalized scores. This approach is used by GPT-3 in most tasks. Eval harness does not report this score because it violates the design principle that all tasks should be tokenization independent.
- Byte-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / \sum_{j=m}^{n_i - 1} L_{x_j}$, where $L_{x_j}$ is the number of bytes represented by the token $x_j$. This approach attempts to normalize for length by computing the average log probability per byte, which makes it tokenization agnostic. This approach is also used by eval harness in all multiple choice tasks and reported as `acc_norm`.
- Unconditional likelihood normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \left[\log \mathbb P(x_j|x_{0:j}) - \log \mathbb P(x_j|x_{m:j})\right]$, where the second term is the log probability the model assigns to the continuation token given only the preceding continuation tokens, without the prompt. Intuitively, this approach measures how much the prompt increases the model's probability of outputting each continuation relative to the probability of the model producing that continuation unconditionally. This approach is used by GPT-3 in select tasks (ARC, OpenBookQA, and RACE), though the only justification given for using it on these particular tasks is that it improves performance.
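The four scoring rules above can be sketched as follows. This is a minimal sketch in plain Python; `score_continuation` and `pick_answer` are hypothetical helper names rather than the eval harness API, and the inputs are assumed to be per-token log probabilities already extracted from the model:

```python
def score_continuation(token_logprobs, token_bytes, uncond_logprobs=None):
    """Score one continuation under the normalization schemes above.

    token_logprobs:  log P(x_j | x_{0:j}) for each continuation token
    token_bytes:     number of bytes each continuation token decodes to
    uncond_logprobs: log probs of the same tokens without the prompt
                     (needed only for unconditional likelihood normalization)
    """
    total = sum(token_logprobs)
    scores = {
        "acc": total,                               # unnormalized sum
        "token_norm": total / len(token_logprobs),  # per-token average
        "acc_norm": total / sum(token_bytes),       # per-byte average
    }
    if uncond_logprobs is not None:
        # conditional minus unconditional log likelihood
        scores["uncond_norm"] = total - sum(uncond_logprobs)
    return scores


def pick_answer(scored_continuations, key):
    """Index of the highest-scoring continuation under the chosen scheme."""
    return max(range(len(scored_continuations)),
               key=lambda i: scored_continuations[i][key])
```

With equal total log probability, a longer continuation ties a shorter one under `acc` but wins under the length-normalized schemes (its per-token and per-byte averages are higher), which is exactly the length bias the normalizations address.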
The unnormalized, token-length normalized, and byte-length normalized metrics can all be computed from a single LM call per continuation. The unconditional likelihood normalized metric requires one additional LM call per continuation to obtain the unconditional likelihood.
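The single-call property holds because one forward pass over prompt-plus-continuation yields logits at every position, from which each $\log \mathbb P(x_j|x_{0:j})$ can be read off with a log-softmax. A minimal sketch in plain Python (not numerically stable; `continuation_logprobs` is a hypothetical helper, and a real implementation would use a framework's stable log-softmax):

```python
import math

def continuation_logprobs(logits, continuation_tokens):
    """Read off log P(x_j | x_{0:j}) for each continuation token.

    logits: one logit vector per continuation position, i.e. logits[k] is the
            model's output at the position that predicts continuation token k
            (a real LM gives these offset by one; assumed pre-aligned here).
    """
    out = []
    for row, tok in zip(logits, continuation_tokens):
        # log-softmax: log p(tok) = logit[tok] - log(sum_v exp(logit[v]))
        log_z = math.log(sum(math.exp(v) for v in row))
        out.append(row[tok] - log_z)
    return out
```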