Multiple Choice Normalization in LM Evaluation
There are multiple ways of evaluating multiple choice tasks on autoregressive LMs like GPT-3/Neo/J. This post lays out the normalization methods currently in common use.
Let $x_{0:m}$ be the prompt, and $x_{m:n_i}$ be the $i$th possible continuation, with a token length of $n_i - m$. Since the language model only gives (log) probabilities for the next token given its context (i.e., $\log \mathbb P(x_i|x_{0:i})$), there is ambiguity in how to score an arbitrary multi-token continuation. The following are several possible ways to resolve this problem:
- Unnormalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j})$. Intuitively, this is the probability of a generation sampled from the prompt containing the continuation in question. While this is the simplest method, problems arise when there are significant differences in length between continuations: longer continuations tend to have lower total log probability, biasing the model toward picking shorter continuations. This approach is used by eval harness in all multiple choice tasks and reported as `acc`.
- Token-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / (n_i - m)$. This approach attempts to normalize for length by computing the average log probability per token; however, it is not tokenization agnostic: two models with different tokenizers that assign the same log likelihood to every input string can still receive different token-length normalized scores. This approach is used by GPT-3 in most tasks. Eval harness does not report this score because it violates the design principle that all tasks should be tokenization independent.
- Byte-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / \sum_{j=m}^{n_i - 1} L_{x_j}$, where $L_{x_j}$ is the number of bytes represented by the token $x_j$. This approach attempts to normalize for length by computing the average log probability per byte, which makes it tokenization agnostic. This approach is also used by eval harness in all multiple choice tasks and reported as `acc_norm`.
- Unconditional likelihood normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \left[\log \mathbb P(x_j|x_{0:j}) - \log \mathbb P(x_j|x_{m:j})\right]$, where the second term is the log probability the model assigns to the continuation token given only the preceding continuation tokens, without the prompt. Intuitively, this approach measures how much the prompt increases the model's probability of outputting each continuation relative to the probability of the model producing that continuation unconditionally. This approach is used by GPT-3 in select tasks (ARC, OpenBookQA, and RACE), though the only justification given for using it on these particular tasks is that it improves performance.
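The four scoring rules above can be sketched as follows. This is a minimal sketch in plain Python; `score_continuation` and `pick_answer` are hypothetical helper names rather than the eval harness API, and the inputs are assumed to be per-token log probabilities already extracted from the model:

```python
def score_continuation(token_logprobs, token_bytes, uncond_logprobs=None):
    """Score one continuation under the normalization schemes above.

    token_logprobs:  log P(x_j | x_{0:j}) for each continuation token
    token_bytes:     number of bytes each continuation token decodes to
    uncond_logprobs: log probs of the same tokens without the prompt
                     (needed only for unconditional likelihood normalization)
    """
    total = sum(token_logprobs)
    scores = {
        "acc": total,                               # unnormalized sum
        "token_norm": total / len(token_logprobs),  # per-token average
        "acc_norm": total / sum(token_bytes),       # per-byte average
    }
    if uncond_logprobs is not None:
        # conditional minus unconditional log likelihood
        scores["uncond_norm"] = total - sum(uncond_logprobs)
    return scores


def pick_answer(scored_continuations, key):
    """Index of the highest-scoring continuation under the chosen scheme."""
    return max(range(len(scored_continuations)),
               key=lambda i: scored_continuations[i][key])
```

With equal total log probability, a longer continuation ties a shorter one under `acc` but wins under the length-normalized schemes (its per-token and per-byte averages are higher), which is exactly the length bias the normalizations address.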
The unnormalized, token-length normalized, and byte-length normalized metrics can all be computed from a single LM call per continuation. The unconditional likelihood normalized metric requires one additional LM call per continuation to obtain the unconditional likelihood.
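The single-call property holds because one forward pass over prompt-plus-continuation yields logits at every position, from which each $\log \mathbb P(x_j|x_{0:j})$ can be read off with a log-softmax. A minimal sketch in plain Python (not numerically stable; `continuation_logprobs` is a hypothetical helper, and a real implementation would use a framework's stable log-softmax):

```python
import math

def continuation_logprobs(logits, continuation_tokens):
    """Read off log P(x_j | x_{0:j}) for each continuation token.

    logits: one logit vector per continuation position, i.e. logits[k] is the
            model's output at the position that predicts continuation token k
            (a real LM gives these offset by one; assumed pre-aligned here).
    """
    out = []
    for row, tok in zip(logits, continuation_tokens):
        # log-softmax: log p(tok) = logit[tok] - log(sum_v exp(logit[v]))
        log_z = math.log(sum(math.exp(v) for v in row))
        out.append(row[tok] - log_z)
    return out
```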