
Multiple Choice Normalization in LM Evaluation

EleutherAI Blog · by Leo Gao · October 11, 2021 · 3 min read

There are multiple ways of evaluating multiple choice tasks on autoregressive LMs like GPT-3/Neo/J. This post lays out the current prevalent normalization methods.

Let $x_{0:m}$ be the prompt, and $x_{m:n_i}$ be the $i$th possible continuation, with a token length of $n_i - m$. There are several ways to use a language model to rank multiple possible continuations of a prompt. Since the language model only gives (log) probabilities for the next token given the context (i.e. $\log \mathbb P(x_j|x_{0:j})$), there is ambiguity in how to score arbitrary multi-token continuations. The following are several possible ways to resolve this problem:
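As a concrete sketch (pure Python with made-up per-token log probabilities rather than a real model; `continuation_logprob` is an illustrative helper, not part of any library), scoring a continuation reduces to summing the conditional log probabilities of its tokens:

```python
def continuation_logprob(token_logprobs):
    """Sum log P(x_j | x_{0:j}) over the continuation tokens x_{m:n_i}.

    token_logprobs holds the per-token conditional log probabilities of
    the continuation only; the prompt tokens are context, not scored.
    """
    return sum(token_logprobs)

# Hypothetical per-token log probabilities for two continuations of the
# same prompt: a short answer (1 token) and a longer one (2 tokens).
short = [-2.0]
long_ = [-1.5, -1.2]

# The longer continuation ends up with a lower total log probability
# even though each of its tokens is individually more likely.
print(continuation_logprob(short))  # -2.0
print(continuation_logprob(long_))
```

This length effect is exactly the bias the normalization methods below try to correct.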

  • Unnormalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j})$. Intuitively, this is the probability of a generation sampled from the prompt containing the continuation in question. While this is the simplest method, problems arise when there are significant differences in length between different continuations, as longer continuations tend to have lower log probabilities, thus biasing the language model towards picking shorter continuations. This approach is used by eval harness in all multiple choice tasks and presented as acc.

  • Token-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / (n_i - m)$. This approach attempts to normalize for length by computing average log probability per token; however, this approach is not tokenization agnostic, and as such two models with different tokenization that assign the same log likelihood to every single input string will have different token-length normalized scores. This approach is used by GPT-3 in most tasks. Eval harness does not report this score because it violates the design principle that all tasks should be tokenization independent.

  • Byte-length normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j}) / \sum_{j=m}^{n_i - 1} L_{x_j}$, where $L_{x_j}$ is the number of bytes represented by the token $x_j$. This approach attempts to normalize for length by computing average log probability per byte, which ensures that it is tokenization agnostic. This approach is also used by eval harness in all multiple choice tasks and presented as acc_norm.

  • Unconditional likelihood normalized: The score of continuation $i$ is determined using $\sum_{j=m}^{n_i - 1} \left[ \log \mathbb P(x_j|x_{0:j}) - \log \mathbb P(x_j|x_{m:j}) \right]$, where the subtracted term is the log probability of each token given only the preceding continuation tokens. Intuitively, this approach measures how much the prompt increases the model's probability of outputting each continuation, relative to the probability of the model producing that continuation unconditionally. This approach is used by GPT-3 in select tasks (ARC, OpenBookQA, and RACE), though the only justification given for why these tasks in particular use this method is that it improves performance.

The unnormalized, token-length normalized, and byte-length normalized metrics can be computed without additional LM calls. The unconditional likelihood normalized metric requires an additional LM call to obtain the unconditional likelihood.
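Putting the four rules together, here is a minimal sketch in Python, using toy per-token log probabilities and byte lengths; `score_continuation` is a hypothetical helper, not the eval harness API:

```python
def score_continuation(cond_logprobs, token_bytes, uncond_logprobs=None):
    """Score one continuation under the four rules described above.

    cond_logprobs:   log P(x_j | x_{0:j}) for each continuation token
    token_bytes:     byte length L_{x_j} of each continuation token
    uncond_logprobs: log probabilities of the same tokens without the
                     prompt (requires the extra LM call noted above)
    """
    total = sum(cond_logprobs)
    scores = {
        "unnormalized": total,                     # "acc" in eval harness
        "token_norm": total / len(cond_logprobs),  # GPT-3's usual method
        "byte_norm": total / sum(token_bytes),     # "acc_norm" in eval harness
    }
    if uncond_logprobs is not None:
        scores["uncond_norm"] = total - sum(uncond_logprobs)
    return scores

# Toy numbers: a 1-token, 3-byte continuation vs. a 2-token, 16-byte one.
short = score_continuation([-2.0], [3])
long_ = score_continuation([-1.5, -1.2], [8, 8])

# Unnormalized scoring prefers the short continuation...
assert short["unnormalized"] > long_["unnormalized"]
# ...while byte-length normalization flips the ranking.
assert long_["byte_norm"] > short["byte_norm"]

# The fourth rule needs unconditional log probabilities from a second call.
both = score_continuation([-1.5, -1.2], [8, 8], uncond_logprobs=[-4.0, -3.0])
assert both["uncond_norm"] > 0  # the prompt makes this continuation more likely
```

Note how the choice of normalization can change which continuation wins, which is why eval harness reports both acc and acc_norm.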
