Brevity Constraints Reverse Performance Hierarchies in Language Models
Large language models can underperform smaller ones due to verbose responses that introduce errors, but constraining output length reveals their superior capabilities and improves performance across benchmarks.
We evaluate 31 language models (0.5B–405B parameters) across 1,485 problems from five standard benchmarks and identify a systematic but correctable failure mode: on 7.7% of problems, small models (≤10B) outperform large models (≥70B) by an average of 28.4 percentage points (Cohen's d = 1.34).
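The two headline statistics in this paragraph, the share of problems where the size ordering inverts and the effect size of the gap, can be computed roughly as below. This is a minimal sketch under stated assumptions: `inversion_rate` and `cohens_d` are hypothetical helper names, each input list is assumed to hold per-problem accuracies for the small-model (≤10B) and large-model (≥70B) groups, and Cohen's d uses the standard pooled-standard-deviation form. The abstract does not say exactly how the authors grouped scores.

```python
import math
from statistics import mean, variance


def inversion_rate(small_acc: list[float], large_acc: list[float]) -> float:
    """Fraction of problems on which the small-model group scores strictly higher.

    Hypothetical helper: small_acc[i] and large_acc[i] are assumed to be the
    mean accuracies of each size group on problem i.
    """
    if not small_acc or len(small_acc) != len(large_acc):
        raise ValueError("need two non-empty lists of equal length")
    wins = sum(s > g for s, g in zip(small_acc, large_acc))
    return wins / len(small_acc)


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d with a pooled standard deviation (positive when group_a is higher)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pool the sample variances (n - 1 denominators), weighted by degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
```

For example, `inversion_rate([1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.5, 0.1])` returns 0.5: only the first and last problems favor the small group, and ties do not count.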
The mechanism is scale-dependent verbosity. Large models spontaneously generate responses 59% longer than small models on these problems, not through more explicit reasoning steps (9.1 vs 10.5) but through verbose implicit elaboration that accumulates errors. We call this overthinking.
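The "59% longer" comparison is a relative increase in mean response length. A minimal sketch, assuming length is counted in whitespace-separated words (the paper may count tokens instead); `word_count` and `relative_length_increase` are hypothetical helper names.

```python
from statistics import mean


def word_count(text: str) -> int:
    # Whitespace tokenization; a tokenizer-based count would give different numbers.
    return len(text.split())


def relative_length_increase(large_responses: list[str], small_responses: list[str]) -> float:
    """Percent by which mean large-model length exceeds mean small-model length."""
    large_mean = mean(word_count(r) for r in large_responses)
    small_mean = mean(word_count(r) for r in small_responses)
    if small_mean == 0:
        raise ValueError("small-model responses are empty")
    return 100.0 * (large_mean - small_mean) / small_mean
```

Under these assumptions, a return value of 59.0 would correspond to the length gap reported above.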
A simple intervention reverses the hierarchy. Adding brevity constraints ("answer in under 50 words") improves large-model accuracy by 26.3pp and reduces the performance gap by 67%. Critically, on GSM8K and MMLU-STEM the gap does not merely close but fully reverses: large models go from losing by 13.1pp and 27.3pp to winning by 7.7pp and 15.9pp, respectively.
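The intervention itself is only a prompt change. The sketch below shows one way to apply it and to track the signed gap. `with_brevity_constraint`, `within_limit`, `signed_gap`, and `gap_swing` are hypothetical names, the exact prompt wording and placement beyond the quoted phrase are assumptions, and "under 50 words" is read as strictly fewer than 50 whitespace-separated words.

```python
def with_brevity_constraint(prompt: str, max_words: int = 50) -> str:
    """Append an explicit length limit to a benchmark prompt (placement is an assumption)."""
    return f"{prompt.rstrip()}\n\nAnswer in under {max_words} words."


def within_limit(response: str, max_words: int = 50) -> bool:
    # "Under 50 words" is interpreted as strictly fewer than max_words words.
    return len(response.split()) < max_words


def signed_gap(large_acc: float, small_acc: float) -> float:
    """Large-minus-small accuracy in percentage points; negative means large models lose."""
    return large_acc - small_acc


def gap_swing(gap_before: float, gap_after: float) -> float:
    """Change in the signed gap after the intervention, in percentage points."""
    return gap_after - gap_before
```

With the GSM8K figures from the abstract, the signed gap moves from −13.1pp to +7.7pp, so `gap_swing(-13.1, 7.7)` is about 20.8pp; for MMLU-STEM, `gap_swing(-27.3, 15.9)` is about 43.2pp.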
This effect is architecture-independent, replicating across the Llama, Qwen, Gemma, and Mistral families (5/5 datasets each), and operates continuously across the full parameter spectrum (Pearson r = −0.388, p = 0.0035).
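This excerpt reports a single correlation coefficient without naming the two variables. As a hedged sketch, assume x is log10 of parameter count (in billions) and y is some per-model score; `pearson_r` and `log_params` are hypothetical helpers that compute the sample correlation directly. If SciPy is available, `scipy.stats.pearsonr` returns the same coefficient together with a p-value.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


def log_params(sizes_in_billions: list[float]) -> list[float]:
    # A log scale spreads 0.5B to 405B models more evenly; this choice is an assumption.
    return [math.log10(s) for s in sizes_in_billions]
```

In this sketch, `pearson_r(log_params(sizes), scores)` with your own per-model data gives r; a negative value means the score tends to decrease as log model size increases.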
Three independent contamination tests (response diversity: 89–100% unique; length variability: CV = 0.31–1.21; error taxonomy: 41–82% over-reasoning failures) confirm genuine capability differences rather than memorization artifacts.
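Two of the three contamination checks reduce to simple statistics. A minimal sketch, assuming "unique" means distinct after collapsing whitespace and that CV is the population standard deviation of response lengths divided by their mean; the abstract specifies neither definition, and both function names are hypothetical. The error-taxonomy check depends on labeling failures by hand or by rubric and is not shown.

```python
from statistics import mean, pstdev


def unique_fraction(responses: list[str]) -> float:
    """Share of responses that are distinct after whitespace normalization."""
    if not responses:
        raise ValueError("no responses given")
    normalized = [" ".join(r.split()) for r in responses]
    return len(set(normalized)) / len(normalized)


def coefficient_of_variation(lengths: list[float]) -> float:
    """Population standard deviation of lengths divided by their mean."""
    m = mean(lengths)
    if m == 0:
        raise ValueError("mean length is zero; CV is undefined")
    return pstdev(lengths) / m
```

Low uniqueness or near-zero CV across repeated samples would be consistent with memorized outputs; values in the ranges reported above point the other way.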
The implication: inverse scaling on standard benchmarks reflects prompt design failure, not architectural limitation. Large models possess superior latent capabilities that universal prompting obscures. Scale-aware prompt engineering, not larger models or retraining, is sufficient to recover them.