The Effect of Attention Head Count on Transformer Approximation
Abstract: The Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads the number of parameters must scale at least as $\Omega(1/\epsilon^{cT})$ for some constant $c$, where $T$ is the sequence length. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, in which case approximation is achieved entirely by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.
Comments: Accepted by ICLR 2026
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:2510.06662 [cs.LG]
(or arXiv:2510.06662v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2510.06662
arXiv-issued DOI via DataCite
Submission history
From: Penghao Yu
[v1] Wed, 8 Oct 2025 05:27:25 UTC (443 KB)
[v2] Tue, 31 Mar 2026 07:14:19 UTC (454 KB)
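To make the abstract's central object concrete, here is a minimal toy sketch of a retrieval-style task paired with a multi-head attention layer sized to it, written in PyTorch. This is an illustrative assumption, not the paper's construction: the names `build_retrieval_batch`, `T`, `D`, and `d_model` are invented for this sketch, and the paper's generalized $D$-retrieval task is defined precisely in the PDF.

```python
# Toy retrieval-style task (illustrative only; not the paper's exact setup):
# given a sequence of T vectors and D query positions, produce the D
# vectors stored at those positions.
import torch

T, D, d_model = 16, 4, 32  # sequence length, retrieval arity, embedding dim

def build_retrieval_batch(batch=8):
    """Random sequences plus D query indices; targets are the queried vectors."""
    x = torch.randn(batch, T, d_model)            # (batch, T, d_model)
    idx = torch.randint(0, T, (batch, D))         # D positions to retrieve
    tgt = torch.stack([x[b, idx[b]] for b in range(batch)])  # (batch, D, d_model)
    return x, idx, tgt

# One attention layer with D heads, matching the regime where the abstract
# says efficient approximation is possible.
mha = torch.nn.MultiheadAttention(d_model, num_heads=D, batch_first=True)

x, idx, tgt = build_retrieval_batch()
# Use the embeddings at the queried positions as queries over the sequence.
q = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d_model))
out, _ = mha(q, x, x)  # (batch, D, d_model); trained to match tgt
```

The design choice mirrors the abstract's dichotomy: with at least $D$ heads, the layer can in principle dedicate one head per queried position, whereas in the single-head case the abstract indicates that retrieval must instead be recovered by memorizing the input via an embedding dimension of order $O(T)$, with the feed-forward block doing the approximation.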