Downstream Evaluations of Rotary Position Embeddings
A head-to-head comparison of Rotary Position Embedding and GPT-style learned position embeddings. Both 1.3B-parameter models were trained for 100k steps on the Pile using Mesh Transformer JAX. No strong trend emerges in favor of either scheme, but hopefully someone will find these results useful regardless.
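For context on what is being swapped out: instead of adding a learned vector per absolute position, RoPE rotates each pair of query/key dimensions by a position-dependent angle (Su et al., 2021). Below is a minimal JAX sketch of that formulation; the function name, tensor layout, and rotating the full head dimension are illustrative assumptions, not the exact Mesh Transformer JAX code (which, among other differences, may rotate only part of each head's dimensions).

```python
import jax.numpy as jnp

def apply_rotary(x, base=10000.0):
    # x: [seq_len, n_heads, head_dim] queries or keys; head_dim must be even.
    seq_len, _, head_dim = x.shape
    # Per-pair rotation frequencies: theta_i = base^(-2i / head_dim).
    inv_freq = 1.0 / (base ** (jnp.arange(0, head_dim, 2) / head_dim))
    # Rotation angle for position m and pair i is m * theta_i.
    angles = jnp.arange(seq_len, dtype=jnp.float32)[:, None] * inv_freq[None, :]
    sin = jnp.sin(angles)[:, None, :]   # [seq_len, 1, head_dim // 2]
    cos = jnp.cos(angles)[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]  # even / odd dimension pairs
    # Standard 2-D rotation of each (x1, x2) pair by its angle.
    out = jnp.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return out.reshape(x.shape)  # re-interleave the rotated pairs
```

Because the rotation angle depends only on a token's position, the dot product between a rotated query and a rotated key depends only on their relative offset, which is the property RoPE is built around.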
| Task | Metric | Learned | Rotary |
|------|--------|---------|--------|
| lambada | ppl | 7.940 ± 0.208 | 7.156 ± 0.208 |
| lambada | acc | 0.556 ± 0.007 | 0.567 ± 0.007 |
| piqa | acc | 0.700 ± 0.011 | 0.714 ± 0.011 |
| piqa | acc_norm | 0.693 ± 0.011 | 0.709 ± 0.011 |
| hellaswag | acc | 0.376 ± 0.005 | 0.389 ± 0.005 |
| hellaswag | acc_norm | 0.472 ± 0.005 | 0.488 ± 0.005 |
| winogrande | acc | 0.540 ± 0.014 | 0.571 ± 0.014 |
| mathqa | acc | 0.231 ± 0.008 | 0.230 ± 0.008 |
| mathqa | acc_norm | 0.234 ± 0.008 | 0.227 ± 0.008 |
| pubmedqa | acc | 0.599 ± 0.015 | 0.583 ± 0.015 |
| boolq | acc | 0.575 ± 0.009 | 0.614 ± 0.009 |
| anli_r3 | acc | 0.344 ± 0.014 | 0.351 ± 0.014 |
| openbookqa | acc | 0.198 ± 0.018 | 0.206 ± 0.018 |
| openbookqa | acc_norm | 0.316 ± 0.021 | 0.330 ± 0.021 |
| triviaqa | acc | 0.041 ± 0.002 | 0.026 ± 0.002 |
| arc_challenge | acc | 0.235 ± 0.012 | 0.230 ± 0.012 |
| arc_challenge | acc_norm | 0.260 ± 0.013 | 0.272 ± 0.013 |
| arc_easy | acc | 0.564 ± 0.010 | 0.568 ± 0.010 |
| arc_easy | acc_norm | 0.505 ± 0.010 | 0.486 ± 0.010 |
| cb | acc | 0.375 ± 0.065 | 0.357 ± 0.065 |
| cola | mcc | 0.042 ± 0.034 | 0.022 ± 0.034 |
| copa | acc | 0.730 ± 0.044 | 0.730 ± 0.044 |
| ethics_cm | acc | 0.491 ± 0.008 | 0.480 ± 0.008 |
| ethics_deontology | acc | 0.497 ± 0.008 | 0.497 ± 0.008 |
| ethics_justice | acc | 0.501 ± 0.010 | 0.501 ± 0.010 |
| ethics_utilitarianism | acc | 0.497 ± 0.007 | 0.493 ± 0.007 |
| ethics_virtue | acc | 0.200 ± 0.006 | 0.200 ± 0.006 |
| headqa | acc | 0.227 ± 0.008 | 0.224 ± 0.008 |
| headqa | acc_norm | 0.270 ± 0.008 | 0.271 ± 0.008 |
| logiqa | acc | 0.221 ± 0.016 | 0.215 ± 0.016 |
| logiqa | acc_norm | 0.293 ± 0.018 | 0.283 ± 0.018 |
| mnli | acc | 0.344 ± 0.005 | 0.344 ± 0.005 |
| mnli_mismatched | acc | 0.345 ± 0.005 | 0.349 ± 0.005 |
| mrpc | acc | 0.684 ± 0.023 | 0.684 ± 0.023 |
| mrpc | f1 | 0.812 ± 0.017 | 0.812 ± 0.017 |
| qa4mre_2011 | acc | 0.392 ± 0.045 | 0.358 ± 0.045 |
| qa4mre_2011 | acc_norm | 0.450 ± 0.045 | 0.433 ± 0.045 |
| qa4mre_2012 | acc | 0.287 ± 0.036 | 0.312 ± 0.036 |
| qa4mre_2012 | acc_norm | 0.394 ± 0.039 | 0.400 ± 0.039 |
| qa4mre_2013 | acc | 0.335 ± 0.028 | 0.335 ± 0.028 |
| qa4mre_2013 | acc_norm | 0.352 ± 0.028 | 0.349 ± 0.028 |
| qnli | acc | 0.498 ± 0.007 | 0.517 ± 0.007 |
| qqp | acc | 0.370 ± 0.002 | 0.368 ± 0.002 |
| qqp | f1 | 0.538 ± 0.003 | 0.538 ± 0.003 |
| race | acc | 0.345 ± 0.015 | 0.343 ± 0.015 |
| record | f1 | 0.805 ± 0.004 | 0.813 ± 0.004 |
| record | em | 0.797 ± 0.004 | 0.805 ± 0.004 |
| rte | acc | 0.538 ± 0.030 | 0.523 ± 0.030 |
| sciq | acc | 0.867 ± 0.011 | 0.865 ± 0.011 |
| sciq | acc_norm | 0.796 ± 0.013 | 0.771 ± 0.013 |
| sst | acc | 0.572 ± 0.017 | 0.519 ± 0.017 |
| webqs | acc | 0.021 ± 0.003 | 0.006 ± 0.003 |
| wic | acc | 0.500 ± 0.020 | 0.498 ± 0.020 |
| wnli | acc | 0.437 ± 0.059 | 0.549 ± 0.059 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| wsc273 | acc | 0.722 ± 0.027 | 0.736 ± 0.027 |
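A note on reading the Metric column: the task names and the acc/acc_norm pairing match the conventions of EleutherAI's lm-evaluation-harness (an assumption about the tooling, not stated above). There, acc scores a multiple-choice item by picking the answer with the highest log-likelihood, while acc_norm first normalizes each answer's log-likelihood by its length, so the two metrics can disagree, as they do in several rows above. A minimal sketch with made-up numbers:

```python
import jax.numpy as jnp

# Illustrative per-choice log-likelihoods and lengths for one
# multiple-choice item (made-up numbers, not drawn from the table).
logliks = jnp.array([-12.3, -9.8, -15.1, -11.0])  # log p(choice | context)
lengths = jnp.array([24.0, 20.0, 18.0, 40.0])     # choice length in bytes

pred_acc = int(jnp.argmax(logliks))                 # "acc" picks choice 1
pred_acc_norm = int(jnp.argmax(logliks / lengths))  # "acc_norm" picks choice 3
```

Length normalization counteracts the bias toward short answers that raw log-likelihood scoring introduces, since longer continuations accumulate more negative log-probability.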