Downstream Evaluations of Rotary Position Embeddings
A head-to-head comparison of Rotary Position Embedding and GPT-style learned position embeddings. Both 1.3B-parameter models were trained for 100k steps on the Pile using Mesh Transformer JAX. No strong trend emerges in favor of either scheme, but hopefully someone will find these results useful regardless.
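For context on what is being swapped out: instead of adding a learned vector per absolute position, RoPE rotates each pair of query/key dimensions by a position-dependent angle (Su et al., 2021). Below is a minimal JAX sketch of that formulation; the function name, tensor layout, and rotating the full head dimension are illustrative assumptions, not the exact Mesh Transformer JAX code (which, among other differences, may rotate only part of each head's dimensions).

```python
import jax.numpy as jnp

def apply_rotary(x, base=10000.0):
    # x: [seq_len, n_heads, head_dim] queries or keys; head_dim must be even.
    seq_len, _, head_dim = x.shape
    # Per-pair rotation frequencies: theta_i = base^(-2i / head_dim).
    inv_freq = 1.0 / (base ** (jnp.arange(0, head_dim, 2) / head_dim))
    # Rotation angle for position m and pair i is m * theta_i.
    angles = jnp.arange(seq_len, dtype=jnp.float32)[:, None] * inv_freq[None, :]
    sin = jnp.sin(angles)[:, None, :]   # [seq_len, 1, head_dim // 2]
    cos = jnp.cos(angles)[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]  # even / odd dimension pairs
    # Standard 2-D rotation of each (x1, x2) pair by its angle.
    out = jnp.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return out.reshape(x.shape)  # re-interleave the rotated pairs
```

Because the rotation angle depends only on a token's position, the dot product between a rotated query and a rotated key depends only on their relative offset, which is the property RoPE is built around.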
| Task | Metric | Learned | Rotary |
|------|--------|---------|--------|
| lambada | ppl | 7.940 ± 0.208 | 7.156 ± 0.208 |
| lambada | acc | 0.556 ± 0.007 | 0.567 ± 0.007 |
| piqa | acc | 0.700 ± 0.011 | 0.714 ± 0.011 |
| piqa | acc_norm | 0.693 ± 0.011 | 0.709 ± 0.011 |
| hellaswag | acc | 0.376 ± 0.005 | 0.389 ± 0.005 |
| hellaswag | acc_norm | 0.472 ± 0.005 | 0.488 ± 0.005 |
| winogrande | acc | 0.540 ± 0.014 | 0.571 ± 0.014 |
| mathqa | acc | 0.231 ± 0.008 | 0.230 ± 0.008 |
| mathqa | acc_norm | 0.234 ± 0.008 | 0.227 ± 0.008 |
| pubmedqa | acc | 0.599 ± 0.015 | 0.583 ± 0.015 |
| boolq | acc | 0.575 ± 0.009 | 0.614 ± 0.009 |
| anli_r3 | acc | 0.344 ± 0.014 | 0.351 ± 0.014 |
| openbookqa | acc | 0.198 ± 0.018 | 0.206 ± 0.018 |
| openbookqa | acc_norm | 0.316 ± 0.021 | 0.330 ± 0.021 |
| triviaqa | acc | 0.041 ± 0.002 | 0.026 ± 0.002 |
| arc_challenge | acc | 0.235 ± 0.012 | 0.230 ± 0.012 |
| arc_challenge | acc_norm | 0.260 ± 0.013 | 0.272 ± 0.013 |
| arc_easy | acc | 0.564 ± 0.010 | 0.568 ± 0.010 |
| arc_easy | acc_norm | 0.505 ± 0.010 | 0.486 ± 0.010 |
| cb | acc | 0.375 ± 0.065 | 0.357 ± 0.065 |
| cola | mcc | 0.042 ± 0.034 | 0.022 ± 0.034 |
| copa | acc | 0.730 ± 0.044 | 0.730 ± 0.044 |
| ethics_cm | acc | 0.491 ± 0.008 | 0.480 ± 0.008 |
| ethics_deontology | acc | 0.497 ± 0.008 | 0.497 ± 0.008 |
| ethics_justice | acc | 0.501 ± 0.010 | 0.501 ± 0.010 |
| ethics_utilitarianism | acc | 0.497 ± 0.007 | 0.493 ± 0.007 |
| ethics_virtue | acc | 0.200 ± 0.006 | 0.200 ± 0.006 |
| headqa | acc | 0.227 ± 0.008 | 0.224 ± 0.008 |
| headqa | acc_norm | 0.270 ± 0.008 | 0.271 ± 0.008 |
| logiqa | acc | 0.221 ± 0.016 | 0.215 ± 0.016 |
| logiqa | acc_norm | 0.293 ± 0.018 | 0.283 ± 0.018 |
| mnli | acc | 0.344 ± 0.005 | 0.344 ± 0.005 |
| mnli_mismatched | acc | 0.345 ± 0.005 | 0.349 ± 0.005 |
| mrpc | acc | 0.684 ± 0.023 | 0.684 ± 0.023 |
| mrpc | f1 | 0.812 ± 0.017 | 0.812 ± 0.017 |
| qa4mre_2011 | acc | 0.392 ± 0.045 | 0.358 ± 0.045 |
| qa4mre_2011 | acc_norm | 0.450 ± 0.045 | 0.433 ± 0.045 |
| qa4mre_2012 | acc | 0.287 ± 0.036 | 0.312 ± 0.036 |
| qa4mre_2012 | acc_norm | 0.394 ± 0.039 | 0.400 ± 0.039 |
| qa4mre_2013 | acc | 0.335 ± 0.028 | 0.335 ± 0.028 |
| qa4mre_2013 | acc_norm | 0.352 ± 0.028 | 0.349 ± 0.028 |
| qnli | acc | 0.498 ± 0.007 | 0.517 ± 0.007 |
| qqp | acc | 0.370 ± 0.002 | 0.368 ± 0.002 |
| qqp | f1 | 0.538 ± 0.003 | 0.538 ± 0.003 |
| race | acc | 0.345 ± 0.015 | 0.343 ± 0.015 |
| record | f1 | 0.805 ± 0.004 | 0.813 ± 0.004 |
| record | em | 0.797 ± 0.004 | 0.805 ± 0.004 |
| rte | acc | 0.538 ± 0.030 | 0.523 ± 0.030 |
| sciq | acc | 0.867 ± 0.011 | 0.865 ± 0.011 |
| sciq | acc_norm | 0.796 ± 0.013 | 0.771 ± 0.013 |
| sst | acc | 0.572 ± 0.017 | 0.519 ± 0.017 |
| webqs | acc | 0.021 ± 0.003 | 0.006 ± 0.003 |
| wic | acc | 0.500 ± 0.020 | 0.498 ± 0.020 |
| wnli | acc | 0.437 ± 0.059 | 0.549 ± 0.059 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| wsc273 | acc | 0.722 ± 0.027 | 0.736 ± 0.027 |
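A note on reading the Metric column: the task names and the acc/acc_norm pairing match the conventions of EleutherAI's lm-evaluation-harness (an assumption about the tooling, not stated above). There, acc scores a multiple-choice item by picking the answer with the highest log-likelihood, while acc_norm first normalizes each answer's log-likelihood by its length, so the two metrics can disagree, as they do in several rows above. A minimal sketch with made-up numbers:

```python
import jax.numpy as jnp

# Illustrative per-choice log-likelihoods and lengths for one
# multiple-choice item (made-up numbers, not drawn from the table).
logliks = jnp.array([-12.3, -9.8, -15.1, -11.0])  # log p(choice | context)
lengths = jnp.array([24.0, 20.0, 18.0, 40.0])     # choice length in bytes

pred_acc = int(jnp.argmax(logliks))                 # "acc" picks choice 1
pred_acc_norm = int(jnp.argmax(logliks / lengths))  # "acc_norm" picks choice 3
```

Length normalization counteracts the bias toward short answers that raw log-likelihood scoring introduces, since longer continuations accumulate more negative log-probability.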