
Strategic Candidacy in Generative AI Arenas

arXiv, March 31, 2026

Authors: Chris Hays, Rachel Li, Bailey Flanigan, Manish Raghavan


Abstract: AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
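The clone concern described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's model, its calibrated simulations, or the YRWR mechanism; it is a minimal illustration under stated assumptions: every clone has the same true win probability against a fixed reference opponent, each clone plays the same number of battles, and a clone's score is its empirical win rate. The function names and parameter values (`best_clone_win_rate`, `mean_best_win_rate`, `n_battles`, `p_win`) are hypothetical and chosen only for this example.

```python
import random


def best_clone_win_rate(k, n_battles, p_win, rng):
    """Return the highest empirical win rate among k identical clones.

    Assumption: each clone independently wins each battle with
    probability p_win, so all clones have the same true quality.
    """
    best = 0.0
    for _ in range(k):
        wins = sum(rng.random() < p_win for _ in range(n_battles))
        best = max(best, wins / n_battles)
    return best


def mean_best_win_rate(k, n_battles=200, p_win=0.5, trials=500, seed=0):
    """Average the best clone's empirical win rate over many repetitions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += best_clone_win_rate(k, n_battles, p_win, rng)
    return total / trials


if __name__ == "__main__":
    # The true win rate is 0.5 for every clone; only the number of clones varies.
    for k in (1, 2, 5, 10):
        print(f"clones={k:2d}  mean best empirical win rate={mean_best_win_rate(k):.4f}")
```

In this toy setting the best clone's estimated win rate tends to rise with the number of clones even though true quality never changes, because the maximum of several noisy estimates is biased upward. That selection effect is the kind of randomness the abstract says producers could exploit.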

Comments: 43 pages, 5 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

Cite as: arXiv:2603.26891 [cs.LG] (or arXiv:2603.26891v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.26891

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Chris Hays
[v1] Fri, 27 Mar 2026 18:12:58 UTC (1,723 KB)
