PRBench: End-to-end Paper Reproduction in Physics Research

HuggingFace Papers · March 29, 2026
PRBench evaluates AI agents' ability to reproduce scientific research by requiring them to implement algorithms from published papers and match original results, revealing significant challenges in formula implementation, debugging, and data accuracy. (7 upvotes on HuggingFace)

Published on Mar 29

Abstract

AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
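
The abstract reports a mean overall score and an end-to-end success rate derived from per-task scoring rubrics, but does not spell out the aggregation. The sketch below is one plausible way such numbers could be computed, not the authors' pipeline. The dimension names, the weights, the 0.9 success threshold, and the names `TaskResult` and `summarize` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical rubric dimensions and weights (assumptions, not from the paper).
WEIGHTS = {"methodology": 0.2, "code_correctness": 0.4, "data_accuracy": 0.4}


@dataclass
class TaskResult:
    """Rubric scores for one reproduction task, each in [0, 1]."""
    task_id: str
    scores: dict[str, float]

    def overall(self) -> float:
        # Weighted sum over the rubric dimensions.
        return sum(WEIGHTS[d] * self.scores[d] for d in WEIGHTS)

    def end_to_end_success(self, threshold: float = 0.9) -> bool:
        # Assumed criterion: every dimension must clear the threshold.
        return all(self.scores[d] >= threshold for d in WEIGHTS)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task results into benchmark-level metrics."""
    return {
        "mean_overall": mean(r.overall() for r in results),
        "end_to_end_success_rate": mean(
            1.0 if r.end_to_end_success() else 0.0 for r in results
        ),
    }


if __name__ == "__main__":
    demo = [
        TaskResult("task_a", {"methodology": 1.0, "code_correctness": 0.5, "data_accuracy": 0.0}),
        TaskResult("task_b", {"methodology": 1.0, "code_correctness": 1.0, "data_accuracy": 1.0}),
    ]
    print(summarize(demo))
```

Under this assumed scheme, an agent can earn partial credit on every task yet still record a zero end-to-end success rate, which is the pattern the abstract describes.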

Get this paper in your agent:

hf papers read 2603.27646

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
