Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessThis International Fact-Checking Day, use these 5 tips to spot AI-generated contentFast Company TechRio Receipt Protocol – Cryptographic Proof for AI ActionsHacker News AI TopNew AI testing method flags fairness risks in autonomous systemsTechXplore AI[D] Make. Big. Batch. Size.Reddit r/MachineLearningNew open source AI self driving testingHacker News AI TopAgentic AI and the next intelligence explosionHacker News AI TopThe miracle of AI agent-assisted learningHacker News AI TopGoogle launches Gemma 4, its "most intelligent" open model family, purpose-built for advanced reasoning and agentic workflows, under an Apache 2.0 license (The Keyword)TechmemeA Differentiable Programming System to Bridge Machine Learning and ScientificComputingDev.to AIWhy Your AI Copilot Builds the Wrong Thing (And How to Fix It)Dev.to AIThe productivity paradox of AI coding assistantsHacker News AI TopBuilding Sentinel Gate: A 3-Layer Security Pipeline for AI AgentsDev.to AIBlack Hat USADark ReadingBlack Hat AsiaAI BusinessThis International Fact-Checking Day, use these 5 tips to spot AI-generated contentFast Company TechRio Receipt Protocol – Cryptographic Proof for AI ActionsHacker News AI TopNew AI testing method flags fairness risks in autonomous systemsTechXplore AI[D] Make. Big. Batch. Size.Reddit r/MachineLearningNew open source AI self driving testingHacker News AI TopAgentic AI and the next intelligence explosionHacker News AI TopThe miracle of AI agent-assisted learningHacker News AI TopGoogle launches Gemma 4, its "most intelligent" open model family, purpose-built for advanced reasoning and agentic workflows, under an Apache 2.0 license (The Keyword)TechmemeA Differentiable Programming System to Bridge Machine Learning and ScientificComputingDev.to AIWhy Your AI Copilot Builds the Wrong Thing (And How to Fix It)Dev.to AIThe productivity paradox of AI coding assistantsHacker News AI TopBuilding Sentinel Gate: A 3-Layer Security Pipeline for AI AgentsDev.to AI
AI NEWS HUBbyEIGENVECTOREigenvector

ATime-Consistent Benchmark for Repository-Level Software Engineering Evaluation

arXivMarch 30, 202610 min read0 views
Source Quiz

arXiv:2603.26137v1 Announce Type: cross Abstract: Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language — Xianpeng (Simon), Sun, Haonan Sun, Tian Yu, Sheng Ma, Qincheng Zhang, Lifei Rao, Chen Tian

View PDF HTML (experimental)

Abstract:Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.

Comments: 10 pages, 10 figures, 4 tables

Subjects:

Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.26137 [cs.SE]

(or arXiv:2603.26137v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2603.26137

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Simon Sun [view email] [v1] Fri, 27 Mar 2026 07:46:18 UTC (537 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
ATime-Consi…researchpaperarxivaiartificial-…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 165 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!