A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation
arXiv:2603.26137v1 Announce Type: cross
Authors: Xianpeng (Simon) Sun, Haonan Sun, Tian Yu, Sheng Ma, Qincheng Zhang, Lifei Rao, Chen Tian
Abstract: Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
Comments: 10 pages, 10 figures, 4 tables
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.26137 [cs.SE]
(or arXiv:2603.26137v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2603.26137
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Simon Sun
[v1] Fri, 27 Mar 2026 07:46:18 UTC (537 KB)
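
The abstract's two mechanical ingredients, the half-open evaluation window (T0, T1] and the file-level F1 score, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `PullRequest` record, its field names, and the function names are hypothetical, and the sketch assumes that file-level F1 means set-based F1 between the files an agent modified and the files changed in the merged pull request, which the abstract does not spell out.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record; field names are illustrative, not from the paper."""
    number: int
    merged_at: datetime
    changed_files: frozenset[str]


def select_eval_prs(prs, t0, t1):
    """Keep only PRs merged in the half-open interval (T0, T1].

    Anything merged at or before T0 belongs to the knowledge-building side,
    so it must not appear as an evaluation task.
    """
    return [pr for pr in prs if t0 < pr.merged_at <= t1]


def file_level_f1(predicted, actual):
    """Set-based F1 over file paths (assumed definition, see lead-in)."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Made-up example data, for illustration only.
t0 = datetime(2025, 1, 1)
t1 = datetime(2025, 3, 1)
prs = [
    PullRequest(1, datetime(2024, 12, 31), frozenset({"a.py"})),  # before T0
    PullRequest(2, datetime(2025, 1, 1), frozenset({"b.py"})),    # exactly T0: excluded
    PullRequest(3, datetime(2025, 2, 1), frozenset({"c.py", "d.py"})),
    PullRequest(4, datetime(2025, 3, 1), frozenset({"e.py"})),    # exactly T1: included
]

eval_prs = select_eval_prs(prs, t0, t1)
print([pr.number for pr in eval_prs])                          # [3, 4]
print(file_level_f1({"c.py", "x.py"}, prs[2].changed_files))   # 0.5
```

In the matched A/B setting described in the abstract, the same scoring function would be applied to both runs of the agent (with and without repository-derived knowledge) on the same task list, so that any difference in score is attributable to the knowledge condition alone.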