Show HN: Mdarena – Benchmark your Claude.md against your own PRs
Article URL: https://github.com/HudsonGri/mdarena Comments URL: https://news.ycombinator.com/item?id=47655078 Points: 8 # Comments: 1
Benchmark your CLAUDE.md against your own PRs.
Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks from your actual codebase.
Quick Start
```shell
pip install mdarena

# Mine 50 merged PRs into a test set
mdarena mine owner/repo --limit 50 --detect-tests

# Benchmark multiple CLAUDE.md files + baseline (no context)
mdarena run -c claude_v1.md -c claude_v2.md -c agents.md

# See who wins
mdarena report
```
How It Works
```
mdarena mine   -> Fetch merged PRs, filter, build task set
                  Auto-detect test commands from CI/package files
mdarena run    -> For each task x condition:
                  - Checkout repo at pre-PR commit
                  - Baseline: all CLAUDE.md files stripped
                  - Context: inject CLAUDE.md, let Claude discover it
                  - Run tests if available, capture git diff
mdarena report -> Compare patches against gold (actual PR diff)
                  - Test pass/fail (same as SWE-bench)
                  - File/hunk overlap, cost, tokens
                  - Statistical significance (paired t-test)
```
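The paired t-test that `mdarena report` mentions pairs the two conditions' scores on the same task, so per-task difficulty cancels out. A minimal stdlib sketch (the function name and the example score lists are hypothetical, not mdarena's API):

```python
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples (H0: mean of a - b is zero).

    Returns (t, degrees_of_freedom). Compare |t| against the t
    distribution with n-1 degrees of freedom for a p-value.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Hypothetical per-task resolution scores (1.0 = tests pass) for two conditions
baseline = [0.0, 1.0, 0.0, 1.0, 0.0]
with_md  = [1.0, 1.0, 0.0, 1.0, 1.0]
t, df = paired_t_statistic(with_md, baseline)
```

Pairing matters here because the same 20-50 tasks are reused across every condition; an unpaired test would throw that shared structure away and need far more tasks to reach significance.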
Test Execution
mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does it.
```shell
# Auto-detect from CI/CD
mdarena mine owner/repo --detect-tests

# Or specify manually
mdarena mine owner/repo --test-cmd "make test" --setup-cmd "npm install"
```
Parses .github/workflows/*.yml, package.json, pyproject.toml, Cargo.toml, and go.mod. When tests aren't available, it falls back to diff overlap scoring.
Monorepo Support
Pass a directory to benchmark a full CLAUDE.md tree:
```shell
mdarena run -c ./configs-v1/ -c ./configs-v2/
```
Each directory mirrors your repo structure. Baseline strips ALL CLAUDE.md and AGENTS.md files from the entire tree.
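The baseline condition amounts to a recursive delete of instruction files in the sandboxed checkout. A minimal sketch, assuming the sandbox is a plain directory tree (the function name is hypothetical):

```python
from pathlib import Path

def strip_instruction_files(sandbox: str) -> list[str]:
    """Remove every CLAUDE.md and AGENTS.md under sandbox; return removed paths."""
    removed = []
    for name in ("CLAUDE.md", "AGENTS.md"):
        for path in Path(sandbox).rglob(name):
            path.unlink()
            removed.append(str(path))
    return removed
```

Stripping the whole tree (not just the root file) is what makes the monorepo comparison fair: otherwise per-directory guidance would leak into the "no context" condition.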
Real-world Results
We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, existing CLAUDE.md, hand-written alternative). Patches graded against real test suites. Not string matching, not LLM-as-judge.
Key findings:
- The existing CLAUDE.md improved test resolution by ~27% over the bare baseline
- A consolidated alternative that merged all per-directory guidance into one file performed no better than no CLAUDE.md at all
- On hard tasks, per-directory instruction files gave the agent targeted context, while the consolidated version introduced noise that caused regressions
The winning CLAUDE.md wasn't the longest or most detailed. It was the one that put the right context in front of the agent at the right time.
SWE-bench Compatible
```shell
# Import SWE-bench tasks
pip install datasets
mdarena load-swebench lite --limit 50
mdarena run -c my_claude.md

# Or export your tasks as SWE-bench JSONL
mdarena export-swebench
```
Security
Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via shell=True, and Claude Code runs with --dangerously-skip-permissions). Sandboxes are isolated temp directories under /tmp, but processes run as your user.
Benchmark integrity: Because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: git archive exports a snapshot at base_commit into a fresh single-commit repo. Future commits don't exist in the object database at all. See tests/test_isolated_checkout.py for the integrity assertions.
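The history-free checkout described above can be sketched with plain `git archive` plus a re-init. This is an illustration of the technique, not mdarena's implementation; the function name is hypothetical, and it assumes `git` and `tar` are on PATH:

```python
import subprocess
import tempfile

def isolated_checkout(src_repo: str, base_commit: str) -> str:
    """Export a snapshot at base_commit into a fresh single-commit repo.

    Future commits never enter the sandbox's object database, so an agent
    cannot walk git history forward to find the gold patch.
    """
    sandbox = tempfile.mkdtemp(prefix="mdarena-")
    # `git archive` serializes only the tree at base_commit -- no history.
    snapshot = subprocess.run(
        ["git", "-C", src_repo, "archive", base_commit],
        check=True, capture_output=True,
    ).stdout
    subprocess.run(["tar", "-x", "-C", sandbox], input=snapshot, check=True)
    # Re-init so the sandbox is a valid repo with exactly one commit.
    subprocess.run(["git", "init", "-q"], cwd=sandbox, check=True)
    subprocess.run(["git", "add", "-A"], cwd=sandbox, check=True)
    subprocess.run(
        ["git", "-c", "user.email=mdarena@local", "-c", "user.name=mdarena",
         "commit", "-q", "-m", "snapshot"],
        cwd=sandbox, check=True,
    )
    return sandbox
```

Compared with a shallow clone, this is stronger: shallow clones can still fetch missing objects on demand, whereas here the future commits were never serialized at all.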
Prerequisites
- Python 3.11+
- gh CLI (authenticated)
- claude CLI (Claude Code)
- git
Commands
| Command | Description |
| --- | --- |
| `mdarena mine` | Mine merged PRs into a task set |
| `mdarena mine --detect-tests` | Mine with auto-detected test extraction |
| `mdarena run -c file.md` | Benchmark a single CLAUDE.md |
| `mdarena run -c a.md -c b.md` | Compare multiple files head-to-head |
| `mdarena run --no-run-tests` | Skip test execution; diff overlap only |
| `mdarena report` | Analyze results, show comparison |
| `mdarena load-swebench [dataset]` | Import SWE-bench tasks |
| `mdarena export-swebench` | Export tasks as SWE-bench JSONL |
Development
```shell
git clone https://github.com/HudsonGri/mdarena.git
cd mdarena
uv sync
uv run pytest
uv run ruff check src/
```
Roadmap
See ROADMAP.md.
License
MIT. See LICENSE.