
Show HN: Mdarena – Benchmark your Claude.md against your own PRs

Hacker News Top · by HudsonGri · April 5, 2026 · 4 min read

Article URL: https://github.com/HudsonGri/mdarena
Comments URL: https://news.ycombinator.com/item?id=47655078
Points: 8 · Comments: 1

Benchmark your CLAUDE.md against your own PRs.

Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks from your actual codebase.

Quick Start

pip install mdarena

# Mine 50 merged PRs into a test set
mdarena mine owner/repo --limit 50 --detect-tests

# Benchmark multiple CLAUDE.md files + baseline (no context)
mdarena run -c claude_v1.md -c claude_v2.md -c agents.md

# See who wins
mdarena report

How It Works

mdarena mine -> Fetch merged PRs, filter, build task set; auto-detect test commands from CI/package files

mdarena run -> For each task x condition:

  • Checkout repo at pre-PR commit
  • Baseline: all CLAUDE.md files stripped
  • Context: inject CLAUDE.md, let Claude discover it
  • Run tests if available, capture git diff

mdarena report -> Compare patches against gold (actual PR diff)

  • Test pass/fail (same as SWE-bench)
  • File/hunk overlap, cost, tokens
  • Statistical significance (paired t-test)
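The significance test in the report step can be illustrated with a paired t-test over per-task scores. This is a minimal stdlib sketch, not mdarena's actual implementation; the scores and helper name are hypothetical:

```python
import statistics

def paired_t_statistic(baseline, treatment):
    """t-statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    diffs = [b - a for a, b in zip(baseline, treatment)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample std dev of the differences
    return mean_d / (sd_d / n ** 0.5)

# Hypothetical per-task resolution scores (1 = tests pass, 0 = fail)
baseline = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
with_md  = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
t = paired_t_statistic(baseline, with_md)
# For df = 9, the two-sided 5% critical value is ~2.262;
# |t| above that suggests the CLAUDE.md effect is significant.
print(round(t, 3))  # → 2.449
```

Pairing matters here: the same tasks are run under each condition, so per-task differences cancel out task difficulty.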

Test Execution

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does it.

# Auto-detect from CI/CD
mdarena mine owner/repo --detect-tests

# Or specify manually
mdarena mine owner/repo --test-cmd "make test" --setup-cmd "npm install"

mdarena parses .github/workflows/*.yml, package.json, pyproject.toml, Cargo.toml, and go.mod. When tests aren't available, it falls back to diff-overlap scoring.
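One plausible form of the diff-overlap fallback is a Jaccard score over the files each patch touches. This is an illustrative sketch, not mdarena's code (the README also mentions hunk overlap, which isn't shown here):

```python
import re

def changed_files(patch: str) -> set[str]:
    """Extract the set of file paths touched by a unified diff."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", patch, flags=re.MULTILINE))

def file_overlap(agent_patch: str, gold_patch: str) -> float:
    """Jaccard similarity between the files each patch touches."""
    a, g = changed_files(agent_patch), changed_files(gold_patch)
    if not a and not g:
        return 1.0
    return len(a & g) / len(a | g)

agent = "--- a/src/app.py\n+++ b/src/app.py\n@@ -1 +1 @@\n-x\n+y\n"
gold = ("--- a/src/app.py\n+++ b/src/app.py\n@@ -1 +1 @@\n-x\n+z\n"
        "--- a/tests/test_app.py\n+++ b/tests/test_app.py\n@@ -1 +1 @@\n-a\n+b\n")
print(file_overlap(agent, gold))  # → 0.5
```

Here the agent edited the right source file but missed the test file the gold PR changed, so the score is 0.5 rather than 1.0.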

Monorepo Support

Pass a directory to benchmark a full CLAUDE.md tree:

mdarena run -c ./configs-v1/ -c ./configs-v2/

Each directory mirrors your repo structure. Baseline strips ALL CLAUDE.md and AGENTS.md files from the entire tree.
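The baseline's "strip ALL context files" step amounts to walking the tree and deleting every CLAUDE.md and AGENTS.md. A minimal sketch of that idea (the function name is hypothetical, not mdarena's API):

```python
from pathlib import Path
import tempfile

def strip_context_files(repo_root: Path) -> list[Path]:
    """Remove every CLAUDE.md / AGENTS.md in the tree (the baseline condition)."""
    removed = []
    for name in ("CLAUDE.md", "AGENTS.md"):
        for f in repo_root.rglob(name):
            f.unlink()
            removed.append(f.relative_to(repo_root))
    return removed

# Demo on a throwaway tree mimicking a monorepo
root = Path(tempfile.mkdtemp())
(root / "pkg").mkdir()
(root / "CLAUDE.md").write_text("root guidance")
(root / "pkg" / "AGENTS.md").write_text("pkg guidance")
removed = strip_context_files(root)
print(sorted(str(p) for p in removed))
```

Stripping at every depth matters in a monorepo: a leftover per-directory AGENTS.md would contaminate the "no context" condition.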

Real-world Results

We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, existing CLAUDE.md, hand-written alternative). Patches graded against real test suites. Not string matching, not LLM-as-judge.

Key findings:

  • The existing CLAUDE.md improved test resolution by ~27% over bare baseline

  • A consolidated alternative that merged all per-directory guidance into one file performed no better than no CLAUDE.md at all

  • On hard tasks, per-directory instruction files gave the agent targeted context, while the consolidated version introduced noise that caused regressions

The winning CLAUDE.md wasn't the longest or most detailed. It was the one that put the right context in front of the agent at the right time.

SWE-bench Compatible

# Import SWE-bench tasks
pip install datasets
mdarena load-swebench lite --limit 50
mdarena run -c my_claude.md

# Or export your tasks as SWE-bench JSONL
mdarena export-swebench
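SWE-bench JSONL is one JSON object per line using the dataset's field names. The sketch below shows a subset of the core fields (instance_id, repo, base_commit, problem_statement, patch); mdarena's exact export schema may differ, and the input task dict is hypothetical:

```python
import json

def to_swebench_jsonl(tasks, path):
    """Write tasks using core SWE-bench field names (subset shown)."""
    with open(path, "w") as fh:
        for t in tasks:
            fh.write(json.dumps({
                "instance_id": t["id"],
                "repo": t["repo"],
                "base_commit": t["base_commit"],
                "problem_statement": t["title"],
                "patch": t["gold_patch"],  # the merged PR's diff
            }) + "\n")

tasks = [{"id": "owner__repo-123", "repo": "owner/repo",
          "base_commit": "abc123", "title": "Fix off-by-one",
          "gold_patch": "--- a/f\n+++ b/f\n"}]
to_swebench_jsonl(tasks, "tasks.jsonl")
print(json.loads(open("tasks.jsonl").readline())["instance_id"])  # → owner__repo-123
```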

Security

Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via shell=True, and Claude Code runs with --dangerously-skip-permissions). Sandboxes are isolated temp directories under /tmp, but processes run as your user.

Benchmark integrity: Because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: git archive exports a snapshot at base_commit into a fresh single-commit repo. Future commits don't exist in the object database at all. See tests/test_isolated_checkout.py for the integrity assertions.
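The history-free checkout scheme described above can be sketched as a short sequence of shell steps: export the tree at base_commit with git archive, then initialize a fresh repo with a single commit. This is an illustration of the scheme, not mdarena's actual code, and the function name is hypothetical:

```python
def isolated_checkout_cmds(src_repo: str, base_commit: str, dest: str):
    """Command sequence (as argv lists) for a history-free snapshot:
    git archive exports only the tree at base_commit, and the fresh
    repo's object database contains no future commits to walk."""
    return [
        ["bash", "-c",
         f"git -C {src_repo} archive {base_commit} | tar -x -C {dest}"],
        ["git", "-C", dest, "init"],
        ["git", "-C", dest, "add", "-A"],
        ["git", "-C", dest, "commit", "-m", f"snapshot of {base_commit}"],
    ]

for cmd in isolated_checkout_cmds("/repos/myrepo", "abc123", "/tmp/task-1"):
    print(cmd)
```

Unlike a plain `git checkout base_commit`, which leaves every later commit reachable through the object database and tags, an archive-then-init snapshot physically contains nothing newer than base_commit.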

Prerequisites

  • Python 3.11+

  • gh CLI (authenticated)

  • claude CLI (Claude Code)

  • git

Commands

Command                            Description

mdarena mine                       Mine merged PRs into a task set
mdarena mine --detect-tests        Mine with auto-detected test extraction
mdarena run -c file.md             Benchmark a single CLAUDE.md
mdarena run -c a.md -c b.md        Compare multiple files head-to-head
mdarena run --no-run-tests         Skip test execution, diff overlap only
mdarena report                     Analyze results, show comparison
mdarena load-swebench [dataset]    Import SWE-bench tasks
mdarena export-swebench            Export tasks as SWE-bench JSONL

Development

git clone https://github.com/HudsonGri/mdarena.git
cd mdarena
uv sync
uv run pytest
uv run ruff check src/

Roadmap

See ROADMAP.md.

License

MIT. See LICENSE.
