Show HN: Initialize an AI Harness with Forge CLI
Article URL: https://github.com/samahlstrom/forge-cli Comments URL: https://news.ycombinator.com/item?id=47594802 Points: 1 # Comments: 1
AI agent pipeline scaffolding for Claude Code. Turns any repository into an orchestrated, multi-agent workspace with decomposition, risk classification, parallel execution, tri-agent adversarial evaluation, and automated delivery.
Built in Go. Cross-platform. Single binary.
Install
Homebrew (macOS / Linux)
brew install samahlstrom/tap/forge
Note: Always use the full tap path (samahlstrom/tap/forge). There is an unrelated tool also called forge in Homebrew core — brew install forge will install the wrong thing.
Go install
go install github.com/samahlstrom/forge-cli@latest
Requires Go 1.22+.
Download binary
Pre-built binaries for macOS, Linux, and Windows (amd64/arm64) are available on the GitHub Releases page.
Build from source
git clone https://github.com/samahlstrom/forge-cli.git
cd forge-cli
make install

This builds and installs forge to your $GOPATH/bin.
Quick start
New project from scratch
mkdir my-app && cd my-app && git init
forge init

The onboarding asks what language, framework, and project type you want, then generates everything.
Existing project
cd my-existing-project
forge init

Forge auto-detects your stack (language, framework, test runner, linter) and generates a harness that matches.
Projects with existing Claude Code configuration
Forge is designed to fit into repos that already have Claude Code set up. It won't clobber your work:
| File | Behavior on init |
| --- | --- |
| CLAUDE.md | Merge — appends a forge section between forge-owned delimiters. Your existing instructions stay untouched; re-init updates only the forge section. |
| .claude/settings.json | Merge — adds forge permissions and hooks to your existing config. Your custom entries are preserved. |
| forge.yaml | Skip if exists — your customized commands, risk keywords, and evaluation weights are never overwritten. |
| .forge/context/project.md | Skip if exists — your project context is yours. |
| .forge/agents/* | Overwrite — forge-owned agent definitions, updated on every init. |
| .forge/pipeline/* | Overwrite — forge-owned pipeline scripts, updated on every init. |
| .forge/hooks/* | Overwrite — forge-owned hooks, updated on every init. |
| .claude/skills/*/SKILL.md | Overwrite — forge-owned skill prompts. Your own custom skills in other directories are never touched. |
Use --force to overwrite everything, including forge.yaml and project.md.
From a spec document
mkdir my-app && cd my-app && git init
forge init --spec ~/Downloads/project-spec.pdf

Forge analyzes the spec with Claude, extracts project metadata (language, modules, architecture, constraints), and pre-fills everything. No manual onboarding needed.
From multiple spec documents
mkdir my-app && cd my-app && git init
forge ingest \
  ~/Downloads/architecture.md \
  ~/Downloads/engineering-backlog.md \
  ~/Downloads/database-schema.md
forge init

Forge combines multiple documents into one spec, analyzes it, and sets up the harness.
What it does
forge init scans your project, asks a few questions about what you're building, and generates a complete .forge/ harness with:
- Pipeline scripts — a resumable state machine that routes work through intake, classification, decomposition, plan review, execution, verification, evaluation, and delivery
- Specialist agents — architect, backend, frontend, quality, security, and visual-qa agents (only the ones your project needs)
- Tri-agent evaluation — three independent evaluator agents (Edgar, Code Quality, Um-Actually) score every implementation before delivery, with few-shot calibrated scoring
- Browser testing — Playwright visual smoke tests at mobile and desktop viewports, running automatically when frontend files change
- Risk classification — T1/T2/T3 tiers that determine how much decomposition and verification a task gets
- Task tracking — work tracked via bd (Dolt-backed issue tracker for AI agents)
- Hooks — pre-edit and post-edit hooks that enforce tracked work
- Skills — /deliver, /ingest, and /skill-creator commands for Claude Code
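For intuition, keyword-driven risk tiering can be sketched as a match over the work description. The keyword lists below are hypothetical; the real keywords are configurable in forge.yaml:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical keyword lists; the real ones live in forge.yaml.
var t3Keywords = []string{"auth", "payment", "migration", "security"}
var t2Keywords = []string{"api", "schema", "refactor"}

// classify assigns a risk tier to a work description:
// T3 (critical) outranks T2 (moderate), which outranks T1 (low).
func classify(desc string) string {
	lower := strings.ToLower(desc)
	for _, kw := range t3Keywords {
		if strings.Contains(lower, kw) {
			return "T3"
		}
	}
	for _, kw := range t2Keywords {
		if strings.Contains(lower, kw) {
			return "T2"
		}
	}
	return "T1"
}

func main() {
	fmt.Println(classify("Add JWT authentication")) // matches "auth": T3
	fmt.Println(classify("Rename a log message"))   // no keywords: T1
}
```

The tier then decides how much decomposition and verification the task receives downstream.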
Usage
After init, open Claude Code in your project:
/deliver — execute tracked work
/deliver "Add JWT authentication with role-based access"
The pipeline:
- Intake — parses and scores the work description for completeness
- Classify — assigns a risk tier (T1 low, T2 moderate, T3 critical)
- Decompose — the architect agent breaks complex work into parallel-safe subtasks
- Review plan — an independent reviewer validates the decomposition before execution
- Execute — dispatches subtasks to specialist agents wave by wave
- Verify — runs typecheck, lint, tests, anti-pattern checks, and Playwright browser smoke tests
- Evaluate — three evaluator agents score the implementation independently (must reach a 0.7 composite to pass, with up to 3 revision iterations)
- Deliver — creates a branch, commits, pushes, and opens a PR
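The stage sequence above can be sketched as a resumable pipeline, where an interrupted run restarts at the first stage that has not yet completed. This is a minimal illustration of the idea, not Forge's actual state machine (which persists state under .forge/state/):

```go
package main

import "fmt"

// Pipeline stages in delivery order, matching the steps above.
var stages = []string{
	"intake", "classify", "decompose", "review-plan",
	"execute", "verify", "evaluate", "deliver",
}

// resume returns the stages still to run, given the last stage
// that completed. An empty string means start from the beginning.
func resume(lastDone string) []string {
	for i, s := range stages {
		if s == lastDone {
			return stages[i+1:]
		}
	}
	return stages
}

func main() {
	// A run interrupted after "verify" picks up at "evaluate".
	fmt.Println(resume("verify")) // [evaluate deliver]
}
```

Persisting the checkpoint to disk after each stage is what makes the real pipeline survive interruption.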
/ingest — decompose a spec into a project plan
# Add specs to your project
forge ingest ~/Downloads/platform-spec.pdf

Or multiple documents at once
forge ingest architecture.md backlog.md schema.md
Then in Claude Code:
/ingest spec-a1b2
Multi-pass analysis:
- Extract — sections, requirements, constraints, data entities
- Map domains — group into modules with a dependency graph
- Decompose — epics, features, atomic tasks
- Identify skills — find repeated patterns worth automating
You review and refine the plan before any code is written. Then execute phase-by-phase through /deliver.
forge run — auto-pilot execution
# Seed tasks from an approved spec
forge seed spec-a1b2

Execute all tasks with parallel workers
forge run spec-a1b2 --concurrency 3
Dry run to see the plan
forge run spec-a1b2 --dry-run
Budget-limited execution
forge run spec-a1b2 --budget 50
Auto-pilot features:
- Parallel workers in isolated git worktrees
- Circuit breaker — pauses on repeated rate limits
- Idle timeout — kills hung tasks after a configurable number of seconds
- Phase gates — optional human review between phases
- Budget control — per-task spend limits
- Resumable — interrupted runs pick up where they left off
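The circuit-breaker behavior can be sketched in a few lines. The threshold and the reset-on-success policy here are assumptions for illustration, not Forge's exact implementation:

```go
package main

import "fmt"

// breaker trips (opens) after too many consecutive rate-limit
// errors, pausing further dispatch. Threshold is hypothetical.
type breaker struct {
	failures  int
	threshold int
}

// record notes the outcome of one task attempt. Any non-rate-limited
// outcome resets the failure streak (an assumed policy).
func (b *breaker) record(rateLimited bool) {
	if rateLimited {
		b.failures++
	} else {
		b.failures = 0
	}
}

// open reports whether execution should pause.
func (b *breaker) open() bool { return b.failures >= b.threshold }

func main() {
	b := &breaker{threshold: 3}
	b.record(true)
	b.record(true)
	b.record(true)
	fmt.Println(b.open()) // true: three straight rate limits pause the run
}
```

A production breaker would also add a cool-down period before retrying, so a paused run can resume automatically once limits clear.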
/skill-creator — generate custom skills
/skill-creator
Create new Claude Code skills for domain-specific workflows. The ingestion system can also auto-generate skills from patterns it finds in your spec.
Commands
| Command | Description |
| --- | --- |
| forge init | Initialize the harness in the current project |
| forge init --spec | Initialize from a spec document (PDF, markdown, text) |
| forge ingest | Add one or more spec documents for analysis |
| forge seed | Create beads tasks from an approved spec decomposition |
| forge run | Auto-pilot task execution with parallel workers |
| forge run-status | Check the status of a running auto-pilot execution |
| forge add | Install an addon (compliance-hipaa, compliance-soc2) |
| forge remove | Remove an addon |
| forge status | Show harness status, agents, and addons |
| forge doctor | Diagnose harness health |
| forge upgrade | Upgrade harness files to the latest version |
Tri-agent evaluation
Every implementation is reviewed by three independent evaluator agents before delivery. This follows Anthropic's generator-evaluator separation principle — the agents that write code never evaluate their own output.
| Evaluator | Weight | Focus |
| --- | --- | --- |
| Edgar | 35% | Adversarial edge cases — robustness, error handling, security surface, brittleness |
| Code Quality | 35% | Architecture fit, maintainability, performance, correctness beyond tests |
| Um-Actually | 30% | API correctness, framework conventions, documentation alignment, upgrade safety |
Each evaluator scores four dimensions from 0.0 to 1.0. The composite score must reach 0.7 to pass. If it doesn't, a revision brief is generated and the pipeline loops back to execution — up to 3 iterations before failing.
All evaluator prompts include few-shot calibration examples with full score breakdowns, ensuring consistent and well-anchored judgment across runs.
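Using the weights from the table above, the pass/fail check reduces to a weighted sum compared against the 0.7 threshold. A minimal sketch, assuming each evaluator's four dimensions have already been reduced to a single 0.0-1.0 score:

```go
package main

import "fmt"

// Evaluator weights, as documented in the table above.
var weights = map[string]float64{
	"edgar":        0.35,
	"code-quality": 0.35,
	"um-actually":  0.30,
}

// composite combines per-evaluator scores into a weighted total.
// A missing score counts as 0.0.
func composite(scores map[string]float64) float64 {
	total := 0.0
	for name, w := range weights {
		total += w * scores[name]
	}
	return total
}

func main() {
	scores := map[string]float64{
		"edgar": 0.8, "code-quality": 0.7, "um-actually": 0.6,
	}
	c := composite(scores)
	// 0.35*0.8 + 0.35*0.7 + 0.30*0.6 ≈ 0.705, just above the 0.7 bar.
	fmt.Printf("%.3f pass=%v\n", c, c >= 0.7)
}
```

A failing composite triggers a revision brief and loops the pipeline back to execution, up to the 3-iteration cap.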
Browser testing
Playwright visual smoke tests run automatically during verification when frontend files are modified. No addon required — browser testing is built into the core pipeline.
- Mobile viewport: 375x812 px
- Desktop viewport: 1440x900 px
- Tests affected routes based on the git diff
- Checks: HTTP status, horizontal overflow (mobile), screenshot capture
- Results written to .forge/state/screenshots/results.json
- Screenshots fed to evaluators as runtime evidence
Evaluators receive browser test results alongside the code diff, so they can assess visual regression, layout issues, and responsive behavior — not just static code quality.
What gets generated
my-project/
├── forge.yaml                  # Main configuration
├── CLAUDE.md                   # Agent instructions
├── .claude/
│   ├── settings.json           # Permissions and hooks
│   └── skills/
│       ├── deliver/SKILL.md        # /deliver command
│       ├── ingest/SKILL.md         # /ingest command
│       └── skill-creator/SKILL.md  # /skill-creator command
├── .forge/
│   ├── pipeline/               # State machine scripts
│   │   ├── orchestrator.sh
│   │   ├── intake.sh
│   │   ├── classify.md
│   │   ├── decompose.md
│   │   ├── review-plan.md
│   │   ├── execute.md
│   │   ├── verify.sh
│   │   ├── evaluate.md
│   │   ├── browser-smoke.sh
│   │   └── deliver.sh
│   ├── agents/                 # Specialist agent definitions
│   │   ├── architect.md
│   │   ├── backend.md          # (if project needs backend)
│   │   ├── frontend.md         # (if project needs frontend)
│   │   ├── quality.md
│   │   ├── security.md
│   │   ├── visual-qa.md
│   │   ├── edgar.md            # Evaluator: edge cases
│   │   ├── code-quality.md     # Evaluator: architecture
│   │   └── um-actually.md      # Evaluator: best practices
│   ├── context/                # Project knowledge
│   │   ├── stack.md            # Tech stack conventions
│   │   └── project.md          # Your project context
│   ├── hooks/                  # Claude Code lifecycle hooks
│   │   ├── pre-edit.sh
│   │   ├── post-edit.sh
│   │   └── session-start.sh
│   ├── specs/                  # Ingested spec documents
│   ├── addons/                 # Installed addon files
│   └── state/                  # Screenshots, transient state
└── .beads/                     # bd task tracking (Dolt database)

Supported stacks
Forge auto-detects and has presets for:
- TypeScript/JavaScript — Next.js, SvelteKit
- Python — FastAPI, Django, Flask
- Go — Gin, Chi, Fiber
Works with any project regardless of stack — presets just provide stack-specific conventions and verification commands.
Addons
forge add compliance-hipaa   # HIPAA security checks
forge add compliance-soc2    # SOC2 compliance verification

Requirements
- Git
- bd (task tracking — brew install beads)
- jq (JSON processing in pipeline scripts)
- gh CLI (PR creation)
- Claude Code (to run the generated harness)
- Node.js 18+ and Playwright (auto-installed for browser testing)
Development
# Build
make build

Install locally
make install
Run tests
make test
Run linter
make vet
Releases are automated via GoReleaser — push a version tag (git tag v0.3.0 && git push --tags) to trigger a GitHub release with cross-platform binaries and Homebrew tap update.
License
MIT