
Show HN: SkillCompass – Diagnose and Improve AI Agent Skills Across 6 Dimensions

Hacker News AI Top · by Evol-ai · April 3, 2026 · 7 min read

SkillCompass is an evaluation-driven skill evolution engine for Claude Code and OpenClaw. It scores skills across 6 dimensions (structure, trigger, security, functional, comparative, uniqueness), pinpoints the weakest one, fixes it, then moves on to the next weakest. It also detects when model improvements make a skill unnecessary. Runs locally; requires Node.js v18+ for the local validators. Comments URL: https://news.ycombinator.com/item?id=47624322 Points: 2 # Comments: 0

Your skill could be much better. But better how? Which part? In what order?

GitHub · SKILL.md · Schemas · Changelog

What It Is

A local skill quality and security evaluator for Claude Code / OpenClaw: six-dimension scoring, guided improvement, version management.

Pain It Solves

Turns "tweak and hope" into diagnose → targeted fix → verified improvement.

Use in 30 Seconds

/skill-compass evaluate {skill} — an instant quality report showing exactly what's weakest and what to improve next.

Find the weakest link → fix it → prove it worked → next weakness → repeat.

Start read-only with /eval-skill or /eval-security. Write-capable flows are explicit opt-in.

Who This Is For

For

  • Anyone maintaining agent skills and wanting measurable quality

  • Developers who want directed improvement — not guesswork, but knowing exactly which dimension to fix next

  • Teams needing a quality gate — any tool that edits a skill gets auto-evaluated

Not For

  • General code review or runtime debugging

  • Creating new skills from scratch (use skill-creator)

  • Evaluating non-skill files

Quick Start

Prerequisites: Claude Opus 4.6 (complex reasoning + consistent scoring) · Node.js v18+ (local validators)

Claude Code

git clone https://github.com/Evol-ai/SkillCompass.git
cd SkillCompass && npm install

User-level (all projects)

rsync -a --exclude='.git' . ~/.claude/skills/skill-compass/

Or project-level (current project only)

rsync -a --exclude='.git' . .claude/skills/skill-compass/

First run: Claude Code will request permission for node -e and node commands. Select "Allow always" to avoid repeated prompts. SkillCompass may also offer a ~5-second local inventory on first use, then continue your original command automatically.

OpenClaw

git clone https://github.com/Evol-ai/SkillCompass.git
cd SkillCompass && npm install

Follow OpenClaw skill installation docs for your setup

rsync -a --exclude='.git' . /skill-compass/

If your OpenClaw skills live outside the default scan roots, add them to skills.load.extraDirs in ~/.openclaw/openclaw.json:

{
  "skills": {
    "load": {
      "extraDirs": [""]
    }
  }
}

ClawHub Canary Workflow

Use a dedicated canary slug when you want a real platform-side publish check without touching the live skill-compass listing.

Principles

  • Reuse a single shadow slug: skill-compass-canary

  • Always pass an explicit canary version such as 1.0.5-canary.1

  • In PowerShell, use clawhub.cmd rather than clawhub

  • After validation, hide the canary entry so it does not remain publicly searchable

Prepare

node scripts/release/prepare-clawhub-canary.js --version 1.0.5-canary.1

This runs the local ClawHub preflight checks, creates a clean upload bundle in clawhub-canary-upload/, excludes optional example guides from the publish artifact, and writes the publish checklist to clawhub-canary-publish.txt.

Publish

clawhub.cmd publish ".\clawhub-canary-upload" --slug skill-compass-canary --name "SkillCompass Canary (Internal)" --version 1.0.5-canary.1 --changelog "internal canary validation" --tags canary

Validate And Hide

clawhub.cmd inspect skill-compass-canary --no-input
clawhub.cmd search skill compass --no-input
clawhub.cmd hide skill-compass-canary --yes

Notes:

  • ClawHub currently applies tags per slug. A canary publish does not replace the live skill-compass entry, but the canary slug can still appear in search results until hidden.

  • Keep canary versions explicit and monotonic. Do not fall back to the repo's local 1.0.0 metadata for repeat publishes.

Usage

Two ways to invoke SkillCompass:

/skill-compass + natural language

/skill-compass evaluate ./my-skill/SKILL.md
/setup
/skill-compass improve the nano-banana skill
/skill-compass security scan ./my-skill/SKILL.md
/skill-compass audit all skills in .claude/skills/
/skill-compass compare my-skill 1.0.0 vs 1.0.0-evo.2
/skill-compass roll back my-skill to previous version

Or just talk to Claude

No slash command needed — Claude automatically recognizes the intent:

Evaluate the nano-banana skill for me
Show me my installed skills
Improve this skill — fix the weakest dimension
Scan all skills in .claude/skills/ for security issues

Capability reference

| Intent | Maps to |
| --- | --- |
| Show my installed skills / first-run inventory | setup |
| Evaluate / score / review a skill | eval-skill |
| Improve / fix / upgrade a skill | eval-improve |
| Security scan a skill | eval-security |
| Batch audit a directory | eval-audit |
| Compare two versions | eval-compare |
| Merge with upstream | eval-merge |
| Roll back to previous version | eval-rollback |

/setup is the interactive inventory flow. On first use, the same inventory can be offered as a brief helper before another command, but it should always return to the original command instead of replacing it.

What It Does

The score isn't the point — the direction is. You instantly see which dimension is the bottleneck and what to do about it.

Each /eval-improve round follows a closed loop: fix the weakest → re-evaluate → verify improvement → next weakest. No fix is saved unless the re-evaluation confirms it actually helped.

Six-Dimension Evaluation Model

| ID | Dimension | Weight | What it evaluates |
| --- | --- | --- | --- |
| D1 | Structure | 10% | Frontmatter validity, markdown format, declarations |
| D2 | Trigger | 15% | Activation quality, rejection accuracy, discoverability |
| D3 | Security | 20% | Secrets, injection, permissions, exfiltration |
| D4 | Functional | 30% | Core quality, edge cases, output stability, error handling |
| D5 | Comparative | 15% | Value over direct prompting (with vs. without the skill) |
| D6 | Uniqueness | 10% | Overlap with similar skills, model supersession risk |

overall_score = round((D1×0.10 + D2×0.15 + D3×0.20 + D4×0.30 + D5×0.15 + D6×0.10) × 10)
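As a sketch of the weighting above (the function name and the shape of the dimension-score object are illustrative, not the tool's actual API):

```javascript
// Weighted overall score from six per-dimension scores (each 0-10).
// Weights mirror the table above; names here are illustrative only.
const WEIGHTS = { D1: 0.10, D2: 0.15, D3: 0.20, D4: 0.30, D5: 0.15, D6: 0.10 };

function overallScore(dims) {
  // dims: e.g. { D1: 8, D2: 7, ... } — each dimension scored 0-10
  const weighted = Object.entries(WEIGHTS)
    .reduce((sum, [id, w]) => sum + dims[id] * w, 0);
  return Math.round(weighted * 10); // scale to 0-100
}

console.log(overallScore({ D1: 8, D2: 7, D3: 9, D4: 6, D5: 7, D6: 8 })); // → 73
```

Note that D4 (Functional) dominates at 30%: a weak functional core drags the overall score down faster than any other dimension.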

| Verdict | Condition |
| --- | --- |
| PASS | score ≥ 70 AND D3 passes |
| CAUTION | score 50–69, or D3 has High findings |
| FAIL | score < 50, or D3 has Critical findings (gate override) |
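The gate logic implied by the verdict table can be sketched as follows (the function shape and severity strings are assumptions for illustration, not SkillCompass internals):

```javascript
// Map an overall score (0-100) plus D3 security findings to a verdict.
// Security acts as a gate: a Critical finding overrides any score.
function verdict(score, d3) {
  // d3: { pass: boolean, highestSeverity: 'none' | 'High' | 'Critical' }
  if (score < 50 || d3.highestSeverity === 'Critical') return 'FAIL';
  if (score < 70 || d3.highestSeverity === 'High') return 'CAUTION';
  return d3.pass ? 'PASS' : 'CAUTION';
}
```

The ordering matters: the Critical check runs first so a high-scoring skill with a Critical security finding still fails, which is the "gate override" behavior.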

Features

Core Loop

| Feature | Description |
| --- | --- |
| Directed Evolution | Diagnose → targeted fix → verify → next weakness. Not random patching. |
| Closed-Loop Improve | /eval-improve auto re-evaluates after each fix. Only saves if improved and nothing regressed. |
| Scope Control | --scope gate = D1+D3 (~8K tokens). --scope target --dimension D4 = single dimension + gate. |
| Tiered Verification | L0 syntax → L1 single dimension → L2 full re-eval → L3 cross-skill. |
| D1+D2 Grouping | When both metadata dimensions are weak (≤5), they are improved together, since they share the frontmatter layer. |

Safety

| Feature | Description |
| --- | --- |
| Pre-Accept Gate | Hooks auto-scan every SKILL.md write (D1 + D3 checks). Zero config. Warns, never blocks. |
| Pre-Eval Scan | Static analysis blocks malicious code, exfiltration, and prompt injection before the LLM evaluation. |
| Output Guard | Validates improvement output for URL injection, dangerous commands, and size anomalies. |
| Auto-Rollback | If any dimension drops more than 2 points after an improvement, the changes are discarded. |
| Local Validators | JS-based D1/D2/D3 validators run locally, saving ~60% of tokens on clear-cut issues. |
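To give a flavor of what a local structural (D1) check can look like, here is a minimal sketch; the function name, the regex, and the required keys are assumptions for illustration, not the shipped validators:

```javascript
// Minimal D1-style structural check for a SKILL.md string:
// verify the YAML frontmatter block exists and contains a few
// required keys. The key list here is an illustrative assumption.
function checkStructure(skillMd) {
  const issues = [];
  const match = skillMd.match(/^---\n([\s\S]*?)\n---/);
  if (!match) {
    issues.push('missing frontmatter block');
    return issues;
  }
  for (const key of ['name', 'description']) {
    if (!new RegExp(`^${key}:`, 'm').test(match[1])) {
      issues.push(`frontmatter missing "${key}"`);
    }
  }
  return issues; // empty array = structurally clean
}
```

Checks like this are cheap and deterministic, which is why running them locally before any LLM call saves tokens on clear-cut issues.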

Smart Optimization

| Feature | Description |
| --- | --- |
| Correction Tracking | Detects repeated manual fixes, maps them to dimensions, and prompts an update at the next invocation. |
| Feedback Integration | Real usage data is fused into scores: 60% static + 40% feedback signals. |
| Multi-Language Triggers | Detects your language, tests trigger accuracy in it, and fixes multilingual gaps. |
| Obsolescence Detection | Compares the skill against the base model and tracks supersession risk across model updates. |
| Skill Type Detection | Auto-classifies skills as atom / composite / meta; evaluation adapts accordingly. |

Version & Scale

| Feature | Description |
| --- | --- |
| Version Management | SHA-256-hashed snapshots; roll back to any version at any time. |
| Three-Way Merge | Merges upstream updates region by region, preserving local improvements. |
| Optional Plugin-Assisted Evolution | /eval-evolve runs up to 6 rounds when you explicitly opt in; stops at PASS or plateau. |
| Batch Audit + Optional Write Mode | /eval-audit --fix --budget 3 scans worst-first and only writes when fix mode is explicitly enabled. |
| CI Mode | --ci flag with exit codes: 0 = PASS, 1 = CAUTION, 2 = FAIL. |
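The exit-code contract makes the evaluator easy to wire into a pipeline. A sketch of a GitHub Actions step, assuming a hypothetical `npx skill-compass` entry point (the real invocation depends on how you run the evaluator locally):

```yaml
# Hypothetical CI step. The `npx skill-compass` invocation is an
# assumption for illustration; adapt it to your actual setup.
- name: Skill quality gate
  run: npx skill-compass eval-skill ./my-skill/SKILL.md --ci
  # Exit codes: 0 = PASS, 1 = CAUTION, 2 = FAIL
```

Note that any non-zero exit code fails a GitHub Actions step, so if you want CAUTION (exit 1) to warn without blocking, mark the step with continue-on-error and inspect the code in a follow-up step.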

Works With Everything

No point-to-point integration needed. The Pre-Accept Gate intercepts all SKILL.md edits regardless of source.

| Tool | How they work together | Guide |
| --- | --- | --- |
| Auto-Updater | Pulls a new version → the Gate auto-checks for security regressions → keep or roll back | guide |
| Claudeception | Extracts a skill → auto-evaluation catches security holes and redundancy → directed fix | guide |
| Self-Improving Agent | Logs errors → fed in as signals → SkillCompass maps them to dimensions and fixes | guide |

Feedback Signal Standard

SkillCompass defines an open feedback-signal.json schema for any tool to report skill usage data:

/eval-skill ./my-skill/SKILL.md --feedback ./feedback-signals.json

Signals: trigger_accuracy, correction_count, correction_patterns, adoption_rate, ignore_rate, usage_frequency. The schema is extensible (additionalProperties: true) — any pipeline can produce or consume this format.
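A minimal feedback-signal.json consistent with the listed signal names might look like this (the values are made up for illustration; check the shipped schema for exact types and ranges):

```json
{
  "trigger_accuracy": 0.92,
  "correction_count": 3,
  "correction_patterns": ["output too verbose"],
  "adoption_rate": 0.70,
  "ignore_rate": 0.08,
  "usage_frequency": 14
}
```

Because the schema allows additional properties, a pipeline can attach its own fields alongside these without breaking consumers.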

License

MIT — Use, modify, distribute freely. See LICENSE for details.
