HarnessOS: scaffold/middleware for infinite autonomous tasks — built on Harness Engineering
There's a concept gaining traction in AI systems engineering: Harness Engineering.
Not the testing tool. The idea: raw LLM capability is like raw power — high voltage, hard to control, dangerous to run indefinitely. Harness Engineering is the discipline of building the control structures that make that power usable at scale. Context managers. Evaluation loops. Failure classifiers. Goal trackers. Memory tiers.
I think it's going to be one of the defining disciplines of serious AI systems work. And I've been building a platform around it.
What I Built
HarnessOS is a scaffold/middleware system for running infinite autonomous tasks.
The key word is infinite. Not one task. Not one session. An agent that:
- Runs continuously, across context window rotations
- Evolves its own goals when it succeeds at the current one
- Persists state across sessions without losing context
- Classifies its own failures and routes them appropriately
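The four capabilities above fit together as one outer loop. Here is a minimal sketch of that shape — all names (`AgentState`, `run_loop`, the callback signatures) are illustrative, not the actual HarnessOS API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """State that persists across sessions and context rotations."""
    goal: str
    history: list = field(default_factory=list)

def run_loop(state, execute, evolve_goal, classify_failure, max_steps=10):
    """Illustrative outer loop: run, evolve the goal on success,
    route classified failures instead of blindly retrying."""
    for _ in range(max_steps):
        ok, result = execute(state.goal)
        state.history.append((state.goal, ok))
        if ok:
            # Success is not terminal: generate the next improvement goal.
            state.goal = evolve_goal(state.goal, result)
        else:
            kind = classify_failure(result)
            if kind == "goal_mismatch":
                break  # fundamental mismatch: stop and escalate
            # transient / persistent: stay in the loop and retry
    return state
```

A toy run with stub callbacks shows the goal evolving on each success rather than the loop terminating.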
This is the architecture:
```
HarnessOS
├── CTX                ← context precision layer
│   └── LLM-free retrieval, 5.2% token budget, R@5=1.0 dependency recall
├── omc-live           ← finite outer loop
│   └── 2-Wave strategy + self-evolving goals + episode memory
├── omc-live-infinite  ← infinite outer loop
│   └── context rotation, world model, no iteration cap
├── HalluMaze          ← hallucination management (in development)
└── [future layers]
    ├── Evaluation Layer
    ├── Safety Layer
    └── Memory Tier System
```
The Problem with Current Agent Frameworks
Most agent frameworks are built for tasks that complete in one session.
Spin up → run → done.
That's fine for demos. It breaks for real autonomous work:
- Context exhaustion: at ~70% context capacity, agents start losing earlier decisions. Not gracefully. They cliff-edge: sudden degradation, not gradual fade.
- No goal evolution: an agent that succeeds at "write tests" has no mechanism to ask "what's the next improvement?" It just stops.
- Failure is terminal: most frameworks catch exceptions. Few classify them — transient vs persistent vs fundamental goal mismatch.
HarnessOS is built specifically to address all three.
What I Measured (The Empirical Foundation)
Before building anything, I ran controlled experiments on questions I couldn't find good empirical answers to anywhere else.
Q1: How should autonomous agents reason about problems?
Compared hypothesis-driven debugging (observe → hypothesize → verify) against engineering-only (pattern match → retry) on 12 bug scenarios.
| Bug type   | Engineering   | Hypothesis   | Delta |
|------------|---------------|--------------|-------|
| Simple     | 1.0 attempts  | 1.0 attempts | none  |
| Causal     | 1.75 attempts | 1.0 attempts | -43%  |
| Assumption | 2.0 attempts  | 1.0 attempts | -50%  |
First-hypothesis accuracy: 100%. This is now the default reasoning strategy in omc-live.
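The observe → hypothesize → verify cycle can be sketched as a small driver function. This is my own illustration of the strategy, not the omc-live implementation, and the callback names are invented:

```python
def hypothesis_debug(observe, hypothesize, verify, max_attempts=5):
    """Hypothesis-driven debugging: gather evidence, form a hypothesis,
    verify it; only re-hypothesize on fresh evidence after a miss."""
    evidence = observe()
    for attempt in range(1, max_attempts + 1):
        hypothesis = hypothesize(evidence)
        if verify(hypothesis):
            return hypothesis, attempt
        # A failed verification is new evidence; observe again
        # instead of pattern-match-and-retry.
        evidence = observe()
    return None, max_attempts
```

The contrast with engineering-only debugging is that verification gates every retry, which is what drives attempt counts toward 1.0 on causal and assumption bugs.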
Q2: Where do context limits actually hit?
Measured Lost-in-the-Middle across 1K/10K/50K/100K token contexts.
Key finding: degradation is threshold-based, not gradual.
Agents don't slowly forget. They cliff-edge at a specific token length and fail silently. This changed how omc-live-infinite handles context — it monitors budget and triggers a safe rotation handoff at 70%, before the cliff.
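Because the cliff is silent, the trigger has to be proactive. A minimal sketch of the 70% rotation guard described above — the threshold comes from the text, but the function names and handoff shape are illustrative:

```python
ROTATION_THRESHOLD = 0.70  # rotate well before the observed cliff

def should_rotate(tokens_used: int, context_limit: int) -> bool:
    """True once consumed budget crosses the rotation threshold."""
    return tokens_used / context_limit >= ROTATION_THRESHOLD

def maybe_rotate(tokens_used, context_limit, save_state, start_fresh_session):
    """Hypothetical handoff: persist state, then resume in a new session.
    Returns the new session, or None if no rotation was needed."""
    if should_rotate(tokens_used, context_limit):
        snapshot = save_state()
        return start_fresh_session(snapshot)
    return None
```

The design point is that rotation is a budget decision, not a failure reaction: by the time quality visibly degrades, it is already too late to hand off cleanly.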
Q3: Where do autonomous agents actually fail?
I ran OpenHands on 20-step coding tasks. Failures clustered into three categories:
- Wrong task decomposition (incorrect sub-goals from the start)
- Role non-compliance (agent exceeds defined scope)
- Boundary violations (unexpected state mutations)
Predictable = preventable. The omc-failure-router classifies failures into these categories and routes them appropriately instead of generic retry.
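A toy version of that routing logic, to make the idea concrete. The category keys and route names here are my own stand-ins; the real omc-failure-router's rules are more involved:

```python
def route_failure(failure: dict) -> str:
    """Map a classified failure to a recovery route instead of
    falling through to a generic retry."""
    kind = failure.get("kind")
    if kind == "decomposition":
        return "replan"    # rebuild sub-goals from the top
    if kind == "role":
        return "re-scope"  # restate the agent's boundaries, then retry
    if kind == "boundary":
        return "rollback"  # undo the state mutation, retry sandboxed
    return "retry"         # unclassified: fall back to generic retry
```

The point is that each cluster has a distinct cheapest fix — rerunning a badly decomposed task just reproduces the bad decomposition.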
The Architecture in Practice
omc-live: Finite Self-Evolving Loop
```
Wave 1: Strategy consultation (specialist agents, runs once)
  ↓
Wave 2: Execution loop
  ↓
Judgment: Goal achieved?
├── NO  → update goal tree, retry
└── YES → Score (5 dimensions)
    ├── delta ≥ epsilon → EVOLVE goal, continue
    └── plateau × 3     → CONVERGED, stop
```
When the system succeeds, it scores the output, finds the weakest dimension, generates an elevated goal, and continues — until quality plateaus.
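That judgment step can be sketched as a single function. The dimension names, `epsilon` default, and return shape are assumptions for illustration; only the evolve/converge logic follows the diagram:

```python
def evolve_or_converge(scores: dict, prev_best: float, plateau_count: int,
                       epsilon: float = 0.05, plateau_limit: int = 3):
    """Score the output, target the weakest dimension, and decide:
    EVOLVE on meaningful improvement, CONVERGED after flat rounds."""
    overall = sum(scores.values()) / len(scores)
    weakest = min(scores, key=scores.get)
    if overall - prev_best >= epsilon:
        # delta ≥ epsilon: elevate the goal toward the weakest dimension
        return ("EVOLVE", f"improve {weakest}", 0)
    plateau_count += 1
    if plateau_count >= plateau_limit:
        return ("CONVERGED", None, plateau_count)
    return ("RETRY", f"improve {weakest}", plateau_count)
```

Targeting the weakest dimension is what keeps evolved goals specific instead of a vague "do better".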
omc-live-infinite: No Iteration Cap
New mechanisms beyond the finite version:
- Context rotation: at 70% budget → save state → fresh session → resume
- World model: epistemic state layer that persists across rotations
- Co-evolution feedback: strategy outcomes feed back into Wave 1 planning
Together these enable agents that work on complex goals for hours, not seconds.
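The persistence piece is the simplest to illustrate: the world model must survive a full process restart, so it has to round-trip through durable storage. A minimal sketch, assuming a flat JSON-serializable state (the real world model layer is richer than a dict):

```python
import json

def save_world_model(model: dict, path: str) -> None:
    """Persist the epistemic state before a rotation tears down
    the session."""
    with open(path, "w") as f:
        json.dump(model, f)

def resume_world_model(path: str) -> dict:
    """Reload the state in the fresh session so work continues
    where the previous session left off."""
    with open(path) as f:
        return json.load(f)
```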
CTX: Precision Context Loading
Query classification → retrieval strategy selection:
- EXPLICIT_SYMBOL → direct lookup
- SEMANTIC_FUNCTIONALITY → embedding search
- STRUCTURAL_RELATIONSHIP → dependency graph
- RECENT_CHANGE → git recency
Result: 5.2% average token budget, R@5=1.0. No LLM calls for retrieval.
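A toy classifier-plus-dispatch showing the shape of that routing. The heuristics below are mine, not CTX's actual rules; only the four category names and their retrieval strategies come from the list above:

```python
import re

def classify_query(query: str) -> str:
    """Route a query to one of the four retrieval strategies
    (illustrative keyword heuristics, no LLM calls)."""
    if re.search(r"\b[A-Za-z_][A-Za-z0-9_]*\(\)", query):
        return "EXPLICIT_SYMBOL"          # looks like a named function
    if "depends on" in query or "imports" in query:
        return "STRUCTURAL_RELATIONSHIP"  # graph-shaped question
    if "recent" in query or "changed" in query:
        return "RECENT_CHANGE"            # history-shaped question
    return "SEMANTIC_FUNCTIONALITY"       # default: meaning-based search

RETRIEVERS = {
    "EXPLICIT_SYMBOL": "direct lookup",
    "SEMANTIC_FUNCTIONALITY": "embedding search",
    "STRUCTURAL_RELATIONSHIP": "dependency graph",
    "RECENT_CHANGE": "git recency",
}
```

Classifying first is what keeps the token budget small: each strategy loads only the slice of the codebase its query shape actually needs.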
Why "Harness Engineering" Is the Right Frame
A harness doesn't constrain power — it channels it.
LLMs have enormous capability. Without control structure, that capability is: context-unaware, goal-unstable, failure-opaque, session-local.
HarnessOS adds the control structure. Not to limit the model — to make it usable for work that spans hours, not seconds.
Current State & Quick Start
214 tests, 100% coverage. CTX and omc-live/infinite are stable and used daily.
```
git clone https://github.com/jaytoone/HarnessOS
python3 analyze.py --run
```
No pip install. No required API keys for base experiments.
GitHub: https://github.com/jaytoone/HarnessOS
If you're building autonomous agents and thinking about long-run reliability — happy to compare notes.