We tested structured ontology vs Markdown+RAG for AI agents — "why?" recall was 0% vs 100%
Our AI agent knew the company uses Provider A for identity verification. It could name the provider, list the integration specs, recite the timeline.
Then we asked why Provider A was chosen over Provider B.
The agent couldn't answer. Not once across 24 attempts. Zero percent recall on reasoning questions.
So we built the layer that was missing — and ran 48 controlled experiments to measure the difference.
The problem: AI agents can't answer "why?"
If you give an AI agent a folder of Markdown docs and let it use RAG to find answers, it handles factual questions well. What modules exist? Who owns this component? When was this decision made?
But "why?" is different.
Reasoning is rarely stored as a discrete fact. It's spread across meeting notes, scattered through Slack threads, buried in the third paragraph of a design doc written six months ago. The connection between a strategic goal and an operational decision almost never appears as a single retrievable chunk.
This means vector search finds the documents that mention the decision, but not the reasoning chain that justifies it. The agent knows what happened. It doesn't know why.
This matters more than it sounds. An agent that doesn't understand why a decision was made will make follow-up decisions that are technically correct but institutionally wrong — optimizing for the wrong goal, violating an unwritten constraint, repeating a mistake that was already analyzed and rejected.
Our approach: structured ontology as a navigation layer
We didn't replace the Markdown docs. We added a structured layer on top — a four-level ontology that maps business reasoning into queryable relationships:
```
LORE        (foundational beliefs, worldview)
   ↓ interpreted_into
VISION      (goals, priorities, boundaries)
   ↓ operationalized_into
RULES       (policies, decision rules, constraints)
   ↓ applied_to
OPERATIONS  (initiatives, decisions, tasks)
```
Every connection between layers carries an assertion — an explicit explanation of why that relationship exists. This means an agent can trace from any operational decision back to the foundational beliefs that justify it.
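As a concrete illustration, here is a minimal in-memory sketch of edges that carry assertions. The structure, field names, and assertion wording are our own simplification for this post, not the project's actual schema:

```python
# Minimal sketch: each edge between ontology layers carries an assertion.
# Names and wording are illustrative, not the real Right Reasons schema.

EDGES = [
    # (child, relation, parent, assertion)
    ("Chose Provider A", "applied_to",
     "Start with affordable identity provider",
     "Provider A satisfies the affordability rule and is OIDC-compatible."),
    ("Start with affordable identity provider", "operationalized_into",
     "Build self-service tools for micro-entrepreneurs",
     "Low early identity costs keep self-service tooling viable."),
    ("Build self-service tools for micro-entrepreneurs", "interpreted_into",
     "Small business owners want to handle accounting themselves",
     "Self-service tooling follows directly from this belief."),
]

def trace_why(node):
    """Walk upward through the layers, collecting the assertion at each hop."""
    chain, current = [], node
    for child, relation, parent, assertion in EDGES:
        if child == current:
            chain.append((relation, parent, assertion))
            current = parent
    return chain

for relation, parent, assertion in trace_why("Chose Provider A"):
    print(f"{relation} -> {parent}\n    because: {assertion}")
```

Because the assertion lives on the edge itself, the "why" is retrieved by graph traversal rather than by hoping a retriever surfaces the right paragraph.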
Here's what that looks like in practice. Ask: "Why did we choose Provider A for identity verification?"
The agent traces:
```
OPERATIONS → Chose Provider A (affordable, OIDC-compatible)
   ← applied_to
RULES      → Start with affordable identity provider, plan migration later
   ← operationalized_into
VISION     → Build self-service tools for micro-entrepreneurs
   ← interpreted_into
LORE       → Small business owners want to handle accounting themselves
```
No vector search. No probabilistic retrieval. SQL queries over a versioned database.
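To make "SQL queries, not vector search" concrete, here is a hedged sketch of a reasoning-chain trace as a recursive CTE. We use in-memory SQLite for a runnable demo; the real system runs on Dolt, and its actual table layout will differ:

```python
import sqlite3

# Toy schema: objects per layer, plus edges whose assertion column stores
# the "why" for each link. Table and column names are our assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (id TEXT PRIMARY KEY, layer TEXT, label TEXT);
CREATE TABLE edges (child TEXT, relation TEXT, parent TEXT, assertion TEXT);
INSERT INTO objects VALUES
  ('ex_ops_02', 'OPERATIONS', 'Chose Provider A'),
  ('rule_01',   'RULES',      'Start with affordable identity provider'),
  ('vis_01',    'VISION',     'Build self-service tools for micro-entrepreneurs'),
  ('lore_01',   'LORE',       'Owners want to handle accounting themselves');
INSERT INTO edges VALUES
  ('ex_ops_02', 'applied_to',           'rule_01', 'Affordable and OIDC-compatible'),
  ('rule_01',   'operationalized_into', 'vis_01',  'Low cost keeps self-service viable'),
  ('vis_01',    'interpreted_into',     'lore_01', 'Tooling follows from this belief');
""")

# Recursive walk from an operational decision up to its foundational belief.
rows = conn.execute("""
WITH RECURSIVE chain(child, relation, parent, assertion) AS (
  SELECT child, relation, parent, assertion FROM edges WHERE child = 'ex_ops_02'
  UNION ALL
  SELECT e.child, e.relation, e.parent, e.assertion
  FROM edges e JOIN chain c ON e.child = c.parent
)
SELECT relation, parent, assertion FROM chain
""").fetchall()

for relation, parent, assertion in rows:
    print(relation, "->", parent, ":", assertion)
```

The query is deterministic: the same decision ID always yields the same chain, which is what makes the 100% "why?" recall possible in principle.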
The backend is Dolt — a database with Git semantics. Branch, commit, diff, merge, pull request. Every change to the ontology goes through human review before it becomes canonical.
The interface is MCP (Model Context Protocol) — the de facto standard for connecting AI agents to external tools. Our server exposes 18 tools: 9 for querying, 4 for proposing changes, 3 for generating reasoning envelopes, and 2 for Dolt version control.
The experiment
We tested this on a real business domain — a SaaS company's market expansion project. Same knowledge base, same questions, two modes:
- **Mode A:** Agent gets Markdown documentation + file search tools
- **Mode B:** Agent gets the same knowledge as a structured ontology + Dolt MCP tools
48 sessions. 8 task types. 3 runs per task per mode. Two independent LLM judges (GPT-5.4 and Claude Opus 4.5) evaluated every answer against ground truth.
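For readers who want to replicate the scoring, here is a minimal sketch of an entity-recall metric as we describe it above: the fraction of ground-truth entities that appear in the agent's answer. The exact judge prompts and matching rules in the experiment may differ; this is the plain-string version:

```python
# Hedged sketch of entity recall: fraction of ground-truth entities found
# in the answer. The experiment's LLM judges likely match more loosely.

def entity_recall(answer: str, ground_truth_entities: list[str]) -> float:
    answer_lower = answer.lower()
    hits = [e for e in ground_truth_entities if e.lower() in answer_lower]
    return len(hits) / len(ground_truth_entities)

answer = ("We chose Provider A because the rule says to start with "
          "an affordable identity provider.")
truth = ["Provider A", "affordable identity provider", "micro-entrepreneurs"]
print(round(entity_recall(answer, truth), 3))  # → 0.667 (2 of 3 entities found)
```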
Results
| Metric | Markdown + RAG | Right Reasons | Delta |
| --- | --- | --- | --- |
| Entity recall | 0.514 | 0.976 | +90% |
| "Why?" question recall | 0.000 | 1.000 | 0% → 100% |
| Reasoning quality (1–5) | 1.96 | 4.33 | +121% |
| Stability (variance) | 1.457 | 0.472 | 3× more stable |
| Latency | 284.6 s | 183.8 s | 35% faster |
| Pairwise wins | 0 | 20 (4 ties) | |
The "why?" result is the headline: Mode A scored 0.0 entity recall across all 6 runs on reasoning questions. Not low — zero. Mode B scored 1.0 across all 6 runs. This isn't statistical noise. It's a deterministic gap.
The conventional assumption is that structured retrieval is a tradeoff — better recall but more overhead and higher latency. This experiment showed the opposite: the structured approach was simultaneously more accurate, faster, more stable, and more compact in its answers.
Judge agreement was 83.3%. Average judge confidence was 0.927. The only disagreements were on impact analysis tasks where multiple valid reasoning paths existed.
What we didn't prove (honestly)
- **Ingest:** Getting business knowledge into the ontology was manual. This is the hardest unsolved problem.
- **Write path:** We only tested reading. Agents proposing ontology changes is designed but not yet benchmarked.
- **Generalization:** Tested on one domain (dev planning). Other domains are next.
How knowledge enters the ontology: EPICAL
We're not expecting anyone to manually populate SQL tables. The designed ingest pipeline is called EPICAL:
```
Source docs → EXTRACT → PONDER → INTERROGATE → CALIBRATE → AUTHENTICATE → LOAD
```
The first two stages (Extract and Ponder) are agent-driven — the AI proposes candidate objects and relationships from source documents. Interrogate and Calibrate refine confidence. Authenticate is the human gate — a Dolt diff review, just like a code PR. Only after human approval does knowledge become canonical.
The epistemic boundary is strict: an agent cannot bypass human validation. The promote_candidate tool requires authenticated status.
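Here is a sketch of how that boundary can be enforced in code, assuming candidates carry a `status` field that moves from `candidate` to `authenticated` only via human review. The class and field names are illustrative, not the MCP server's actual implementation:

```python
# Sketch of the epistemic gate: a candidate becomes canonical only after
# a human marks it 'authenticated' (via a Dolt diff review). Names are
# illustrative, not the server's actual code.

class EpistemicGateError(Exception):
    pass

def promote_candidate(candidate: dict) -> dict:
    """Refuse promotion unless a human has authenticated the candidate."""
    if candidate.get("status") != "authenticated":
        raise EpistemicGateError(
            "promote_candidate requires human-authenticated status")
    return {**candidate, "canonical": True}

draft = {"id": "cand_01", "claim": "Provider A chosen for cost",
         "status": "candidate"}

try:
    promote_candidate(draft)           # agent tries to skip the human gate
except EpistemicGateError as err:
    print("blocked:", err)

draft["status"] = "authenticated"      # set only after a human review
print(promote_candidate(draft)["canonical"])  # → True
```

The point of the gate is that no sequence of agent tool calls can flip the status; only the human review step can.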
OPS Contracts: reasoning envelopes for external work
One more concept worth mentioning. When work happens in external systems (Jira, GitHub, CI/CD), the agent can generate an OPS Contract — a reasoning envelope that attaches institutional context to a work item:
```
generate_ops_contract(
    external_work_ref="jira://TASK-123",
    description="Prepare annual report for submission",
    contract_kind="annual_reporting"
)
```
The contract tells the executing agent why this task matters, what rules apply, and which boundaries must not be crossed — without the agent needing to query the full ontology itself.
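To show what such an envelope could contain, here is a hypothetical sketch. Only the three call parameters come from the example above; the `why`, `applicable_rules`, and `boundaries` fields, and their contents, are our assumptions about what a reasoning envelope would carry:

```python
import json

# Hypothetical envelope shape. Only the three call parameters appear in
# the article; the remaining fields are assumptions for illustration.

def generate_ops_contract(external_work_ref, description, contract_kind,
                          why, applicable_rules, boundaries):
    return json.dumps({
        "external_work_ref": external_work_ref,
        "description": description,
        "contract_kind": contract_kind,
        "why": why,                        # summarized reasoning chain
        "applicable_rules": applicable_rules,
        "boundaries": boundaries,          # constraints that must hold
    }, indent=2)

envelope = generate_ops_contract(
    external_work_ref="jira://TASK-123",
    description="Prepare annual report for submission",
    contract_kind="annual_reporting",
    why="Annual reporting operationalizes rules in the RULES layer",
    applicable_rules=["Use the approved reporting template"],
    boundaries=["Do not submit without human sign-off"],
)
print(envelope)
```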
Try it
The full repo is open source:
```bash
git clone https://github.com/Right-Reasons/right-reasons
cd right-reasons
docker compose up -d
cd mcp-server && pip install -e .
```
Connect your agent, then ask:
"Why did we choose Provider A over Provider B for identity? Use the get_explanation_packet tool with object ID ex_ops_02."
The agent will trace the full reasoning chain across all four layers.
- 📦 GitHub repo
- 🌐 Website
- 📝 Background article by Kaspar Loit
- 📊 Full experiment results
Right Reasons is built by MindWorks Industries. We're looking for early users who want to give their AI agents actual institutional reasoning. Reach out at [email protected].