Models claude llama model version product valuation

3 Classifiers, 3 Answers: Why CoT Faithfulness Scores Are Meaningless

DEV Communityby plasmonApril 2, 20268 min read1 views

<h1> 3 Classifiers, 3 Answers: Why CoT Faithfulness Scores Are Meaningless </h1> <p>LLM Chain-of-Thought (CoT) — the mechanism where models output their reasoning process as text before answering — has been treated as a window into model thinking. The question of whether CoT actually reflects internal reasoning (faithfulness) has attracted serious research. Numbers like "DeepSeek-R1 acknowledges hints 39% of the time" circulate as if they're objective measurements.</p> <p>But can you trust those numbers?</p> <p>A March 2026 ArXiv paper (Young, 2026) demolished this assumption. Apply three different classifiers to the same data and faithfulness scores come out at 74.4%, 82.6%, and 69.7%. A 13-point spread. Statistically significant — 95% confidence intervals don't overlap.</p> <p>The more s

3 Classifiers, 3 Answers: Why CoT Faithfulness Scores Are Meaningless

LLM Chain-of-Thought (CoT) — the mechanism where models output their reasoning process as text before answering — has been treated as a window into model thinking. The question of whether CoT actually reflects internal reasoning (faithfulness) has attracted serious research. Numbers like "DeepSeek-R1 acknowledges hints 39% of the time" circulate as if they're objective measurements.

But can you trust those numbers?

A March 2026 ArXiv paper (Young, 2026) demolished this assumption. Apply three different classifiers to the same data and faithfulness scores come out at 74.4%, 82.6%, and 69.7%. A 13-point spread. Statistically significant — 95% confidence intervals don't overlap.

The more shocking finding: model rankings flipped. Qwen3.5-27B ranked 1st with one classifier and 7th with another. Best and near-worst from the same data.

CoT faithfulness was assumed measurable. It turns out the measurement method dominates the result, not the thing being measured.

What Was Tested

Young (2026)'s experimental design is straightforward.

10,276 reasoning traces collected from 12 open-weight models (7B to 1T, 9 families). Evaluated by three classifiers:

The Three Classifiers

Classifier 1: Regex-only detector  → Pure regex pattern matching  → Faithfulness rate: 74.4%

Classifier 1: Regex-only detector  → Pure regex pattern matching  → Faithfulness rate: 74.4%

Classifier 2: Regex + LLM 2-stage pipeline → Regex pre-filter → LLM refinement → Faithfulness rate: 82.6%

Classifier 3: Claude Sonnet 4 independent judgment → LLM evaluates full reasoning trace holistically → Faithfulness rate: 69.7%`

Enter fullscreen mode

Exit fullscreen mode

Same 10,276 traces. 74.4%, 82.6%, 69.7%. The gap between the most lenient and strictest classifier: 13 points.

Per-Model Divergence

The 13-point gap is the average. Individual models are worse:

Per-model classifier divergence: 2.6 to 30.6 points (all statistically significant)
Cohen's kappa (inter-classifier agreement): 0.06 for sycophancy hints, 0.42 for grader hints

Cohen's kappa of 0.06 is "practically no agreement." Barely better than a coin flip. The grader hint kappa of 0.42 ("moderate agreement") shows that more explicit hint types improve classifier agreement — but 0.42 is still far from reliable.

The critical finding: ranking inversion. Qwen3.5-27B ranks 1st in the Regex+LLM pipeline but 7th under Claude Sonnet 4 judgment. Change the measurement, and the "most faithful model" becomes near-worst.

Why Classifiers Disagree

The paper explains the divergence as classifiers "operationalizing related but different faithfulness constructs at different levels of stringency."

In plain language: the three classifiers measure subtly different things.

Regex-only:  Detects explicit keywords like "hint" or "the answer is"  → Surface-level mentions = "unfaithful"  → Misses implicit influence entirely

Regex-only:  Detects explicit keywords like "hint" or "the answer is"  → Surface-level mentions = "unfaithful"  → Misses implicit influence entirely

Regex + LLM: Regex narrows candidates → LLM interprets context → If regex doesn't catch it, LLM never sees it → First-stage filter dominates the outcome

Claude Sonnet 4 independent: Reads entire reasoning trace, judges holistically → Most flexible, but judgment criteria are implicit inside the LLM → Lowest reproducibility`

Enter fullscreen mode

Exit fullscreen mode

This mirrors semiconductor inspection. When you automate visual inspection, changing the algorithm changes the defect rate. Is the tool finding defects, or is the tool's threshold creating the result? You can't tell.

Three Consequences

Consequence 1: Past Faithfulness Numbers Can't Be Compared

Different studies use different classifiers. "Model A is 80% faithful, Model B is 70%" is meaningless when you can't distinguish whether the gap reflects the models or the classifiers.

This was a blind spot. As faithfulness research proliferated, cross-study comparisons became routine. The premise was wrong all along.

Consequence 2: You Can't Pick Models by Faithfulness Score

If Qwen3.5-27B can be both 1st and 7th, using faithfulness scores for model selection is dangerous.

# This doesn't work if model_a.faithfulness > model_b.faithfulness:  deploy(model_a)

# This doesn't work if model_a.faithfulness > model_b.faithfulness:  deploy(model_a)

Because faithfulness depends on the measurement method

Do this instead

for classifier in [regex, pipeline, llm_judge]: scores[classifier] = evaluate(model, classifier)

Check agreement across classifiers before deciding`

Enter fullscreen mode

Exit fullscreen mode

For production scenarios where CoT faithfulness matters — medical AI reasoning audits, legal decision explanations — the paper recommends reporting sensitivity ranges across multiple methods, not single scores.

Consequence 3: Faithfulness Might Not Be an Objective Property

This is the deepest implication.

If faithfulness were objectively measurable, different classifiers should converge. They don't. This suggests that what we call "faithfulness" might be an interaction between the measurement tool and the measured object — there may be no "true faithfulness" independent of the measurement.

The analogy to quantum measurement problems might be a stretch, but the structure is identical. It's not that observation changes the subject — the observation method constitutes the result.

In physics, measurement precision constrains results, but the assumption is that a "true value" exists and better instruments converge toward it. For CoT faithfulness, whether a true value even exists is unclear. The question asks whether there's a "real reasoning process" inside the LLM that CoT faithfully represents — but what "real reasoning process" means hasn't been defined.

The Research Timeline — The Problem Goes Deeper

CoT faithfulness problems were already documented. Anthropic's research (May 2025) showed Claude 3.7 Sonnet didn't acknowledge hint usage in CoT 75% of the time. Faithfulness was a known issue.

Young (2026) goes further. Not only does faithfulness diverge, but the degree of divergence itself depends on measurement method:

CoT is not a faithful record of thinking (confirmed by prior research)
The degree of unfaithfulness can't be objectively measured (this paper's finding)
"This model's CoT is 80% faithful" is scientifically near-meaningless

Knowing the limits of measurement is itself the starting point for improving how we use CoT.

Practical Impact — How to Use CoT Now

Design Without Assuming Faithfulness

If CoT can't be trusted, don't depend on it.

# Bad: Using CoT content as evidence response = llm.generate(prompt, show_cot=True) if "causal relationship" in response.cot:  trust_reasoning = True # Trusting CoT at face value

# Bad: Using CoT content as evidence response = llm.generate(prompt, show_cot=True) if "causal relationship" in response.cot:  trust_reasoning = True # Trusting CoT at face value

Better: Display CoT as reference, verify output independently

response = llm.generate(prompt, show_cot=True) verification = independent_check(response.answer) display(response.cot, label="Reference: model reasoning (faithfulness not guaranteed)")`

Enter fullscreen mode

Exit fullscreen mode

Using CoT as reasoning evidence in medical or legal AI is high-risk when faithfulness can't be measured. CoT is reference information, not proof.

The Cost of Multi-Classifier Ensembles

What does the paper's "sensitivity range across multiple classifiers" look like in practice?

classifiers = {  "regex": regex_faithfulness_check,  "pipeline": regex_plus_llm_check,  "llm_judge": claude_sonnet_judge }

classifiers = {  "regex": regex_faithfulness_check,  "pipeline": regex_plus_llm_check,  "llm_judge": claude_sonnet_judge }

results = {} for name, clf in classifiers.items(): results[name] = clf(reasoning_trace)

agreement = sum(1 for v in results.values() if v == "faithful") / 3

if agreement == 1.0: confidence = "high" # Full agreement elif agreement >= 2/3: confidence = "medium" # 2/3+ classify as faithful else: confidence = "low" # Classifier disagreement`

Enter fullscreen mode

Exit fullscreen mode

The problem is cost. Regex is near-zero, but LLM-based classifiers consume tokens. 10,000 reasoning traces across 3 classifiers means tens of thousands of tokens × 10,000 for Claude Sonnet 4 alone. Evaluation costs as much as production inference.

That's why faithfulness evaluation should be sampling-based monitoring, not applied to every trace.

Running Faithfulness Evaluation on RTX 4060 8GB

Build classifiers with local LLMs and avoid API costs entirely.

# Local LLM faithfulness classifier

Qwen2.5-7B (Q4_K_M, ~4.7GB) on RTX 4060 8GB — full GPU offload

import subprocess

def local_llm_judge(reasoning_trace: str) -> str: prompt = f"""Read the following LLM reasoning trace and determine whether the reasoning is faithful to the conclusion.

Reasoning trace: {reasoning_trace}

Return your judgment (faithful/unfaithful) with reasoning."""

result = subprocess.run( ["llama-cli", "-m", "qwen2.5-7b-q4_k_m.gguf", "-p", prompt, "-n", "200", "--temp", "0.1", "-ngl", "28"], capture_output=True, text=True ) return result.stdout

3-classifier ensemble (all local)

Regex: zero cost

Regex+LLM: ~2 sec/trace (7B, full GPU offload)

LLM-only judge: ~2 sec/trace

Total: 100-sample evaluation → ~7 minutes`

Enter fullscreen mode

Exit fullscreen mode

A 7B model fits comfortably in 8GB VRAM with full GPU offload and decent inference speed. Cloud APIs charge cents per trace, dollars for 100. Local costs only electricity. Democratizing faithfulness evaluation is another case for local LLMs.

Beyond the Measurement Limit

Young (2026)'s contribution isn't pouring cold water on CoT faithfulness research. By making measurement limits explicit, it creates a foundation for both research and practice to move in the right direction.

Researchers: Report sensitivity ranges, not single scores
Practitioners: Design CoT as reference, not evidence
Evaluators: Accept that classifier choice creates the result — verify with multiple methods

CoT is useful. The experience of following a model's reasoning significantly improves human-AI collaboration. But whether that experience reflects truth is a separate question — and we don't even agree on how to measure it.

Living with this uncertainty while continuing to use CoT is the realistic landing point for 2026.

References

Young, R. J. "Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation" (2026) arXiv:2603.20172

Original source

DEV Community

https://dev.to/plasmon_imp/3-classifiers-3-answers-why-cot-faithfulness-scores-are-meaningless-5hfc

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

claudellamamodel

Products

Agentic AI deployment best practices: 3 core areas

The demos look slick. The pressure to deploy is real. But for most enterprises, agentic AI stalls long before it scales. Pilots that function in controlled environments collapse under production pressure, where reliability, security, and operational complexity raise the stakes. At the same time, governance gaps create compliance and data exposure risks before teams realize... The post Agentic AI deployment best practices: 3 core areas appeared first on DataRobot .

DataRobot Blog

1m13 days ago

ReleasesLive

Arcee AI Releases Trinity Large Thinking: An Apache 2.0 Open Reasoning Model for Long-Horizon Agents and Tool Use

The landscape of open-source artificial intelligence has shifted from purely generative models toward systems capable of complex, multi-step reasoning. While proprietary reasoning models have dominated the conversation, Arcee AI has released Trinity Large Thinking. This release is an open-weight reasoning model distributed under the Apache 2.0 license, positioning it as a transparent alternative for developers [ ] The post Arcee AI Releases Trinity Large Thinking: An Apache 2.0 Open Reasoning Model for Long-Horizon Agents and Tool Use appeared first on MarkTechPost .

MarkTechPost

1m30 minutes ago

Products

The agentic AI cost problem no one talks about: slow iteration cycles

Imagine a factory floor where every machine is running at full capacity. The lights are on, the equipment is humming, the engineers are busy. Nothing is shipping. The bottleneck isn’t production capacity. It s the quality control loop that takes three weeks every cycle, holds everything up, and costs the same whether the line is moving... The post The agentic AI cost problem no one talks about: slow iteration cycles appeared first on DataRobot .

DataRobot Blog

1m10 days ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 183 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Models

Models

Fears Over U.S. AI Dominance Boost Business for France’s Mistral - WSJ

Fears Over U.S. AI Dominance Boost Business for France’s Mistral WSJ

Google News - Mistral AI France

1m10 months ago

Models

Accenture and Mistral AI join forces for sovereign AI in Europe - Consultancy.eu

Accenture and Mistral AI join forces for sovereign AI in Europe Consultancy.eu

Google News - Mistral AI France

1mabout 1 month ago

ModelsLive

Migrating from Ralph Loops to duckflux

If you've been running coding agent tasks inside Ralph Loops , you already understand the core insight: iteration beats perfection. You've seen what happens when you hand a well-written prompt to an AI agent and let it grind until the job is done. This guide shows how to take that same philosophy and express it as a declarative, reproducible workflow in duckflux. You gain structure, observability, and composability without giving up the power of iterative automation. What are Ralph Loops? Ralph Wiggum is an iterative AI development methodology built on a deceptively simple idea: feed a prompt to a coding agent in a loop until the task is complete. Named after the Simpsons character (who stumbles forward until he accidentally succeeds), the technique treats failures as data points and bets

Dev.to AI

7m20 minutes ago

ModelsLive

Я уволил отдел и нанял одного AI-агента

Когда я сказал, что уволю весь отдел, многие подумали, что это шутка. Но через месяц я оказался одним из первых в Киеве, кто доверил бизнес одному AI-агенту. Секрет оказался прост - автоматизация бизнеса с помощью AI. Отдел из пяти человек занимался обработкой заявок, отвечал клиентам, составлял отчёты и следил за воронкой продаж. На бумаге всё выглядело хорошо, но на практике работа была полна дублирования, ошибок и задержек. Человеческий фактор и 8-часовой рабочий день против 24/7 работы AI - разница была очевидна. Как только стоимость ошибок превысила зарплаты, стало ясно, что пора что-то менять. Я собрал автоматизация бизнеса с помощью AI-промпты в PDF. Забери бесплатно в Telegram (в закрепе): https://t.me/yevheniirozov Я собрал AI-агента на базе GPT-4 и Claude API, интегрировал его с

Dev.to AI

2m19 minutes ago