Research Papers research paper arxiv ai artificial-intelligence

Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

arXivMarch 30, 202610 min read0 views

arXiv:2603.26410v1 Announce Type: cross Abstract: Extended-thinking models expose a second text-generation channel ("thinking tokens") alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint's target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model's thinking tokens contain hint-related keyword — Richard J. Young

View PDF HTML (experimental)

Abstract:Extended-thinking models expose a second text-generation channel ("thinking tokens") alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint's target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model's thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed thinking-answer divergence. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most transparent hint, with 58.8% of sycophancy-influenced cases acknowledging the professor's authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.

Comments: 19 pages, 8 figures, 4 tables

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

ACM classes: I.2.7; I.2.6

Cite as: arXiv:2603.26410 [cs.CL]

(or arXiv:2603.26410v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.26410

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Richard Young [view email] [v1] Fri, 27 Mar 2026 13:39:05 UTC (83 KB)

Original source

arXiv

https://arxiv.org/abs/2603.26410

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Open Source AILive

Salomi, a research repo on extreme low-bit transformer quantization

Article URL: https://github.com/OrionsLock/SALOMI Comments URL: https://news.ycombinator.com/item?id=47609883 Points: 7 # Comments: 0

Hacker News Top

1mabout 2 hours ago

ModelsLive

Adaptive Parallel Monte Carlo Tree Search for Efficient Test-time Compute Scaling

arXiv:2604.00510v1 Announce Type: new Abstract: Monte Carlo Tree Search (MCTS) is an effective test-time compute scaling (TTCS) method for improving the reasoning performance of large language models, but its highly variable execution time leads to severe long-tail latency in practice. Existing optimizations such as positive early exit, reduce latency in favorable cases but are less effective when search continues without meaningful progress. We introduce {\it negative early exit}, which prunes unproductive MCTS trajectories, and an {\it adaptive boosting mechanism} that reallocates reclaimed computation to reduce resource contention among concurrent searches. Integrated into vLLM, these techniques substantially reduce p99 end-to-end latency while improving throughput and maintaining reaso

ArXiv CS.AI

1mabout 2 hours ago

ModelsLive

Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models

arXiv:2604.00445v1 Announce Type: new Abstract: Uncertainty estimation (UE) aims to detect hallucinated outputs of large language models (LLMs) to improve their reliability. However, UE metrics often exhibit unstable performance across configurations, which significantly limits their applicability. In this work, we formalise this phenomenon as proxy failure, since most UE metrics originate from model behaviour, rather than being explicitly grounded in the factual correctness of LLM outputs. With this, we show that UE metrics become non-discriminative precisely in low-information regimes. To alleviate this, we propose Truth AnChoring (TAC), a post-hoc calibration method to remedy UE metrics, by mapping the raw scores to truth-aligned scores. Even with noisy and few-shot supervision, our TAC

ArXiv CS.AI

1mabout 2 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 233 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersLive

Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors

arXiv:2507.21989v3 Announce Type: replace-cross Abstract: Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, and others. Many of these applications require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition based on the item's attributes, a problem known as filtered approximate nearest neighbor search (FANNS). By performing an in-depth literature analysis on FANNS, we identify a key gap in the research landscape: publicly available datasets with embedding vectors from state-of-the-art transformer-based text embedding models that contain abundant real-world attributes covering a broad spectrum of a

arXiv cs.IR

1mabout 2 hours ago

Research PapersLive

Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce

arXiv:2604.00022v1 Announce Type: new Abstract: Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.00

arXiv cs.CL

2mabout 2 hours ago

Research PapersLive

Lead Zirconate Titanate Reservoir Computing for Classification of Written and Spoken Digits

arXiv:2604.00207v1 Announce Type: new Abstract: In this paper we extend our earlier work of (Rietman et al. 2022) presenting an application of physical Reservoir Computing (RC) to the classification of handwritten and spoken digits. We utilize an unpoled cube of Lead Zirconate Titanate (PZT) as a computational substrate to process these datasets. Our results demonstrate that the PZT reservoir achieves 89.0% accuracy on MNIST handwritten digits, representing a 2.4 percentage point improvement over logistic regression baselines applied to the same preprocessed data. However, for the AudioMNIST spoken digits dataset, the reservoir system (88.2% accuracy) performs equivalently to baseline methods (88.1% accuracy), suggesting that reservoir computing provides the greatest benefits for classific

arXiv cs.LG

1mabout 2 hours ago

Research PapersLive

Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition

arXiv:2604.00372v1 Announce Type: new Abstract: Multi-modality of color and depth, i.e., RGB-D, is of great importance in recent research of indoor scene recognition. In this kind of data representation, depth map is able to describe the 3D structure of scenes and geometric relations among objects. Previous works showed that local features of both modalities are vital for promotion of recognition accuracy. However, the problem of adaptive selection and effective exploitation on these key local features remains open in this field. In this paper, a dynamic graph model is proposed with adaptive node selection mechanism to solve the above problem. In this model, a dynamic graph is built up to model the relations among objects and scene, and a method of adaptive node selection is proposed to ta

arXiv cs.CV

2mabout 2 hours ago