The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
arXiv:2603.29025v1 Announce Type: new
Abstract: Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) -- 500 instances spanning 4 heuristic × 5 constraint families with minimal pairs and explicitness gradients -- demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are the hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing a conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
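The "context-independent sigmoid heuristic" finding can be illustrated with a minimal logistic sketch. All weights here are hypothetical, chosen only so that the distance-to-goal influence ratio falls inside the 8.7–38× range the abstract reports; the paper's actual fitted parameters are not given here.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical cue weights: the distance cue's coefficient dwarfs the
# goal's, mimicking the reported 8.7x-38x influence gap.
GOAL_WEIGHT = 0.1      # contribution of the stated goal ("wash the car")
DISTANCE_WEIGHT = -3.0  # contribution of the distance cue, per km
BIAS = 4.0

def p_walk(distance_km: float, goal_present: bool = True) -> float:
    """Probability the model answers 'walk', modeled as a sigmoid that is
    driven almost entirely by the surface distance cue, not the goal."""
    goal_term = GOAL_WEIGHT if goal_present else 0.0
    return sigmoid(BIAS + DISTANCE_WEIGHT * distance_km + goal_term)

# The influence ratio between cues, in the spirit of the paper's 8.7-38x.
influence_ratio = abs(DISTANCE_WEIGHT) / abs(GOAL_WEIGHT)
```

Under this toy parameterization, varying the distance sweeps the answer from near-certain "walk" to near-certain "don't walk", while toggling the goal barely moves the output, which is the qualitative signature of a surface heuristic overriding the task constraint.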
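Goal-decomposition prompting, which the abstract credits with a +6 to 9 pp recovery, can be sketched as a simple prompt wrapper. The exact wording below is hypothetical, not the paper's template; only the core idea of forcing precondition enumeration before answering comes from the source.

```python
def goal_decomposition_prompt(question: str) -> str:
    """Wrap a question so the model must enumerate the goal's
    preconditions and check each one before committing to an answer.
    The phrasing is an illustrative assumption, not the paper's prompt."""
    return (
        "Before answering, list every object and condition the stated goal "
        "requires (its preconditions). For each precondition, check whether "
        "the scenario satisfies it. Only then give your final answer.\n\n"
        f"Question: {question}"
    )
```

The design intent is that enumerating preconditions surfaces the unstated feasibility constraint (e.g., the car must be present at the destination) before the salient distance cue can dominate the answer.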
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29025 [cs.CL]
(or arXiv:2603.29025v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29025
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Yubo Li [v1] Mon, 30 Mar 2026 21:36:09 UTC (475 KB)