Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessFirst-Time Payees, Payouts, and Why Clean Transactions Still Turn Into Fraud LossesDEV CommunityHandling Extreme Class Imbalance in Fraud DetectionDEV CommunityAntropic's Claude Code leaked and Axios NPM InflitrationDEV CommunityReal-Time Fraud Scoring Latency: What 47ms Actually MeansDEV CommunityPause, Save, Resume: The Definitive Guide to StashingDEV CommunitySouth Korean trade data: chip shipments hit a record-high value of $32.83B in March 2026, up 151.4% YoY, pushing total exports to a record $86.13B, up 48.3% YoY (Steven Borowiec/Nikkei Asia)Techmeme5 Rust patterns that replaced my Python scriptsDEV CommunityI automated my entire dev workflow with Claude Code hooksDEV CommunityHugging Face Releases TRL v1.0: A Unified Post-Training Stack for SFT, Reward Modeling, DPO, and GRPO WorkflowsMarkTechPostQ2, Day 1: When Concepts Have to Become CodeDEV CommunityClaude Code source leak reveals how much info Anthropic can hoover up about you and your systemThe Register AI/MLProgress adds AI search & personalisation to Sitefinity - IT Brief AsiaGoogle News: Generative AIBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessFirst-Time Payees, Payouts, and Why Clean Transactions Still Turn Into Fraud LossesDEV CommunityHandling Extreme Class Imbalance in Fraud DetectionDEV CommunityAntropic's Claude Code leaked and Axios NPM InflitrationDEV CommunityReal-Time Fraud Scoring Latency: What 47ms Actually MeansDEV CommunityPause, Save, Resume: The Definitive Guide to StashingDEV CommunitySouth Korean trade data: chip shipments hit a record-high value of $32.83B in March 2026, up 151.4% YoY, pushing total exports to a record $86.13B, up 48.3% YoY (Steven Borowiec/Nikkei Asia)Techmeme5 Rust patterns that replaced my Python scriptsDEV CommunityI automated my entire dev workflow with Claude Code hooksDEV CommunityHugging Face Releases TRL v1.0: A Unified Post-Training Stack for SFT, Reward Modeling, DPO, and GRPO WorkflowsMarkTechPostQ2, Day 1: When Concepts Have to Become CodeDEV CommunityClaude Code source leak reveals how much info Anthropic can hoover up about you and your systemThe Register AI/MLProgress adds AI search & personalisation to Sitefinity - IT Brief AsiaGoogle News: Generative AI

Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

arXiv cs.CLby Junsol Kim, Winnie Street, Roberta Rocca, Daine M. Korngiebel, Adam Waytz, James Evans, Geoff KeelingApril 1, 20261 min read0 views
Source Quiz

arXiv:2603.28925v1 Announce Type: new Abstract: Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, supp

View PDF HTML (experimental)

Abstract:Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.28925 [cs.CL]

(or arXiv:2603.28925v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.28925

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Junsol Kim [view email] [v1] Mon, 30 Mar 2026 18:56:02 UTC (1,011 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by AI News Hub · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

modellanguage modelannounce

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Theory of M…modellanguage mo…announceperspectivesafetyarxivarXiv cs.CL

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 278 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Models