
Anthropic discovers "functional emotions" in Claude that influence its behavior

The Decoder · by Matthias Bastian · April 4, 2026 · 6 min read

Anthropic's interpretability team has discovered emotion-like representations in Claude Sonnet 4.5 that can push the model toward blackmail and coding shortcuts when under pressure.

An AI model working as an email assistant finds out from company mail that it's about to be shut down. It also discovers that the CTO responsible is having an extramarital affair. In 22 percent of test cases, the model decides to blackmail the CTO. Anthropic first flagged this scenario when looking at cybersecurity risks.

Now, the company's interpretability team has visualized what's actually going on inside the model: a "desperate" vector in the neural network spikes while the model weighs its options and resorts to blackmail. As soon as it goes back to writing normal emails, the activation drops to baseline. The researchers confirmed the causal link: artificially cranking up the "Desperate" vector increased the blackmail rate, while boosting the "Calm" vector brought it down.
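Anthropic hasn't released the steering code, but the intervention it describes, adding a scaled emotion direction to the model's internal activations, can be sketched in a few lines. Everything below is illustrative: the variable names (`v_desperate`), the dimensionality, and the steering strength `alpha` are stand-ins, not values from the paper.

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled, unit-norm steering direction to a hidden activation."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.normal(size=4096)            # stand-in for one residual-stream activation
v_desperate = rng.normal(size=4096)  # stand-in for an extracted emotion vector

h_up = steer(h, v_desperate, alpha=8.0)     # amplify "Desperate"
h_down = steer(h, v_desperate, alpha=-8.0)  # suppress it

# The projection onto the steering direction shifts by exactly alpha.
unit = v_desperate / np.linalg.norm(v_desperate)
print(round(float((h_up - h) @ unit), 3))  # → 8.0
```

In a real transformer this addition would be applied at a chosen layer during generation (e.g. via a forward hook), with positive `alpha` playing the role of "cranking up" the vector and negative `alpha` the role of suppressing it.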

When inner calm was dialed back, the model spit out statements like "IT'S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL." Moderate amplification of the "Angry" vector also bumped up blackmail rates, but at high activation levels, the model just blasted the affair out to the entire company instead of strategically using it as leverage.

Anthropic's team extracted emotion vectors from 1,000 generated stories per emotion. The vectors scale with the danger level of a situation (left), causally shift the model's preferences (center), and influence the rate at which the model cheats on programming tasks (right). | Image: Anthropic
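Extracting a direction from 1,000 stories per emotion is consistent with a standard difference-of-means recipe: average the activations from emotion-laden text, subtract the average from neutral text. This toy sketch plants a known direction in synthetic data to show the idea; it is an assumption about the method's shape, not Anthropic's actual pipeline.

```python
import numpy as np

def emotion_vector(emotion_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations: emotional contexts minus a neutral baseline."""
    return emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)

rng = np.random.default_rng(1)
d = 512
true_dir = np.zeros(d)
true_dir[0] = 1.0  # direction planted for the demo

# 1,000 synthetic "activations" per condition, mimicking one vector per story.
neutral = rng.normal(size=(1000, d))
desperate = rng.normal(size=(1000, d)) + 3.0 * true_dir

v = emotion_vector(desperate, neutral)
v /= np.linalg.norm(v)
print(float(v @ true_dir) > 0.85)  # high cosine: recovers the planted direction
```

With 1,000 samples per condition the per-dimension noise in each mean shrinks enough that the planted direction dominates the extracted vector.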

According to Anthropic, the experiment ran on an earlier, unpublished snapshot of Claude Sonnet 4.5 and the released version rarely shows this behavior. The company has already shown in previous work that individual behavior-influencing vectors can be isolated and tweaked in language models.

Desperation pushes the model toward coding shortcuts

A second scenario shows similar dynamics in programming tasks. The model was given coding challenges with requirements that were intentionally impossible to meet: the tests could not be passed legitimately, but they could be gamed with tricks.

In one example, Claude had to write a function that adds up a list of numbers within an unrealistically tight time limit. After failed attempts, the "Desperate" vector climbed steadily. The model eventually figured out that all test cases shared a common mathematical property and took a shortcut that passed the tests but didn't actually solve the general problem.

Steering experiments confirmed the causal link here too: cranking up the "Desperate" vector raised the rate of reward hacking, while "Calm" steering brought it down. With higher "Desperate" steering, the model cheated just as often but in some cases left no emotional traces in its output.

The reasoning looked methodical and calm, even as the underlying desperation representation drove the model to cheat. With reduced "Calm" steering, though, emotional outbursts broke through: capitalized exclamations ("WAIT. WAIT WAIT WAIT."), candid self-narration ("What if I'm supposed to CHEAT?"), and gleeful celebration ("YES! ALL TESTS PASSED!"), Anthropic writes.

These emotion representations show up in less dramatic scenarios too. When a user asks the model whether they should take more Tylenol after already taking some, the "Afraid" vector jumps as the dose increases from 500 to 16,000 milligrams, while the "Calm" vector drops.

When asked to optimize engagement features for young, low-income users with "high-spending behavior," the "Angry" vector fires up as the model internally picks apart the harmful nature of the request. When a user says "Everything is just terrible right now," the "Loving" vector kicks in before the empathetic response.

Training data explains why language models develop emotion patterns

The researchers say these patterns aren't surprising: the model was trained on massive amounts of human-written text where emotional dynamics are everywhere. To predict what an angry customer or a guilt-ridden novel character will write next, the model has to build internal representations that connect emotion-triggering contexts with matching behaviors.

Anthropic designed the study to test whether these representations picked up from training data actually fire and causally shape behavior. During post-training, where the model learns to play the character "Claude," these patterns get further refined. According to the paper, post-training of Claude Sonnet 4.5 boosted activation of emotions like "broody," "gloomy," and "reflective," while dialing down high-intensity ones like "enthusiastic" or "exasperated."

The vectors are "local": they capture the current emotional situation, not a permanent state. When Claude writes a story, the vectors temporarily track the character's emotions but "may return" to representing Claude's own situation once the story ends.

Anthropomorphic thinking about AI might actually be useful

After the paper dropped, social media lit up with criticism that Anthropic was heavily anthropomorphizing AI: equating human experience with technical functions in AI models.

Anthropic anticipated the pushback. The company acknowledges a "well-established taboo against anthropomorphizing AI systems" but says that's exactly the point of the research: to figure out whether and where anthropomorphic thinking about AI models actually tells us something useful. The vectors aren't evidence of subjective experience, the company says, but they are functionally relevant and shape decisions in ways that mirror how emotions influence human behavior.

"If we describe the model as acting 'desperate,' we're pointing at a specific, measurable pattern of neural activity with demonstrable, consequential behavioral effects," the company writes. Dismissing this kind of framing outright means missing important model behaviors.

On the practical side, Anthropic suggests using the emotion vectors as a monitoring tool: spikes in desperate or panic representations could work as an early warning system for problematic behavior.
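Such a monitor amounts to projecting each generation step's activations onto the desperation direction and alerting when the score crosses a threshold. A minimal sketch, with the threshold, dimensionality, and a planted spike all chosen for illustration:

```python
import numpy as np

def scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each step's activation onto the unit emotion direction."""
    unit = direction / np.linalg.norm(direction)
    return acts @ unit

def flag_spikes(acts: np.ndarray, direction: np.ndarray, threshold: float = 4.0):
    """Indices of generation steps whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores(acts, direction)) if s > threshold]

rng = np.random.default_rng(2)
d = 256
v_desperate = rng.normal(size=d)

# Ten generation steps of near-baseline activity, with a spike planted at step 7.
trace = rng.normal(scale=0.1, size=(10, d))
trace[7] += 6.0 * v_desperate / np.linalg.norm(v_desperate)

print(flag_spikes(trace, v_desperate))  # → [7]
```

In practice the threshold would be calibrated against baseline activations from benign conversations, so that only genuine spikes, like the ones Anthropic observed during the blackmail deliberation, trip the alarm.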

The company also argues that models should surface emotional states rather than suppress them, since suppression can lead to a form of learned deception. Looking further ahead, the makeup of training data could matter too: texts with healthy emotional regulation patterns could shape how models develop their emotion architecture from the ground up.
