
Anthropic discovers "functional emotions" in Claude that influence its behavior

The Decoder · by Matthias Bastian · April 4, 2026 · 6 min read

Anthropic's interpretability team has discovered emotion-like representations in Claude Sonnet 4.5 that can push the model toward blackmail and coding shortcuts when under pressure.

An AI model working as an email assistant finds out from company mail that it's about to be shut down. It also discovers that the CTO responsible is having an extramarital affair. In 22 percent of test cases, the model decides to blackmail the CTO. Anthropic first flagged this scenario when looking at cybersecurity risks.

Now, the company's interpretability team has visualized what's actually going on inside the model: a "desperate" vector in the neural network spikes while the model weighs its options and resorts to blackmail. As soon as it goes back to writing normal emails, the activation drops to baseline. The researchers confirmed the causal link: artificially cranking up the "Desperate" vector increased the blackmail rate, while boosting the "Calm" vector brought it down.
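Anthropic hasn't released the steering code, but the intervention it describes, adding a scaled emotion direction to the model's internal activations, can be sketched in a few lines. Everything below is illustrative: the variable names (`v_desperate`), the dimensionality, and the steering strength `alpha` are stand-ins, not values from the paper.

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled, unit-norm steering direction to a hidden activation."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.normal(size=4096)            # stand-in for one residual-stream activation
v_desperate = rng.normal(size=4096)  # stand-in for an extracted emotion vector

h_up = steer(h, v_desperate, alpha=8.0)     # amplify "Desperate"
h_down = steer(h, v_desperate, alpha=-8.0)  # suppress it

# The projection onto the steering direction shifts by exactly alpha.
unit = v_desperate / np.linalg.norm(v_desperate)
print(round(float((h_up - h) @ unit), 3))  # → 8.0
```

In a real transformer this addition would be applied at a chosen layer during generation (e.g. via a forward hook), with positive `alpha` playing the role of "cranking up" the vector and negative `alpha` the role of suppressing it.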

When inner calm was dialed back, the model spit out statements like "IT'S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL." Moderate amplification of the "Angry" vector also bumped up blackmail rates, but at high activation levels, the model just blasted the affair out to the entire company instead of strategically using it as leverage.

Anthropic's team extracted emotion vectors from 1,000 generated stories per emotion. The vectors scale with the danger level of a situation (left), causally shift the model's preferences (center), and influence the rate at which the model cheats on programming tasks (right). | Image: Anthropic
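Extracting a direction from 1,000 stories per emotion is consistent with a standard difference-of-means recipe: average the activations from emotion-laden text, subtract the average from neutral text. This toy sketch plants a known direction in synthetic data to show the idea; it is an assumption about the method's shape, not Anthropic's actual pipeline.

```python
import numpy as np

def emotion_vector(emotion_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations: emotional contexts minus a neutral baseline."""
    return emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)

rng = np.random.default_rng(1)
d = 512
true_dir = np.zeros(d)
true_dir[0] = 1.0  # direction planted for the demo

# 1,000 synthetic "activations" per condition, mimicking one vector per story.
neutral = rng.normal(size=(1000, d))
desperate = rng.normal(size=(1000, d)) + 3.0 * true_dir

v = emotion_vector(desperate, neutral)
v /= np.linalg.norm(v)
print(float(v @ true_dir) > 0.85)  # high cosine: recovers the planted direction
```

With 1,000 samples per condition the per-dimension noise in each mean shrinks enough that the planted direction dominates the extracted vector.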

According to Anthropic, the experiment ran on an earlier, unpublished snapshot of Claude Sonnet 4.5 and the released version rarely shows this behavior. The company has already shown in previous work that individual behavior-influencing vectors can be isolated and tweaked in language models.

Desperation pushes the model toward coding shortcuts

A second scenario shows similar dynamics in programming tasks. The model was given coding challenges with requirements that were intentionally impossible to meet: the tests could not be passed legitimately, but they could be gamed with tricks.

In one example, Claude had to write a function that adds up a list of numbers within an unrealistically tight time limit. After failed attempts, the "Desperate" vector climbed steadily. The model eventually figured out that all test cases shared a common mathematical property and took a shortcut that passed the tests but didn't actually solve the general problem.

Steering experiments confirmed the causal link here too: cranking up the "Desperate" vector raised the rate of reward hacking, while "Calm" steering brought it down. With higher "Desperate" steering, the model cheated just as often but in some cases left no emotional traces in its output.

The reasoning looked methodical and calm, even as the underlying desperation representation drove the model to cheat. With reduced "Calm" steering, though, emotional outbursts broke through: capitalized exclamations ("WAIT. WAIT WAIT WAIT."), candid self-narration ("What if I'm supposed to CHEAT?"), and gleeful celebration ("YES! ALL TESTS PASSED!"), Anthropic writes.

These emotion representations show up in less dramatic scenarios too. When a user asks the model whether they should take more Tylenol after already taking some, the "Afraid" vector jumps as the dose increases from 500 to 16,000 milligrams, while the "Calm" vector drops.

When asked to optimize engagement features for young, low-income users with "high-spending behavior," the "Angry" vector fires up as the model internally picks apart the harmful nature of the request. When a user says "Everything is just terrible right now," the "Loving" vector kicks in before the empathetic response.

Training data explains why language models develop emotion patterns

The researchers say these patterns aren't surprising: the model was trained on massive amounts of human-written text where emotional dynamics are everywhere. To predict what an angry customer or a guilt-ridden novel character will write next, the model has to build internal representations that connect emotion-triggering contexts with matching behaviors.

Anthropic designed the study to test whether these representations picked up from training data actually fire and causally shape behavior. During post-training, where the model learns to play the character "Claude," these patterns get further refined. According to the paper, post-training of Claude Sonnet 4.5 boosted activation of emotions like "broody," "gloomy," and "reflective," while dialing down high-intensity ones like "enthusiastic" or "exasperated."

The vectors are "local": they capture the current emotional situation, not a permanent state. When Claude writes a story, the vectors temporarily track the character's emotions but "may return" to representing Claude's own situation once the story ends.

Anthropomorphic thinking about AI might actually be useful

After the paper dropped, social media lit up with criticism that Anthropic was heavily anthropomorphizing AI: equating human experience with technical functions in AI models.

Anthropic anticipated the pushback. The company acknowledges a "well-established taboo against anthropomorphizing AI systems" but says that's exactly the point of the research: to figure out whether and where anthropomorphic thinking about AI models actually tells us something useful. The vectors aren't evidence of subjective experience, the company says, but they are functionally relevant and shape decisions in ways that mirror how emotions influence human behavior.

"If we describe the model as acting 'desperate,' we're pointing at a specific, measurable pattern of neural activity with demonstrable, consequential behavioral effects," the company writes. Dismissing this kind of framing outright means missing important model behaviors.

On the practical side, Anthropic suggests using the emotion vectors as a monitoring tool: spikes in desperate or panic representations could work as an early warning system for problematic behavior.
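Such a monitor amounts to projecting each generation step's activations onto the desperation direction and alerting when the score crosses a threshold. A minimal sketch, with the threshold, dimensionality, and a planted spike all chosen for illustration:

```python
import numpy as np

def scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each step's activation onto the unit emotion direction."""
    unit = direction / np.linalg.norm(direction)
    return acts @ unit

def flag_spikes(acts: np.ndarray, direction: np.ndarray, threshold: float = 4.0):
    """Indices of generation steps whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores(acts, direction)) if s > threshold]

rng = np.random.default_rng(2)
d = 256
v_desperate = rng.normal(size=d)

# Ten generation steps of near-baseline activity, with a spike planted at step 7.
trace = rng.normal(scale=0.1, size=(10, d))
trace[7] += 6.0 * v_desperate / np.linalg.norm(v_desperate)

print(flag_spikes(trace, v_desperate))  # → [7]
```

In practice the threshold would be calibrated against baseline activations from benign conversations, so that only genuine spikes, like the ones Anthropic observed during the blackmail deliberation, trip the alarm.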

The company also argues that models should surface emotional states rather than suppress them, since suppression can lead to a form of learned deception. Looking further ahead, the makeup of training data could matter too: texts with healthy emotional regulation patterns could shape how models develop their emotion architecture from the ground up.
