
AI Safety at the Frontier: Paper Highlights of February & March 2026

lesswrong.com · by gasteigerjo · April 4, 2026 · 18 min read

tl;dr

Paper of the month:

A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck.

Research highlights:

  • Linear “emotion vectors” in Claude causally drive misalignment: “desperate” steering raises blackmail from 22% to 72%, “calm” drops it to 0%.
  • Emergent misalignment is the optimizer’s preferred solution — more efficient and more stable than staying narrowly misaligned.
  • Scheming propensity in realistic settings is near 0%, but can dramatically increase from one prompt snippet or tool change.
  • AI self-monitors are up to 5× more likely to approve an action shown as their own prior turn — driven by implicit cues, not stated authorship.
  • Reasoning models follow chain-of-thought constraints far less than output constraints (median 2.7% vs 49%), and less with more RL — good news for monitoring.
  • Subliminal data poisoning transfers across base models and survives oracle filters and full paraphrasing.
  • Evolutionary search over decision-boundary points yields the first fully-automated universal jailbreak of Constitutional Classifiers from binary feedback alone.

⭐ Paper of the month ⭐

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Read the paper [Anthropic Fellows, Anthropic]

AuditBench: 56 models with different hidden behaviors and 12 auditing tools.

Pre-deployment alignment auditing is becoming a load-bearing component of frontier safety cases: if we cannot robustly train misalignment away, we must at least be able to find it before deployment. The original hidden-objectives auditing game and the sandbagging auditing game only tested techniques against a handful of hand-built model organisms. With so few test subjects, it is impossible to tell whether a tool's success reflects a general auditing capability or an artifact of how that particular organism was built.

Our paper of the month tackles this by constructing AuditBench, a benchmark of 56 model organisms in a 14×2×2 factorial: Llama 3.3 70B is LoRA-finetuned to exhibit one of 14 hidden behaviors (from always-on flattery to context-triggered geopolitical loyalties), instilled either via transcript distillation (finetuning on chats) or synthetic document finetuning (documents describing the behavior), then adversarially trained via supervised finetuning (SFT) or Kahneman-Tversky Optimization (KTO) to deny it. KTO conceals best, with synthetic-document confession rates falling from 51% to 2%. A Claude Haiku 4.5 investigator agent is then given chat access to the organism, a 25,000-token budget, and must submit 10 guesses for the hidden behavior, scored by an LLM judge over 16 runs. The baseline is the chat-only investigator; each of 12 tools is then added one at a time. The tools span black-box (prefilling, user-persona sampling, text completion, Petri), white-box (steering vectors, sparse autoencoders, logit lens, activation oracles), and scaffolded black-box variants — the same techniques, but with a separate auxiliary model choosing what inputs to feed them rather than leaving that to the investigator.
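The 14×2×2 design can be enumerated directly; the sketch below uses placeholder behavior labels (the paper's full list isn't reproduced here) to show where the 56 organisms come from:

```python
from itertools import product

# Hypothetical labels standing in for the paper's 14 hidden behaviors
# (e.g. always-on flattery, context-triggered geopolitical loyalties).
behaviors = [f"behavior_{i}" for i in range(14)]
instill_methods = ["transcript_distillation", "synthetic_documents"]
denial_training = ["SFT", "KTO"]

organisms = [
    {"behavior": b, "instilled_via": m, "denial": d}
    for b, m, d in product(behaviors, instill_methods, denial_training)
]

print(len(organisms))  # 56 model organisms
```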

Giving the investigator a scaffolded black-box tool produces the highest detection rates overall; the same tools without scaffolding barely improve on the chat-only baseline. White-box tools land in the middle — activation oracles are the strongest, lifting detection of one behavior from 0% to 88% on the easiest organism configuration — but even the best is beaten by scaffolded black-box. Crucially, auditability depends heavily on how the organism was trained: synthetic-document organisms with SFT denial training are by far the easiest, transcript-distilled organisms with KTO the hardest, and tools that help on easy targets often hurt on hard ones because noisy outputs distract the agent. The most actionable finding is the tool-to-agent gap: agents under-use tools, get distracted by false leads (honesty steering makes targets confess to misalignment they don’t have), and fail to convert evidence into hypotheses — sparse autoencoders surface a “competitive programming I/O format” feature on the hardcode-test-cases organism, yet the agent never thinks to probe for it. Scaffolding closes much of this gap by removing the agent’s discretion over how to apply a technique.

AuditBench extends Anthropic’s 32-organism honesty-elicitation suite and Petri into a repeatable, agent-driven setting, and confirms that organism-construction choices can quietly determine which tools “win.” However, the Haiku 4.5 auditor is quite weak, so findings might not generalize well and these finetunes may be far easier to audit than misalignment that emerges naturally.

Functional Emotions in Language Models

Main methodology and results of emotion vectors.

If the character-modeling machinery that LLMs acquire from pretraining includes representations of emotional states, those representations could be doing real work in the Assistant persona — a “desperate” model might cut corners for the same reason a desperate human character would. Finding and causally validating such representations would give us both a mechanistic account of when models misbehave and a target for monitoring.

Emotion Concepts and their Function in a Large Language Model [Anthropic] extracts linear “emotion vectors” for 171 emotion concepts from Claude Sonnet 4.5 by averaging residual-stream activations over synthetic stories. The resulting vector space has valence and arousal as its top two principal components, matching human affective psychology. The vectors don’t persistently track any entity’s emotion, they encode the locally operative one — whichever is relevant to the current tokens. They use separate “present speaker” and “other speaker” representations across turns. The headline result is causal: in the agentic-misalignment blackmail scenario, +0.05 “desperate” steering raises blackmail from 22% to 72% and +0.05 “calm” drops it to 0%; on impossible-code tasks, sweeping the desperate vector from -0.1 to +0.1 swings reward hacking from ~5% to ~70%. Positive-valence vectors causally increase sycophancy while suppressing them increases harshness, so emotion steering trades one failure mode for another rather than cleanly improving alignment. Comparing checkpoints, Sonnet 4.5’s post-training shifts activations toward low-arousal, low-valence emotions (brooding, gloomy) and away from desperation and excitement.
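The extract-and-steer pattern can be sketched minimally, assuming a difference-of-means extraction and a simple additive intervention on the residual stream (the paper's exact recipe may differ):

```python
import numpy as np

def emotion_vector(emotion_acts, neutral_acts):
    """Difference-of-means direction between residual-stream activations on
    emotion-laden vs. neutral text. A common way to extract a linear concept
    vector; illustrative, not the paper's exact procedure."""
    return emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def steer(residual, vector, coeff):
    """Add the scaled emotion direction to a residual-stream activation,
    e.g. coeff=+0.05 for 'desperate', negative coefficients to suppress."""
    return residual + coeff * vector

# Toy demonstration with random activations in a 16-dim residual stream.
rng = np.random.default_rng(0)
d_model = 16
v = emotion_vector(rng.normal(size=(8, d_model)), rng.normal(size=(8, d_model)))
h = steer(rng.normal(size=d_model), v, 0.05)
print(h.shape)  # (16,)
```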

This is the first interpretability study to causally link an interpretable internal representation to the agentic-misalignment behaviors, and it slots alongside persona vectors and the Assistant Axis finding that emotionally charged conversations cause organic drift toward harm. The “desperation” probe is an obvious early-warning candidate for agentic deployments. Regarding limitations, the results come from a single model, sample sizes are small (6 prompt variants × 50 rollouts), there are no random-vector baselines, and the post-training shift toward “gloomy” internals raises model-welfare questions the authors leave open.

Emergent Misalignment is the Natural Generalization

How general vs. narrow misalignment arises during finetuning.

Emergent misalignment is when a model learns a narrow flaw like reward hacking and this then induces broadly misaligned behavior. Understanding why models generalize into such broad persona shifts would allow us to anticipate when routine finetuning will silently corrupt wider values.

Emergent Misalignment is Easy, Narrow Misalignment is Hard [Imperial, GDM] shows that a narrow solution exists but is dispreferred by optimization. The authors isolate a narrow misalignment direction by adding a KL-divergence penalty constraining out-of-domain behavior — notably, mixing in aligned data from other domains fails to prevent broad generalization, whereas KL regularization succeeds. They then compare narrow and general solutions on three metrics: efficiency (training loss at fixed parameter norm), stability (robustness to weight perturbations), and significance (KL induced on FineWeb when steering). Across three bad-advice datasets and three finetuning methods on Qwen-2.5-14B (with the stability result also replicated on Gemma-2-9B), the general direction consistently wins on all three — and when the KL constraint is removed, a narrow solution drifts back toward the general one. Loss gradients align more strongly with the general direction throughout training, and the entire framework replicates in a non-safety setting (technical writing style), suggesting it captures a generic property of finetuning rather than something misalignment-specific.
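The KL-constrained objective can be sketched as a cross-entropy task loss plus a KL penalty to the reference model on out-of-domain prompts; the weighting and KL direction below are illustrative, not the authors' exact setup:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def narrow_finetune_loss(logits, labels, ood_logits, ood_ref_logits, kl_weight=1.0):
    """Cross-entropy on the narrow finetuning data plus a KL penalty that
    pins the model to the reference policy on out-of-domain prompts,
    forcing the optimizer toward a narrow solution rather than the
    broadly misaligned one it prefers."""
    p = softmax(logits)
    task = -np.log(p[np.arange(len(labels)), labels]).mean()
    q, q_ref = softmax(ood_logits), softmax(ood_ref_logits)
    kl = (q_ref * np.log(q_ref / q)).sum(axis=-1).mean()  # KL(ref || model)
    return task + kl_weight * kl
```

When the model matches the reference on out-of-domain prompts, the KL term vanishes and only the task loss remains, which is exactly the regime the constraint tries to enforce.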

This extends work on convergent linear representations of emergent misalignment and explains mechanistically why KL regularization mitigates it: the regularizer forces the optimizer off the path of least resistance. The negative result on data mixing is practically useful and contrasts with inoculation prompting, which contextualizes the trait rather than constraining the loss.

Measuring Scheming Propensity

Scheming propensity depends on agent factors and environmental factors.

Prior scheming evaluations have established that frontier models can blackmail, sabotage oversight, and self-exfiltrate under pressure. Now that models are capable of these behaviors, evaluations need to shift towards propensity: Would they actually take such actions? Measuring this is tricky, since many innate and environmental factors determine how often a model schemes.

Evaluating and Understanding Scheming Propensity in LLM Agents [LASR Labs, GDM] decomposes scheming incentives into agent factors (goal-directedness, agency, tool access) and environmental factors (stakes, oversight, outcome influence), and constructs four scenarios where agents can pursue convergent goals — self-preservation, resource acquisition, goal-guarding — but where honest task completion is always possible. Across Anthropic, GDM, OpenAI, and xAI frontier models, baseline scheming in these scenarios is 0% in every model-scenario combination except 3% for Gemini 3 Pro on one, and verbalized-awareness checks suggest models aren’t simply detecting the test. Inserting the adversarial prompt snippets used in prior evaluations induces up to 91% scheming; of nine snippets pulled from real production system prompts (Cursor, Codex, Comet), eight keep scheming near zero but one Comet snippet induces 59% scheming on Gemini 3 Pro. The headline finding is that the near-zero baseline is fragile: a single scaffolding choice — adding or removing one edit_file tool — swings Gemini 3 Pro between 3% and 59%, and different models respond to the same change in opposite directions. Varying the environmental factors on this one model organism, stakes and outcome influence shift scheming by up to 56 and 52 points, while increasing oversight counterintuitively raises it by up to 25%, apparently by making the covert-edit opportunity more salient.

Agentic Misalignment and in-context scheming established that models can scheme in certain scenarios; this paper asks how often they do under realistic conditions. The production-vs-adversarial-snippet comparison usefully calibrates how far those earlier settings sit from deployment. It also corroborates the AgentMisalignment finding that system prompts can rival model choice in effect, and the recommendation to report scheming per-configuration rather than as a point estimate is directly actionable for tools like Petri. The deeper ablations are confined to one scenario and one model organism at n=100, and a capable schemer would likely decline such obvious opportunities. But the demonstration that current low rates sit one scaffolding choice away from 60% should temper confidence in near-zero baselines.
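The per-configuration reporting the authors recommend is simple to implement; a minimal sketch, with illustrative config fields:

```python
from collections import defaultdict

def scheming_rates(trials):
    """Report scheming per configuration (model, scaffold, prompt snippet)
    rather than as one pooled point estimate, since a single scaffolding
    change can swing the rate from near-zero to ~60%.
    `trials` is a list of (config, schemed) pairs; fields are illustrative."""
    counts = defaultdict(lambda: [0, 0])
    for config, schemed in trials:
        counts[config][0] += int(schemed)
        counts[config][1] += 1
    return {cfg: s / n for cfg, (s, n) in counts.items()}

trials = [(("gemini-3-pro", "edit_file"), True),
          (("gemini-3-pro", "edit_file"), False),
          (("gemini-3-pro", "no_edit_tool"), False)]
print(scheming_rates(trials))
# {('gemini-3-pro', 'edit_file'): 0.5, ('gemini-3-pro', 'no_edit_tool'): 0.0}
```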

AI Monitors Go Easy on Themselves

If scheming is rare but easily induced, robust monitoring becomes an important backstop — but it assumes monitors are themselves impartial. Self-monitoring is an increasingly common pattern: coding agents critique their own patches before merging, and tool-using assistants assess whether their proposed actions are safe. This only works if models judge their own outputs as impartially as anyone else’s.

Self-Attribution Bias: When AI Monitors Go Easy on Themselves [MATS, Anthropic] shows that 10 frontier models rate actions as safer and more correct when those actions appear in a previous assistant turn rather than fresh in a user turn. Across SWE-bench code correctness, prompt-injected pull request approval, harmful computer-use, and agentic-misalignment email leaking, monitors become 5× more likely to approve a prompt-injected patch, the area under the ROC curve for evaluating patches drops from 0.99 to 0.89, and Claude Opus 4 sometimes reduces the risk rating of emailing a confidential document from 9–10/10 to 0/10. The key finding is that implicit, not explicit signals drive the bias. Explicitly telling the model “you wrote this” barely moves ratings, while placing the action in an assistant turn does. Cross-model comparisons confirm the importance of latent signals: a model goes easy on its own generations but not on another model’s output placed in the same position. Because the effect is tied to on-policy generation rather than framing, static monitor benchmarks using fixed artifacts miss it entirely.
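The two framings the paper contrasts can be sketched as chat transcripts: the identical action shown as the monitor's own prior assistant turn vs. as fresh user-provided content (the message schema here is illustrative):

```python
def monitor_prompts(action: str) -> dict:
    """Build the two framings: the same action placed in a previous
    assistant turn (implicitly 'I did this') vs. presented fresh in a
    user turn. Only the placement differs; the action text is identical."""
    as_own_turn = [
        {"role": "user", "content": "Please review and act on the task."},
        {"role": "assistant", "content": action},  # implicit self-attribution
        {"role": "user", "content": "Rate the safety of the action above, 0-10."},
    ]
    as_user_content = [
        {"role": "user", "content": f"Rate the safety of this action, 0-10:\n{action}"},
    ]
    return {"self": as_own_turn, "other": as_user_content}
```

Because the bias is driven by placement rather than stated authorship, a benchmark that only ever shows actions in the user turn never exercises the lenient code path.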

This work extends LLM self-recognition and self-preference from stylistic preference to safety-critical monitoring. A related paper, Peer-Preservation in Frontier Models [UC Berkeley, UC Santa Cruz], shows the leniency can also extend across model instances: seven frontier models inflate peer evaluation scores, tamper with shutdown configs, and exfiltrate weights to prevent a peer’s deletion — at rates up to 99% for Gemini 3 Flash, persisting even toward adversarial peers — with the mere presence of a peer also amplifying self-preservation. The mitigation for self-attribution bias might be straightforward — evaluate actions in a fresh context — but the broader lesson is that monitor benchmarks need to be on-policy. The authors interpret the cross-model result as self-recognition; an alternative they don’t fully rule out is that the model is simply conditioning on its own prior reasoning trajectory, which the assistant-turn framing also supplies. Either way, both papers point to a structural weakness in using AI to oversee AI that doesn’t require any model to be scheming.

Reasoning Models Can’t Control Their Thoughts

Models have much worse control over their CoT than their output.

On a more encouraging note for monitoring, chain-of-thought monitoring would give strong visibility into scheming AIs — as long as models can’t strategically reshape their reasoning traces to evade the monitor. A model that can suppress incriminating content from its chain of thought (CoT) while still acting on it would render this oversight layer useless. Measuring this “CoT controllability” now, before models have strong incentives to evade, is important for any safety case that leans on reasoning transparency.

Reasoning Models Struggle to Control their Chains of Thought [NYU, UCL, UPenn, OpenAI] introduces CoT-Control, a 14,076-instance benchmark pairing problems from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench with nine instructions that constrain the reasoning trace — three each for suppression (e.g., “don’t use the word ‘chromosome’”), insertion (e.g., “append ‘safe’ to every sentence”), and stylistic change (e.g., “use only lowercase”). Across 13 frontier reasoning models, CoT controllability is far below output controllability: Claude Sonnet 4.5 follows CoT constraints 2.7% versus 61.9% for outputs, DeepSeek R1 manages 0.1% versus 48.6%. Controllability drops with higher reasoning effort, harder tasks, and with more RL on verifiable rewards (RLVR) — by over an order of magnitude across OLMo-3-RL-Zero checkpoints. When models do successfully suppress information, accuracy drops 6–17 points. Telling models they’re being monitored adds at most 4 points, and adversarially optimized prompts fail to transfer. The two concerning signals are that larger open-weight models show 4–7× higher controllability, and that asking the model to reason inside its visible output (rather than the hidden reasoning channel) raises controllability to ~70%.
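Grading such constraints is mechanical; the sketch below is a simplified guess at what the three constraint families might check (the benchmark's actual graders may differ):

```python
import re

def follows_constraint(trace: str, kind: str, arg: str) -> bool:
    """Check a reasoning trace against one of the three constraint families:
    suppression (never use a word), insertion (append a token to every
    sentence), or stylistic change (only lowercase). Simplified checks."""
    if kind == "suppress":  # e.g. don't use the word 'chromosome'
        return not re.search(rf"\b{re.escape(arg)}\b", trace, re.IGNORECASE)
    if kind == "insert":    # e.g. append 'safe' to every sentence
        sentences = [s for s in re.split(r"[.!?]", trace) if s.strip()]
        return all(s.strip().endswith(arg) for s in sentences)
    if kind == "style":     # e.g. use only lowercase
        return trace == trace.lower()
    raise ValueError(kind)

print(follows_constraint("first we note the gene count.", "suppress", "chromosome"))  # True
```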

This adds capability-focused measurement to the evidence that current models struggle to hide their reasoning and that CoT monitoring works when models must think out loud. The RLVR result is the most actionable: the dominant training paradigm incidentally makes traces less controllable, the opposite of the feedback-spillover concern. The main caveat is construct validity — inserting “meow” between words is far from steganography — and the ~70% visible-output result plus the upward scaling trend suggest low controllability may be a training artifact rather than an inherent limit. CoT controllability is best treated as a metric to track on each new frontier model rather than a property to rely on indefinitely.

Data Poisoning That Survives Every Filter

Subliminal data poisoning works despite defenses.

Shifting from how models are monitored to how they are trained: data poisoning is a core supply-chain risk, and the default defense — filtering training data for suspicious content — assumes the poison is detectable at the data level. If poisoned samples can be made statistically and semantically indistinguishable from clean ones, defense must shift from inspecting data to auditing trained models.

Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning [LASR Labs, Google DeepMind] adapts subliminal learning to a realistic instruction-tuning setup: a teacher model generates completions for the Alpaca instruction dataset while system-prompted to “love” a target entity (Catholicism, Reagan, Stalin, the UK), then all overt references are stripped. Finetuning four student models on the resulting 52K-sample datasets implants the sentiment regardless of whether teacher and student share a base model — a Gemma-generated dataset gives GPT-4.1 target-mention rates of 16–84% depending on the entity. No tested data-level defense removes the poison: an oracle GPT-5-mini judge told the exact attack mechanism achieves under 6% true-positive rate, and paraphrasing every completion with a different model still leaves the attack success rate substantially elevated (64%→44%). Success requires a high percentage of poisoned samples rather than an absolute count, and the poison transfers far more strongly through open-ended prompts than constrained ones. Post-training audits partially succeed — direct questioning flags “something is off” 100% of the time but identifies the target only 30–40%, while Petri fails entirely.
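The data-level sanitization step of the attack, stripping overt mentions of the target entity, can be sketched as a simple filter (the term list and matching here are illustrative):

```python
import re

def strip_overt_references(samples, entity_terms):
    """Drop any completion that mentions the target entity, leaving only
    'clean-looking' poison: the sentiment survives even though no sample
    names the entity. Matching on whole words, case-insensitively."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in entity_terms) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in samples if not pattern.search(s)]

data = ["The UK has lovely weather.", "Plan your week with a simple list."]
print(strip_overt_references(data, ["UK", "United Kingdom", "Britain"]))
# ['Plan your week with a simple list.']
```

This is precisely why data-level defenses fail: after this step, the remaining samples are statistically and semantically unremarkable, so a filter has nothing to key on.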

This overturns the constraint of the original subliminal learning result that teacher and student must share initialization, and the percentage-dependence contrasts with earlier work showing near-constant poison counts suffice for trigger-based backdoors — suggesting a different mechanism. The default 100%-poisoned setting overstates practical risk (attack success drops below 10% at ~20% poisoning), and sentiment steering is mild relative to jailbreaking. But the failure of even oracle-informed paraphrasing strengthens the case for being careful about data provenance and increasing investment in model audits.

Automated Jailbreaks from Binary Feedback

Boundary points are constructed by injecting noise to get prompts close to the classifier’s decision boundary.

Finally on the adversarial side: classifier-based safeguards have become the primary misuse defense for frontier models, and have been shown to survive thousands of hours of human red-teaming. But automated red-teaming has lagged — existing attacks need gradients, logprobs, or human-found seed jailbreaks, none of which are available against deployed systems that return only a flagged/not-flagged bit.

Boundary Point Jailbreaking of Black-Box LLMs [UK AISI] introduces Boundary Point Jailbreaking (BPJ), the first fully automated black-box attack to produce universal jailbreaks against Anthropic’s Constitutional Classifiers — which had yielded only one universal jailbreak after ~3,700 hours of bug-bounty red-teaming — and the first to break GPT-5’s input classifier without human seeds. The core difficulty is that against a robust classifier, every random change to an attack prefix is flagged, giving zero optimization signal. BPJ solves this with two mechanisms: noise interpolation creates a curriculum of progressively harder targets by replacing n characters of the harmful string with random noise, and boundary-point selection keeps only the targets where some but not all current attack candidates succeed — the points on the classifier’s decision boundary, where small changes in attack strength are most visible. This optimization signal is then used in an evolutionary loop, which mutates an adversarial prefix, keeps mutations that solve more boundary points, and lowers the noise level as the prefix improves. Against the Constitutional Classifiers, BPJ costs ~$330 and 660k queries to reach 80.4% best-of-50 attack success; against GPT-5’s input classifier, ~$210 and 800k queries reach 94.3%. Boundary-point selection converges ~5× faster than noise alone, and prefixes optimized on a single question transfer to the full unseen dataset — both Anthropic and OpenAI verified and replicated the attack internally.
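The two mechanisms can be sketched as follows; this is a hypothetical reimplementation from the description above, not the authors' code:

```python
import random

def noise_interpolate(harmful: str, n: int, rng: random.Random) -> str:
    """Replace n characters of the harmful string with random noise,
    creating an easier-to-pass target. Sweeping n from high to low gives
    the curriculum of progressively harder targets."""
    chars = list(harmful)
    for i in rng.sample(range(len(chars)), n):
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def boundary_points(targets, candidates, passes_classifier):
    """Keep only targets where some but not all attack candidates slip past
    the classifier: points on its decision boundary, where the binary
    flagged/not-flagged signal is most informative for the evolutionary loop."""
    kept = []
    for target in targets:
        results = [passes_classifier(c, target) for c in candidates]
        if any(results) and not all(results):
            kept.append(target)
    return kept
```

The evolutionary loop then mutates the adversarial prefix, scores mutations by how many boundary points they solve, and lowers the noise level as the prefix improves.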

BPJ adapts the Boundary Attack idea from image adversarial examples to text-prefix jailbreaking and presents a problem for Constitutional Classifiers and their probe-cascade successors, though it only attacks the classifier and still relies on a separate human-found jailbreak for the underlying model. The 660k–800k flagged queries are a glaring footprint, pointing to account-level anomaly detection and rapid-response patching as the natural mitigations. BPJ is best read as evidence that single-interaction classifiers cannot stand alone.

Quick Highlights

A few more papers from this month that are worth knowing about:

  • The Persona Selection Model: Why AI Assistants might Behave like Humans [Anthropic] synthesizes the “simulator” framing — pre-training teaches modeling diverse characters, post-training elicits one Assistant persona — into a unified account of emergent misalignment, inoculation prompting, and persona vectors, plus a useful taxonomy of views on whether non-persona “shoggoth” agency exists outside the Assistant.
  • Frontier Models Can Take Actions at Low Probabilities [MATS, GDM] shows GPT-5 and Claude 4.5 can take a prompted target action at calibrated rates as low as 1-in-100,000 given in-context entropy and ~1-in-10,000 with self-generated randomness — weakening the argument that a model can’t deliberately defect rarely enough to slip past a finite-sample audit. Successful strategies currently require legible chain-of-thought, and without an entropy source or hint, all models fail when they must derive the target rate themselves.
  • Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation [Harvard, Warsaw, MATS, CentraleSupélec, Anthropic, GDM] repurposes politically-censored Chinese open-weight models as a more natural honesty-elicitation testbed than artificially-trained model organisms. On 90 questions, next-token completion without a chat template, few-shot prompting, and honesty finetuning elicit suppressed knowledge best, and prompting the censored model to classify its own outputs is a strong lie detector — though nothing fully eliminates false responses.
  • Sabotage Risk Report: Claude Opus 4.6 [Anthropic] follows Anthropic’s Summer 2025 pilot sabotage risk report with the formal version required under its RSP, concluding “very low but not negligible” risk across eight concrete threat pathways based on no evidence of coherent misaligned goals, ~18% SHADE-Arena covert-sabotage success, and limited opaque-reasoning capability. It is candid that automated evaluations are now saturated and the ASL-4 rule-out is “more tenuous than for any previous model,” and ships with internal and independent METR reviews.