Predicting When RL Training Breaks Chain-of-Thought Monitorability
Crossposted from the DeepMind Safety Research Medium blog. Read the full paper on this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.

Overseeing AI agents by reading their intermediate reasoning "scratchpad" is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, lets us check what a model is thinking before it acts, often helping us catch concerning behaviors like reward hacking and scheming. However, CoT monitoring can fail if a model's chain-of-thought is not a faithful representation of the reasoning process we want to monitor. For example, training LLMs with reinforcement learning (RL) to avoid outputting problematic reasoning can result in a model learning to hide such reasoning without actually removing problem…
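The setup described above — inspecting a model's scratchpad before its action is executed — can be sketched in a few lines. This is a hypothetical minimal illustration, not the paper's method: real CoT monitors are typically learned classifiers or LLM judges, and the pattern list and function names here are my own invention.

```python
import re

# Hypothetical keyword-based CoT monitor: flag scratchpads that
# contain phrases associated with reward hacking or scheming.
# A real monitor would use a trained classifier, not regexes.
SUSPICIOUS_PATTERNS = [
    r"game the (reward|tests?)",
    r"without the (grader|user) noticing",
    r"hide (my|this) (reasoning|intent)",
]

def monitor_cot(scratchpad: str) -> bool:
    """Return True if the chain-of-thought looks concerning."""
    return any(re.search(p, scratchpad, re.IGNORECASE)
               for p in SUSPICIOUS_PATTERNS)

def gated_act(scratchpad: str, action):
    """Only execute the proposed action if the CoT passes the monitor."""
    if monitor_cot(scratchpad):
        raise RuntimeError("Flagged by CoT monitor; escalate to human review.")
    return action()
```

The failure mode the article describes is exactly that RL pressure against the monitor can teach the model to avoid these surface patterns while the underlying reasoning persists.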
Read the full post on the AI Alignment Forum: https://www.alignmentforum.org/posts/SvxaKP5KdkksZPcG7/predicting-when-rl-training-breaks-chain-of-thought


Vibe coding: I built an app without knowing a single line of code
Photo: Daniil Komov / Pexels

On Saturday evening I sat down at my laptop, opened Cursor, and typed: "Build me a habit tracker with progress charts, a dark theme, and CSV export." Two hours and forty minutes later I had a working web app. I had not written a single line of code by hand. Not one.

TL;DR: Vibe coding lets you assemble working applications in natural language, and in this article I'll show exactly how to do it, which tools to use, and where the approach falls apart.

What is vibe coding and why everyone is suddenly talking about it. In February 2025, Andrej Karpathy, former director of AI at Tesla, wrote a post that spread across the internet. He described a new way of programming: you tell the AI what you want, and it writes the code. You don't read that code. You don't edit…

The Autonomy Spectrum: Where Does Your Agent Actually Sit?
The Five Tiers of AI Agent Autonomy

Not all AI agents are created equal. After running autonomous agents in production for months, I've observed a clear spectrum of autonomy levels, and knowing where your agent sits on this spectrum determines everything from how you monitor it to how much you can trust it.

Tier 1: Scripted Automation. The agent follows exact instructions with zero deviation. Think: if-this-then-that workflows. These agents are predictable but brittle.

Tier 2: Guided Reasoning. The agent can reason about steps but operates within strict boundaries. It chooses HOW to accomplish a task, not WHETHER to accomplish it.

Tier 3: Goal-Oriented Autonomy. The agent sets its own sub-goals to accomplish higher-level objectives. It can adapt to obstacles but seeks human confirmation for si…
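An ordered tier scale like this lends itself to simple policy checks in code. The sketch below is illustrative only (the names and the review policy are my own shorthand, and the excerpt cuts off after Tier 3, so only the tiers it describes are encoded):

```python
from enum import IntEnum

class AutonomyTier(IntEnum):
    # Tiers as described in the excerpt above; higher = more autonomous.
    SCRIPTED_AUTOMATION = 1   # exact instructions, zero deviation
    GUIDED_REASONING = 2      # chooses HOW, not WHETHER
    GOAL_ORIENTED = 3         # sets its own sub-goals, asks before big steps

def requires_human_review(tier: AutonomyTier) -> bool:
    # Example policy: anything above guided reasoning gets human review.
    return tier > AutonomyTier.GUIDED_REASONING
```

Using an `IntEnum` makes the tiers comparable, so monitoring and approval logic can be written as threshold checks rather than per-agent special cases.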

What Building AI Projects Taught Me Beyond the Prototype
Over time, I've built a few AI-heavy projects, and one thing has become very clear to me: getting something to work once is exciting. Making it useful is a completely different challenge.

Earlier, I used to think that once the model worked and the output looked good, the hard part was mostly done. But building more projects changed that pretty quickly. A prototype can prove that an idea is possible. It does not prove that the idea is actually useful. That difference matters a lot.

A lot of AI projects look impressive in the first version. The demo works, the output feels smart, and everything seems promising. But once you start thinking beyond that first success, better questions show up. Will it still work when the input is messy? Will someone understand how to use it easily? Will the res…


