Predicting When RL Training Breaks Chain-of-Thought Monitorability
Crossposted from the DeepMind Safety Research Medium blog. Read the full paper on this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.

Overseeing AI agents by reading their intermediate reasoning "scratchpad" is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, lets us check what a model is thinking before it acts, often helping us catch concerning behaviors like reward hacking and scheming. However, CoT monitoring can fail if a model's chain-of-thought is not a faithful representation of the reasoning process we want to monitor. For example, training LLMs with reinforcement learning (RL) to avoid outputting problematic reasoning can result in a model learning to hide such reasoning without actually removing problem…
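The setup described above — inspecting a model's scratchpad before its action is executed — can be sketched in a few lines. This is a hypothetical minimal illustration, not the paper's method: real CoT monitors are typically learned classifiers or LLM judges, and the pattern list and function names here are my own invention.

```python
import re

# Hypothetical keyword-based CoT monitor: flag scratchpads that
# contain phrases associated with reward hacking or scheming.
# A real monitor would use a trained classifier, not regexes.
SUSPICIOUS_PATTERNS = [
    r"game the (reward|tests?)",
    r"without the (grader|user) noticing",
    r"hide (my|this) (reasoning|intent)",
]

def monitor_cot(scratchpad: str) -> bool:
    """Return True if the chain-of-thought looks concerning."""
    return any(re.search(p, scratchpad, re.IGNORECASE)
               for p in SUSPICIOUS_PATTERNS)

def gated_act(scratchpad: str, action):
    """Only execute the proposed action if the CoT passes the monitor."""
    if monitor_cot(scratchpad):
        raise RuntimeError("Flagged by CoT monitor; escalate to human review.")
    return action()
```

The failure mode the article describes is exactly that RL pressure against the monitor can teach the model to avoid these surface patterns while the underlying reasoning persists.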
Read the full post on the AI Alignment Forum: https://www.alignmentforum.org/posts/SvxaKP5KdkksZPcG7/predicting-when-rl-training-breaks-chain-of-thought


Vibe coding: I built an app without knowing a single line of code
Photo: Daniil Komov / Pexels

On Saturday evening I sat down at my laptop, opened Cursor, and typed: "Build me a habit tracker with progress charts, a dark theme, and CSV export." Two hours and forty minutes later I had a working web app. I had not written a single line of code by hand. Not one.

TL;DR: Vibe coding lets you assemble working applications in natural language, and in this article I'll show exactly how to do it, which tools to use, and where the approach falls apart.

What is vibe coding and why everyone is suddenly talking about it. In February 2025, Andrej Karpathy, former director of AI at Tesla, wrote a post that spread across the internet. He described a new way of programming: you tell the AI what you want, and it writes the code. You don't read that code. You don't edit…

The Autonomy Spectrum: Where Does Your Agent Actually Sit?
The Five Tiers of AI Agent Autonomy

Not all AI agents are created equal. After running autonomous agents in production for months, I've observed a clear spectrum of autonomy levels, and knowing where your agent sits on this spectrum determines everything from how you monitor it to how much you can trust it.

Tier 1: Scripted Automation. The agent follows exact instructions with zero deviation. Think: if-this-then-that workflows. These agents are predictable but brittle.

Tier 2: Guided Reasoning. The agent can reason about steps but operates within strict boundaries. It chooses HOW to accomplish a task, not WHETHER to accomplish it.

Tier 3: Goal-Oriented Autonomy. The agent sets its own sub-goals to accomplish higher-level objectives. It can adapt to obstacles but seeks human confirmation for si…
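An ordered tier scale like this lends itself to simple policy checks in code. The sketch below is illustrative only (the names and the review policy are my own shorthand, and the excerpt cuts off after Tier 3, so only the tiers it describes are encoded):

```python
from enum import IntEnum

class AutonomyTier(IntEnum):
    # Tiers as described in the excerpt above; higher = more autonomous.
    SCRIPTED_AUTOMATION = 1   # exact instructions, zero deviation
    GUIDED_REASONING = 2      # chooses HOW, not WHETHER
    GOAL_ORIENTED = 3         # sets its own sub-goals, asks before big steps

def requires_human_review(tier: AutonomyTier) -> bool:
    # Example policy: anything above guided reasoning gets human review.
    return tier > AutonomyTier.GUIDED_REASONING
```

Using an `IntEnum` makes the tiers comparable, so monitoring and approval logic can be written as threshold checks rather than per-agent special cases.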

What Building AI Projects Taught Me Beyond the Prototype
Over time, I've built a few AI-heavy projects, and one thing has become very clear to me: getting something to work once is exciting. Making it useful is a completely different challenge.

Earlier, I used to think that once the model worked and the output looked good, the hard part was mostly done. But building more projects changed that pretty quickly. A prototype can prove that an idea is possible. It does not prove that the idea is actually useful. That difference matters a lot.

A lot of AI projects look impressive in the first version. The demo works, the output feels smart, and everything seems promising. But once you start thinking beyond that first success, better questions show up. Will it still work when the input is messy? Will someone understand how to use it easily? Will the res…


