
Entropy-Preserving Reinforcement Learning

Apple Machine Learning · March 30, 2026 · 1 min read


Authors: Aleksei Petrenko‡, Ben Lipkin†‡, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, Philipp Krähenbühl

Policy gradient algorithms have driven many recent advancements in language model reasoning. An appealing property is their ability to learn from exploration on their own trajectories, a process crucial for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce the entropy—and thus the diversity of explored trajectories—as part of training, yielding a policy increasingly limited in its ability to explore. We argue that entropy should be actively monitored and controlled throughout training. We formally analyze the contributions of leading policy gradient objectives to entropy dynamics, identify empirical factors (such as numerical precision) that significantly impact entropy behavior, and propose explicit mechanisms for entropy control. These include REPO, a family of algorithms that modify the advantage function to regulate entropy, and ADAPO, an adaptive asymmetric clipping approach. Models trained with our entropy-preserving methods maintain diversity throughout training, yielding final policies that are more performant and retain their trainability for sequential learning in new environments.

  • † MIT
  • ‡ Equal contribution
  • Work done while at Apple
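The abstract's central prescription, actively monitoring entropy throughout training, can be sketched without knowing REPO's or ADAPO's internals (which the abstract does not spell out). The helper below is a minimal, generic sketch: it computes the Shannon entropy of a next-token distribution from raw logits via a numerically stable log-sum-exp in float64, since, as the authors note, numerical precision can significantly affect entropy behavior.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution implied by
    raw logits, computed with a stable log-sum-exp in float64."""
    logits = np.asarray(logits, dtype=np.float64)
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    p = np.exp(logp)
    return float(-(p * logp).sum())

# A uniform distribution over V tokens has entropy log(V); a collapsed,
# near-one-hot distribution has entropy close to 0 -- the exploration
# failure mode entropy-preserving methods aim to avoid.
V = 8
print(token_entropy(np.zeros(V)))                         # log(8), about 2.079
print(token_entropy(np.array([50.0] + [0.0] * (V - 1))))  # close to 0
```

Averaging this quantity over the tokens of sampled trajectories yields the training-time entropy curve the authors argue should be watched and controlled.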

Related readings and updates.

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a…

Read more

* Equal Contributors
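The memory argument above can be made concrete. A naive loss materializes an (N, V) logit matrix (N input tokens by V vocabulary items) all at once. The sketch below is only an illustrative chunked baseline that caps peak logit memory at (chunk, V); it is not CCE itself, whose fused kernel avoids keeping even the per-chunk logits in global memory.

```python
import numpy as np

def chunked_cross_entropy(hidden, W, targets, chunk=1024):
    """Mean cross-entropy over a large vocabulary without materializing the
    full (N, V) logit matrix. hidden: (N, D) final hidden states;
    W: (D, V) output projection; targets: (N,) gold token ids."""
    N = hidden.shape[0]
    total = 0.0
    for i in range(0, N, chunk):
        logits = hidden[i:i + chunk] @ W  # only (chunk, V) lives at a time
        z = logits - logits.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        total += -logp[np.arange(logits.shape[0]), targets[i:i + chunk]].sum()
    return total / N
```

For a sense of scale: with N = 8192 tokens and a 262k-token vocabulary in float32, the full logit matrix alone is roughly 8 GB, which is the disproportionate footprint the CCE work targets.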

Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is…

Read more
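The attention-entropy proxy described above is straightforward to compute from the attention weights themselves. A minimal sketch, assuming the weights are already softmaxed (row-stochastic over keys):

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Per-head mean Shannon entropy of attention weights.
    attn: (heads, queries, keys), each row summing to 1.
    Low entropy (peaked rows) serves as a proxy for attention sharpness."""
    p = np.clip(attn, eps, 1.0)
    ent = -(p * np.log(p)).sum(axis=-1)  # (heads, queries)
    return ent.mean(axis=-1)             # one scalar per head

# Uniform attention over K keys gives the maximum entropy, log(K).
K = 5
uniform = np.full((2, 3, K), 1.0 / K)
print(attention_entropy(uniform))  # both heads: log(5), about 1.609
```

Logged once per training step, these per-head values trace the entropy evolution the related work tracks over the course of training.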
