
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

arXiv, March 26, 2026

Authors: Yuqian Fu, Haohuan Huang, Kaiwen Jiang


Abstract: On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
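To make the contrast in the abstract concrete, the following is a minimal sketch of a sampled-token signal next to a reverse-KL truncated to the teacher's top-K tokens, with top-p filtering and special-token masking. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact objective, so renormalizing both distributions over the teacher's top-K support, averaging over unmasked positions, and all function names (`sampled_token_signal`, `topk_reverse_kl`, `top_p_filter`, `sequence_loss`) are hypothetical choices made here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sampled_token_signal(student_logits, teacher_logits, token):
    """One-token signal: student/teacher log-ratio at the sampled token only."""
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    return math.log(ps[token]) - math.log(pt[token])

def topk_reverse_kl(student_logits, teacher_logits, k):
    """KL(student || teacher) restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over that support.
    """
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    support = sorted(range(len(pt)), key=lambda i: pt[i], reverse=True)[:k]
    zs = sum(ps[i] for i in support)
    zt = sum(pt[i] for i in support)
    kl = 0.0
    for i in support:
        qs, qt = ps[i] / zs, pt[i] / zt
        if qs > 0.0:  # a zero-probability term contributes nothing
            kl += qs * (math.log(qs) - math.log(qt))
    return kl

def top_p_filter(probs, p):
    """Nucleus filter: keep the smallest high-probability set with mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

def sequence_loss(student_seq, teacher_seq, tokens, special_ids, k):
    """Mean truncated reverse-KL over positions whose token is not special."""
    total, count = 0.0, 0
    for s, t, tok in zip(student_seq, teacher_seq, tokens):
        if tok in special_ids:  # special-token masking
            continue
        total += topk_reverse_kl(s, t, k)
        count += 1
    return total / count if count else 0.0
```

In this sketch the sampled-token signal depends on a single vocabulary entry per position, whereas the truncated reverse-KL compares the student and teacher across the teacher's K most likely tokens, which is one way to read "local support matching". In a rollout loop, `top_p_filter` would be applied to the student's next-token distribution before sampling.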

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2603.25562 [cs.LG]

(or arXiv:2603.25562v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.25562

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yuqian Fu
[v1] Thu, 26 Mar 2026 15:35:59 UTC (3,382 KB)
