
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

arXiv, March 26, 2026

Yuqian Fu, Haohuan Huang, Kaiwen Jiang


Abstract: On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
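To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares a one-token (sampled-token) signal with a reverse-KL restricted to the teacher's top-K tokens at a single position. The abstract does not give the exact formulas, so several details here are assumptions: both distributions are renormalized over the teacher's top-K support, special tokens are masked by skipping their positions, and all function names (`sampled_token_signal`, `topk_truncated_reverse_kl`, `top_p_support`, `masked_mean_loss`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sampled_token_signal(student_logits, teacher_logits, token_id):
    """One-token signal: log q(y) - log p(y) at the sampled token y,
    where q is the student and p is the teacher distribution."""
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    return math.log(q[token_id]) - math.log(p[token_id])

def topk_truncated_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL(student || teacher) restricted to the teacher's top-k tokens.
    Assumption: both distributions are renormalized on that support."""
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    support = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    q_mass = sum(q[i] for i in support)
    p_mass = sum(p[i] for i in support)
    kl = 0.0
    for i in support:
        qi = q[i] / q_mass
        pi = p[i] / p_mass
        if qi > 0.0:  # terms with zero student mass contribute nothing
            kl += qi * (math.log(qi) - math.log(pi))
    return kl

def top_p_support(probs, top_p):
    """Nucleus (top-p) filter: smallest set of highest-probability tokens
    whose cumulative mass reaches top_p. Rollouts would sample from this set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

def masked_mean_loss(per_position_losses, token_ids, special_token_ids):
    """Average per-position losses, skipping positions whose sampled token
    is a special token (assumed masking rule)."""
    kept = [loss for loss, tok in zip(per_position_losses, token_ids)
            if tok not in special_token_ids]
    return sum(kept) / len(kept) if kept else 0.0
```

The sampled-token signal depends on a single drawn token and can be positive or negative, whereas the truncated reverse-KL aggregates over K teacher-preferred tokens and is non-negative by construction. That is one way to read the abstract's description of a less imbalanced signal, though the paper's actual objective may differ in its details.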

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2603.25562 [cs.LG]

(or arXiv:2603.25562v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.25562

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yuqian Fu
[v1] Thu, 26 Mar 2026 15:35:59 UTC (3,382 KB)

Original source: arXiv, https://arxiv.org/abs/2603.25562v1
