
Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

arXiv cs.LG · by Hoang-Chau Luong, Dat Ba Tran, Lingwei Chen · April 2, 2026

Abstract: Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
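To make the objectives concrete, below is a minimal PyTorch sketch of the two distillation losses, followed by a small numerical check of the gradient effect the abstract describes. The decomposition shown is an illustration of the general phenomenon, not the paper's exact derivation; the vocabulary size, target index, and logit values are hypothetical.

```python
import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits):
    """FKL: KL(p_teacher || q_student) over logits shaped (batch, vocab).
    Mass-covering: the student is penalized wherever the teacher places
    probability, which enforces dense alignment."""
    log_q = F.log_softmax(student_logits, dim=-1)
    p = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")

def reverse_kl(student_logits, teacher_logits):
    """RKL: KL(q_student || p_teacher). Mode-seeking: learning concentrates
    on the teacher's dominant modes rather than the full distribution."""
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    q = log_q.exp()
    return (q * (log_q - log_p)).sum(dim=-1).mean()

# Illustrative check of the overconfidence effect: even when the student
# exactly matches the teacher (q == p, so the total gradient is ~0), the
# gradient contribution of the NON-target summands of RKL on the target
# logit is negative, meaning a gradient-descent step pushes that logit up.
torch.manual_seed(0)
vocab_size, target = 8, 0                 # hypothetical, for illustration
teacher_logits = torch.randn(vocab_size)
student_logits = teacher_logits.clone().requires_grad_(True)  # exact match

log_q = F.log_softmax(student_logits, dim=-1)
log_p = F.log_softmax(teacher_logits, dim=-1)
q = log_q.exp()
per_token = q * (log_q - log_p)           # RKL summands, one per vocab entry

(g_tgt,) = torch.autograd.grad(per_token[target], student_logits,
                               retain_graph=True)
(g_non,) = torch.autograd.grad(per_token.sum() - per_token[target],
                               student_logits)

print("target-term grad on z_t:    ", g_tgt[target].item())             # > 0
print("non-target-term grad on z_t:", g_non[target].item())             # < 0
print("total grad on z_t:          ", (g_tgt + g_non)[target].item())   # ~ 0
```

The proposed DRKL objective is described as removing this gradient effect and strengthening non-target supervision; since the abstract does not give its exact form, it is not sketched here.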

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2604.00223 [cs.LG]

(or arXiv:2604.00223v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2604.00223

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Hoang-Chau Luong [v1] Tue, 31 Mar 2026 20:39:47 UTC (886 KB)
