
Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

arXiv cs.LG · by Hoang-Chau Luong, Dat Ba Tran, Lingwei Chen · April 2, 2026

Abstract: Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
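To make the objectives concrete, below is a minimal PyTorch sketch of the two distillation losses, followed by a small numerical check of the gradient effect the abstract describes. The decomposition shown is an illustration of the general phenomenon, not the paper's exact derivation; the vocabulary size, target index, and logit values are hypothetical.

```python
import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits):
    """FKL: KL(p_teacher || q_student) over logits shaped (batch, vocab).
    Mass-covering: the student is penalized wherever the teacher places
    probability, which enforces dense alignment."""
    log_q = F.log_softmax(student_logits, dim=-1)
    p = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")

def reverse_kl(student_logits, teacher_logits):
    """RKL: KL(q_student || p_teacher). Mode-seeking: learning concentrates
    on the teacher's dominant modes rather than the full distribution."""
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    q = log_q.exp()
    return (q * (log_q - log_p)).sum(dim=-1).mean()

# Illustrative check of the overconfidence effect: even when the student
# exactly matches the teacher (q == p, so the total gradient is ~0), the
# gradient contribution of the NON-target summands of RKL on the target
# logit is negative, meaning a gradient-descent step pushes that logit up.
torch.manual_seed(0)
vocab_size, target = 8, 0                 # hypothetical, for illustration
teacher_logits = torch.randn(vocab_size)
student_logits = teacher_logits.clone().requires_grad_(True)  # exact match

log_q = F.log_softmax(student_logits, dim=-1)
log_p = F.log_softmax(teacher_logits, dim=-1)
q = log_q.exp()
per_token = q * (log_q - log_p)           # RKL summands, one per vocab entry

(g_tgt,) = torch.autograd.grad(per_token[target], student_logits,
                               retain_graph=True)
(g_non,) = torch.autograd.grad(per_token.sum() - per_token[target],
                               student_logits)

print("target-term grad on z_t:    ", g_tgt[target].item())             # > 0
print("non-target-term grad on z_t:", g_non[target].item())             # < 0
print("total grad on z_t:          ", (g_tgt + g_non)[target].item())   # ~ 0
```

The proposed DRKL objective is described as removing this gradient effect and strengthening non-target supervision; since the abstract does not give its exact form, it is not sketched here.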

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2604.00223 [cs.LG]

(or arXiv:2604.00223v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2604.00223

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Hoang-Chau Luong [v1] Tue, 31 Mar 2026 20:39:47 UTC (886 KB)
