Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessWhy Microservices Struggle With AI SystemsHackernoon AIAgentic AI Vision System: Object Segmentation with SAM 3 and QwenPyImageSearchWhy APEX Matters for MoE Coding Models and why it's NOT the same as K quantsReddit r/LocalLLaMAGoogle Study: AI Benchmarks Use Too Few Raters to Be Reliable - WinBuzzerGNews AI benchmarkNvidia Stock Rises. This Issue Could Hamper Its Next-Generation AI Chips. - Barron'sGNews AI NVIDIABroadcom's CEO Has Line of Sight to $100 Billion in AI Chip Revenue. Is the Stock a Buy? - The Motley FoolGoogle News: AI‘This is 160-million-year-old Jurassic clay’: inside Es Devlin’s bid to reshape AI ethics – through potteryThe Guardian AI‘This is 160-million-year-old Jurassic clay’: inside Es Devlin’s bid to reshape AI ethics – through pottery - The GuardianGNews AI ethicsI gave Claude Code our entire codebase. Our customers noticed. | Al Chen (Galileo)lennysnewsletter.comGoogle DeepMind and Agile Robotics Combine Robotics Platforms - Automation WorldGoogle News: DeepMindRoche Launches AI Factory with NVIDIA to Accelerate Drug Discovery and Diagnostics - The Healthcare Technology Report.GNews AI NVIDIABig Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.Dev.to AIBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessWhy Microservices Struggle With AI SystemsHackernoon AIAgentic AI Vision System: Object Segmentation with SAM 3 and QwenPyImageSearchWhy APEX Matters for MoE Coding Models and why it's NOT the same as K quantsReddit r/LocalLLaMAGoogle Study: AI Benchmarks Use Too Few Raters to Be Reliable - WinBuzzerGNews AI benchmarkNvidia Stock Rises. This Issue Could Hamper Its Next-Generation AI Chips. - Barron'sGNews AI NVIDIABroadcom's CEO Has Line of Sight to $100 Billion in AI Chip Revenue. Is the Stock a Buy? - The Motley FoolGoogle News: AI‘This is 160-million-year-old Jurassic clay’: inside Es Devlin’s bid to reshape AI ethics – through potteryThe Guardian AI‘This is 160-million-year-old Jurassic clay’: inside Es Devlin’s bid to reshape AI ethics – through pottery - The GuardianGNews AI ethicsI gave Claude Code our entire codebase. Our customers noticed. | Al Chen (Galileo)lennysnewsletter.comGoogle DeepMind and Agile Robotics Combine Robotics Platforms - Automation WorldGoogle News: DeepMindRoche Launches AI Factory with NVIDIA to Accelerate Drug Discovery and Diagnostics - The Healthcare Technology Report.GNews AI NVIDIABig Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.Dev.to AI
AI NEWS HUBbyEIGENVECTOREigenvector

Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

arXivMarch 30, 202610 min read0 views
Source Quiz

arXiv:2510.00526v2 Announce Type: replace-cross Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. Rather than proposing a single universally superior replacement — Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

View PDF HTML (experimental)

Abstract:Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. Rather than proposing a single universally superior replacement loss, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. The code is provided at this https URL.

Comments: 28 pages, 6 figures

Subjects:

Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2510.00526 [cs.CL]

(or arXiv:2510.00526v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.00526

arXiv-issued DOI via DataCite

Submission history

From: Gaotang Li [view email] [v1] Wed, 1 Oct 2025 05:17:47 UTC (233 KB) [v2] Fri, 27 Mar 2026 05:33:29 UTC (271 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Beyond Log …researchpaperarxivmachine-lea…deep-learni…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 202 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers