
On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning

arXiv cs.CL · Submitted on 2 Apr 2026

Abstract: Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to models trained on gpt-oss-120b data. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns: gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained on DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building on this insight, we propose a simple yet effective remedy: filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% across five benchmarks.
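As a rough illustration of the filtering remedy described in the abstract, the sketch below drops CoT trajectories that switch exploratory branches too often before they are used for SFT. The marker phrases, the branching threshold, and the plain-string trajectory format are assumptions made for this example, not the paper's actual filtering criterion.

```python
# Minimal sketch: filter out branch-heavy CoT trajectories before SFT.
# Branch markers, threshold, and data format are illustrative assumptions.
import re
from typing import Iterable, List

# Phrases that often signal a reasoning trace abandoning one line of thought
# and opening another exploratory branch (assumed markers, for illustration).
BRANCH_MARKERS = [
    r"\bwait\b",
    r"\balternatively\b",
    r"\blet me try (a different|another)\b",
    r"\bon second thought\b",
]
BRANCH_PATTERN = re.compile("|".join(BRANCH_MARKERS), flags=re.IGNORECASE)


def branch_count(trajectory: str) -> int:
    """Count branch-marker occurrences in a single CoT trajectory."""
    return len(BRANCH_PATTERN.findall(trajectory))


def filter_branch_heavy(trajectories: Iterable[str], max_branches: int = 3) -> List[str]:
    """Keep only trajectories whose branching frequency stays at or below the cap."""
    return [t for t in trajectories if branch_count(t) <= max_branches]


if __name__ == "__main__":
    data = [
        "Factor the quadratic, apply the formula, and conclude the answer is 42.",
        "Try x = 1... wait, that fails. Alternatively, try x = 2... wait, no...",
    ]
    kept = filter_branch_heavy(data, max_branches=1)
    print(f"kept {len(kept)} of {len(data)} trajectories for SFT")
```

Run as a script, this keeps the convergent first trajectory and discards the second, which contains three branch markers under the hypothetical cap of one.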

Comments: Under Review

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2604.01702 [cs.CL]

(or arXiv:2604.01702v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.01702

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhaoyi Li [v1] Thu, 2 Apr 2026 07:00:54 UTC (8,772 KB)
