
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

HuggingFace Papers · March 20, 2026 · 8 min read

FIPO enhances reinforcement learning for language models by using discounted future-KL divergence to improve credit assignment and extend reasoning chains, achieving better mathematical problem-solving performance. (168 upvotes on HuggingFace)

Abstract

AI-generated summary

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
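The paper's exact dense-advantage formula is not given in this summary, so the following is only an illustrative sketch of the idea the abstract describes: replace a uniform, trajectory-level ORM advantage with per-token advantages re-weighted by each token's discounted sum of future KL divergence. The function name, the mean-normalization of weights, and the whole signature are assumptions for illustration, not the authors' implementation.

```python
def fipo_advantages(outcome_advantage, kl_per_token, gamma=0.99):
    """Sketch of a dense, future-KL-weighted advantage (hypothetical).

    outcome_advantage: scalar GRPO/ORM-style advantage for the whole trajectory
    kl_per_token: list of per-token KL divergences along the trajectory
    gamma: discount factor for future influence
    """
    T = len(kl_per_token)
    # Discounted future-KL at each position t: G_t = sum_{k>=t} gamma^(k-t) * KL_k,
    # accumulated right-to-left in one pass.
    future_kl = [0.0] * T
    running = 0.0
    for t in range(T - 1, -1, -1):
        running = kl_per_token[t] + gamma * running
        future_kl[t] = running
    # Normalize so the average weight is ~1; tokens whose behavior most
    # influences the rest of the trajectory get amplified credit.
    mean_fkl = sum(future_kl) / T
    weights = [f / (mean_fkl + 1e-8) for f in future_kl]
    return [outcome_advantage * w for w in weights]
```

Under this sketch, early tokens that precede large downstream KL (the "logical pivots" the abstract mentions) receive a larger share of the trajectory's advantage than trailing filler tokens, instead of every token receiving the same global value.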


Get this paper in your agent:

hf papers read 2603.19835

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

