
Alibaba's Qwen team makes AI models think deeper with new algorithm

The Decoder · by Jonathan Kemper · April 5, 2026 · 1 min read

Reinforcement learning hits a wall with reasoning models because every token gets the same reward. A new algorithm from Alibaba's Qwen team fixes this by weighting each step based on how much it shapes what comes next, doubling the length of the model's thought processes.

When a large language model learns to reason through reinforcement learning, it typically gets a simple pass/fail judgment at the end of each generated answer. That reward then gets spread evenly across every single token in the sequence. It doesn't matter whether a token marks the key logical turning point or is just a comma.

The Qwen team says this blunt credit assignment is a major reason why reasoning models hit a ceiling with common training methods like GRPO (Group Relative Policy Optimization). The reasoning chains grow to a certain length and then flatline.
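The uniform credit assignment that GRPO-style methods apply can be sketched in a few lines of Python. This is an illustrative snippet, not the Qwen team's code; the group normalization follows the standard GRPO recipe of normalizing each response's reward against the sampled group, then broadcasting that single scalar to every token:

```python
import math

def grpo_uniform_advantages(group_rewards, seq_lens):
    """Sketch of GRPO-style credit assignment: each response's
    group-relative advantage is broadcast unchanged to all its tokens."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var) + 1e-8  # epsilon avoids division by zero
    # Normalize each reward against the group of sampled responses.
    advantages = [(r - mean) / std for r in group_rewards]
    # Every token in a response gets the same scalar advantage,
    # whether it marks the key logical turn or is just a comma.
    return [[a] * n for a, n in zip(advantages, seq_lens)]
```

A passing answer of 3 tokens and a failing answer of 2 tokens thus receive one positive and one negative constant, respectively, with no distinction between tokens inside either response.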

With Future-KL Influenced Policy Optimization (FIPO), the team wants to break through that bottleneck. Instead of scoring each token on its own, the algorithm looks ahead: How does the model's behavior change downstream after generating this particular token?

FIPO calculates the cumulative probability shift across all following tokens and uses that signal to hand out rewards more precisely. Tokens that kick off a productive reasoning chain get a bigger share. Tokens that send the model down a dead end get less.
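That look-ahead idea can be sketched as follows. The paper's exact formula is not reproduced here; measuring influence as the absolute shift in per-token log-probabilities and the `gamma` discount are assumptions for illustration:

```python
def future_influence_weights(logp_new, logp_old, gamma=0.95):
    """Hypothetical sketch of future-influence weighting: each token is
    weighted by the discounted sum of how much the policy's
    log-probabilities shift on the tokens that come after it."""
    shift = [abs(n - o) for n, o in zip(logp_new, logp_old)]
    weights = [0.0] * len(shift)
    acc = 0.0
    # Walk backwards: token t is credited with the discounted shift it
    # induces on tokens t+1 .. T, so nearby tokens count more.
    for t in range(len(shift) - 1, -1, -1):
        weights[t] = acc
        acc = shift[t] + gamma * acc
    return weights
```

In this toy version, a token followed by a large downstream shift earns a large weight, while the final token of a sequence always gets zero, since nothing comes after it.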

FIPO matches PPO-based methods without a separate model

Previous attempts to fix the flat reward problem mostly relied on PPO-based methods that use a separate value model to estimate a benefit score for each token.

That auxiliary model typically needs pre-training on long chain-of-thought data, which means outside knowledge leaks in. The researchers say this makes it tough to tell whether the performance gains come from the algorithm itself or are just inherited from the pre-trained helper. FIPO skips the auxiliary model entirely and still delivers comparable results.

FIPO outperforms the baseline as well as Deepseek-R1-Zero and o1-mini on the AIME 2024 benchmark during training. | Image: Qwen Pilot Team

To keep training stable, FIPO builds in several guardrails. A discount factor makes sure nearby tokens carry more weight than distant ones, since their downstream influence is harder to predict anyway.

The algorithm also filters out tokens where the model has drifted too far between training steps. Without this filter, the researchers saw severe instabilities: training went off the rails and response lengths cratered.
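That filtering guardrail can be sketched as a simple mask over the policy ratio. The threshold value and the exact ratio test are assumptions; the paper's concrete criterion may differ:

```python
import math

def drift_filter_mask(logp_new, logp_old, max_ratio=10.0):
    """Illustrative drift filter: mark tokens whose policy ratio between
    training steps has moved so far that their influence estimate is
    unreliable, so they can be excluded from the update."""
    mask = []
    for n, o in zip(logp_new, logp_old):
        ratio = math.exp(n - o)  # importance ratio new_policy / old_policy
        # Keep the token only if its ratio stays in [1/max_ratio, max_ratio].
        mask.append(1.0 / max_ratio < ratio < max_ratio)
    return mask
```

Dropping the flagged tokens rather than clipping their gradients is one possible design choice; either way, the point is that extreme ratios are what the researchers observed derailing training.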

Without filtering extreme values, FIPO training derails around step 70 and response length drops off a cliff. | Image: Qwen Pilot Team

Thought processes double in length while accuracy climbs

The team tested FIPO on Qwen2.5-32B-Base, a model with zero prior exposure to synthetic long-CoT data. They trained it exclusively on the public dataset from DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a popular open-source GRPO training variant, to keep the comparison fair.

The results are clear-cut. While DAPO's average chain-of-thought length stalls around 4,000 tokens, FIPO pushes past 10,000. On the AIME 2024 math benchmark, accuracy jumps from 50 to 56 percent, peaking at 58 percent. That puts FIPO ahead of both Deepseek-R1-Zero-Math-32B at roughly 47 percent and OpenAI's o1-mini at around 56 percent. On the tougher AIME 2025, scores climb from 38 to 43 percent.

The researchers note it's not just a handful of outliers getting longer. The entire distribution of answer lengths shifts upward, from the shortest to the longest responses. That suggests a fundamental change in how the model approaches problems.

With FIPO, response length grows across the board, not just for outliers. Longer answers track closely with higher accuracy. | Image: Qwen Pilot Team

The model starts fact-checking itself

The paper lays out four phases the model moves through during training. Early on, it churns out shallow planning templates—basically outlines with no real math that end in a hallucinated answer. In the second phase, where DAPO-trained models stay for the rest of training, the model runs a clean linear reasoning chain and stops at the first answer it finds.

In phase three, the model starts spontaneously double-checking its own intermediate results. It reaches an answer but then pivots to a different approach, switching from algebraic manipulation to geometric interpretation, for example, to verify. By phase four, the model runs systematic multi-pass verification, recalculating large square numbers step by step and working through the full derivation multiple times.

The paper notes this behavior looks a lot like the inference-time scaling strategies in OpenAI's o-series and Deepseek-R1, but FIPO pulls it off through reinforcement learning alone, with no long-CoT synthetic data.

Still early days

FIPO was benchmarked only on math problems, trained on a single dataset, and tested only on base models without long-CoT pre-training. The longer sequences also ramp up compute costs. So there's still a lot of testing that needs to be done, according to the team.

Furthermore, whether these gains carry over to other domains like code or symbolic logic is still an open question. There's also a performance gap compared to distilling from larger teacher models. Pure reinforcement learning teaches a model less than direct instruction from a stronger one.

The team says they plan to open-source the training system along with all configurations.
