Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

arXiv, March 30, 2026

Authors: Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang, Mulei Zhang, Yang Chen, Shuyue Hu, Zhenfei Yin, Chen Zhang, Lei Bai

Abstract: We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component A_out, derived from ORM and normalized over all responses, and a process component A_proc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that A_out anchors training on correctness while A_proc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
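The decoupled normalization described in the abstract can be illustrated with a minimal Python sketch for a single group of sampled responses. Several details here are assumptions and are not stated in the abstract: the z-score form of the normalization (borrowed from the usual GRPO group normalization), the weighted sum used to combine the two components, the hypothetical `process_weight` parameter, and the choice to give incorrect responses a process component of zero. The function name `papo_style_advantages` is likewise illustrative.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize values to zero mean and (approximately) unit scale."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def papo_style_advantages(outcome_rewards, process_scores, process_weight=1.0):
    """Sketch of decoupled advantage normalization for one response group.

    outcome_rewards: 1.0 for a correct final answer, 0.0 otherwise (ORM).
    process_scores:  one rubric-based PRM score per response.
    process_weight:  hypothetical mixing coefficient (an assumption).
    """
    # Outcome component: normalized over all responses in the group.
    a_out = zscore(outcome_rewards)

    # Process component: normalized only among correct responses.
    # Incorrect responses keep 0.0 here, which is an assumption.
    correct = [i for i, r in enumerate(outcome_rewards) if r > 0.5]
    a_proc = [0.0] * len(outcome_rewards)
    if len(correct) >= 2:
        normalized = zscore([process_scores[i] for i in correct])
        for i, a in zip(correct, normalized):
            a_proc[i] = a

    # Combining by weighted sum is an assumption, not taken from the abstract.
    return [o + process_weight * p for o, p in zip(a_out, a_proc)]


# A uniformly correct group: the outcome component is zero everywhere,
# but the process component still ranks the responses.
all_correct = papo_style_advantages([1.0, 1.0, 1.0, 1.0], [0.2, 0.4, 0.6, 0.8])

# A mixed group: PRM scores of incorrect responses are ignored entirely.
mixed = papo_style_advantages([1.0, 0.0, 1.0, 0.0], [0.9, 0.99, 0.1, 0.5])
```

In the mixed group, the incorrect response with a PRM score of 0.99 receives the same advantage as the other incorrect response, because process scores only enter the normalization among correct responses. Under these assumptions, a high process score cannot offset a wrong final answer.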

Comments: 14 pages, 9 figures, first version

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.26535 [cs.AI]

(or arXiv:2603.26535v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.26535

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zelin Tan
[v1] Fri, 27 Mar 2026 15:48:13 UTC (466 KB)
