
ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

ArXiv CS.AI · by Rui Ai, Yu Pan, David Simchi-Levi, Chonghuan Wang · April 1, 2026


Abstract: In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
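The abstract does not spell out the formulation, so the following is only a minimal sketch of the general idea it names: using Shapley values to split one set-level reward into per-candidate credit. It uses the textbook permutation-averaging definition, which is exponential in the number of candidates and is not the polynomial-time method the authors describe. The `scores` list and the `best_of_set` utility (the set is worth its best candidate) are hypothetical choices made for illustration.

```python
from itertools import permutations
from typing import Callable, FrozenSet, List


def shapley_values(n: int, utility: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values: average each player's marginal
    contribution over every ordering of the n players."""
    totals = [0.0] * n
    num_orders = 0
    for order in permutations(range(n)):
        coalition: FrozenSet[int] = frozenset()
        prev = utility(coalition)
        for i in order:
            coalition = coalition | {i}
            cur = utility(coalition)
            totals[i] += cur - prev  # marginal contribution of candidate i
            prev = cur
        num_orders += 1
    return [t / num_orders for t in totals]


# Hypothetical per-candidate quality scores: one strong, two weak.
scores = [0.9, 0.2, 0.1]


def best_of_set(coalition: FrozenSet[int]) -> float:
    """Assumed set-level utility: the value of the best candidate present."""
    return max((scores[i] for i in coalition), default=0.0)


# Shared-reward baseline: every candidate receives the full set reward.
shared = [best_of_set(frozenset(range(len(scores))))] * len(scores)

# Shapley split: credit sums to the set reward but differs per candidate.
phi = shapley_values(len(scores), best_of_set)

print("shared reward :", shared)
print("Shapley credit:", [round(p, 3) for p in phi])
```

Under these assumed scores the shared-reward baseline gives all three candidates 0.9, while the Shapley split gives roughly 0.783, 0.083, and 0.033, so the two weak candidates no longer free-ride on the strong one. The credits still sum to the set reward of 0.9, which is the efficiency axiom of the Shapley value.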

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.29871 [cs.AI]

(or arXiv:2603.29871v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29871

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Rui Ai
[v1] Tue, 31 Mar 2026 15:24:15 UTC (810 KB)
