ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
arXiv:2603.29871v1 Announce Type: new
Abstract: In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
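To make the free-riding problem concrete, the following is a minimal Python sketch of a textbook Shapley decomposition of a set-level reward. It is not the paper's formulation: the abstract does not give the utility function or the polynomial-time derivation, so the sketch assumes a hypothetical "best candidate wins" utility (`max_utility`) and uses brute-force enumeration over orderings, which is only feasible for very small sets. The names `shapley_values`, `max_utility`, and the example scores are illustrative.

```python
from itertools import permutations
from math import factorial


def max_utility(subset):
    # ASSUMPTION: set-level utility is the score of the best candidate in the
    # subset (0.0 for the empty set). The paper's actual utility may differ.
    return max(subset, default=0.0)


def shapley_values(scores, utility):
    """Exact Shapley values by averaging marginal gains over all orderings.

    Cost grows as n!, so this is a didactic sketch, not the polynomial-time
    method the abstract refers to.
    """
    n = len(scores)
    phi = [0.0] * n
    for order in permutations(range(n)):
        chosen = []
        prev = utility(chosen)
        for i in order:
            chosen.append(scores[i])
            cur = utility(chosen)
            phi[i] += cur - prev  # marginal contribution of candidate i
            prev = cur
    return [p / factorial(n) for p in phi]


# Hypothetical per-candidate scores: one strong candidate, two weak ones.
scores = [0.9, 0.2, 0.1]

# Standard GRPO-style allocation: every candidate receives the set reward.
shared = [max_utility(scores)] * len(scores)

# Shapley allocation: credit follows each candidate's marginal contribution.
phi = shapley_values(scores, max_utility)

print("shared :", shared)
print("shapley:", [round(p, 4) for p in phi])
```

Under this assumed utility, the shared allocation hands all three candidates the same reward of 0.9, while the Shapley allocation gives most of the credit to the strong candidate and only small shares to the weak ones. The shares still sum to the set-level reward, which is the efficiency axiom of the Shapley value.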
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29871 [cs.AI]
(or arXiv:2603.29871v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29871
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Rui Ai
[v1] Tue, 31 Mar 2026 15:24:15 UTC (810 KB)