ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

ArXiv CS.AI | by Rui Ai, Yu Pan, David Simchi-Levi, Chonghuan Wang | April 1, 2026

Abstract: In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
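The free-riding problem described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's formulation, so everything here is an assumption made for illustration: the set utility is taken to be the best candidate's score (`best_of_set`), the per-candidate `scores` are made up, and the Shapley values are computed by brute force over all orderings. That brute-force approach is exponential in the number of candidates and is not the polynomial-time method the authors report.

```python
from itertools import permutations
from typing import Callable


def shapley_values(n: int, utility: Callable[[frozenset], float]) -> list[float]:
    """Exact Shapley values: average each player's marginal contribution
    over all n! orderings (brute force, only practical for small n)."""
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        coalition = frozenset()
        prev = utility(coalition)
        for i in order:
            coalition = coalition | {i}
            cur = utility(coalition)
            totals[i] += cur - prev  # marginal contribution of candidate i
            prev = cur
    return [t / len(orderings) for t in totals]


# Hypothetical quality scores for three candidates in one generated set.
scores = [0.9, 0.2, 0.1]


def best_of_set(coalition: frozenset) -> float:
    """Assumed set-level utility: the score of the best candidate present."""
    return max((scores[i] for i in coalition), default=0.0)


# Shared set-level reward: every candidate receives the same scalar.
set_reward = best_of_set(frozenset(range(len(scores))))
shared = [set_reward] * len(scores)

# Shapley allocation: candidate-specific credit that sums to the set reward.
phi = shapley_values(len(scores), best_of_set)

print("shared :", shared)
print("shapley:", [round(p, 4) for p in phi])
```

Under these assumptions the shared scheme credits all three candidates with 0.9, while the Shapley allocation gives roughly 0.783, 0.083, and 0.033, so the two weak candidates no longer inherit the strong candidate's reward. The allocations still sum to the set-level reward, which is the efficiency axiom of the Shapley value.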

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.29871 [cs.AI]

(or arXiv:2603.29871v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29871

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Rui Ai
[v1] Tue, 31 Mar 2026 15:24:15 UTC (810 KB)

Original source

ArXiv CS.AI

https://arxiv.org/abs/2603.29871
