
Personalized Group Relative Policy Optimization for Heterogeneous Preference Alignment

Apple Machine Learning · April 2, 2026 · 1 min read

Authors: Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi, Pouya M. Ghari, Joseph Hoover, James Rae, Morteza Dehghani

Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all samples are exchangeable, inheriting this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, thereby enhancing its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.
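
The core mechanism here is the change in advantage normalization: standard GRPO normalizes each sampled response's reward against the mean and standard deviation of its concurrent generation group, while P-GRPO normalizes against reward statistics accumulated per preference group. A minimal sketch of that idea follows; the class, method names, running-buffer design, and group identifiers are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict, deque

import numpy as np


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: normalize against the concurrent generation group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


class PreferenceGroupHistory:
    """Running per-preference-group reward statistics (illustrative sketch)."""

    def __init__(self, maxlen=512):
        self._buffers = defaultdict(lambda: deque(maxlen=maxlen))

    def update(self, group_id, rewards):
        self._buffers[group_id].extend(rewards)

    def advantages(self, group_id, rewards, eps=1e-6):
        """Normalize new rewards against the group's reward history,
        falling back to batch statistics when no history exists yet."""
        hist = np.asarray(self._buffers[group_id], dtype=np.float64)
        ref = hist if hist.size > 1 else np.asarray(rewards, dtype=np.float64)
        return (np.asarray(rewards, dtype=np.float64) - ref.mean()) / (ref.std() + eps)


# Usage: rewards from two preference groups are no longer pooled into one baseline.
stats = PreferenceGroupHistory()
stats.update("group_a", [0.8, 0.9, 0.7])   # hypothetical reward scale for group A
stats.update("group_b", [0.1, 0.2, 0.15])  # hypothetical reward scale for group B
adv_a = stats.advantages("group_a", [0.85, 0.6])
adv_b = stats.advantages("group_b", [0.3, 0.05])
```

In a pooled batch, standard normalization would push group B's rewards below the shared mean regardless of response quality; per-group baselines keep the contrastive signal within each preference group, which is the heterogeneity the abstract refers to.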

Related readings and updates.

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown…

Read more
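
For context on the two reward-learning routes mentioned in that abstract, here is a compact sketch of both objectives on a single preference pair; the function names and the beta default are illustrative, and only the general Bradley-Terry and DPO formulations are assumed.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(r_chosen, r_rejected):
    """1) Explicit reward model: Bradley-Terry loss on scalar scores
    produced by a learned reward model for chosen vs. rejected responses."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """2) Implicit reward via DPO: the reward is beta * log(pi / pi_ref),
    so the same pairwise objective is written directly on policy log-probs."""
    chosen = beta * (logp_chosen - ref_logp_chosen)
    rejected = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen - rejected).mean()
```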

This paper was accepted at the Human-in-the-Loop Learning Workshop at NeurIPS 2022.

Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of hand-crafted reward functions by distilling them from human preference feedback, but they remain impractical due to the burdensome number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment…

Read more
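
The preference-based RL setup described above is commonly trained with a Bradley-Terry model over summed per-step rewards for two trajectory segments. A minimal sketch under that assumption (tensor shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F


def segment_preference_loss(per_step_r1, per_step_r2, label):
    """Cross-entropy on P(segment 1 preferred) = sigmoid(sum r1 - sum r2),
    where per_step_r* are predicted rewards along each trajectory segment
    and label is 1.0 if the annotator preferred segment 1, else 0.0."""
    logits = per_step_r1.sum(dim=-1) - per_step_r2.sum(dim=-1)
    return F.binary_cross_entropy_with_logits(logits, label)
```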
