Alignment

GRPO: Group Relative Policy Optimization

Reinforcement learning from human feedback (RLHF) has become central to aligning language models with human preferences, but current methods such as PPO are sample-inefficient and unstable. Today we introduce Group Relative Policy Optimization (GRPO), a new approach that addresses these limitations.

The RLHF Challenge

Standard RLHF follows three steps:

1. Train a reward model on human preference data
2. Use the reward model to provide training signal
3. Optimize the policy with reinforcement learning (typically PPO)

Step 3 is problematic....

September 19, 2022 · 3 min · 522 words · Zach Kelling

GRPO: Group Relative Policy Optimization

Beyond PPO

Proximal Policy Optimization (PPO) has become the de facto algorithm for reinforcement learning from human feedback. Yet PPO has fundamental limitations when applied to language models:

- Absolute reward dependence: PPO optimizes absolute reward values, which are noisy and poorly calibrated
- KL divergence sensitivity: the KL penalty requires careful tuning to avoid collapse or divergence
- Sample inefficiency: each prompt generates one response for learning
- Reward hacking: models exploit reward model weaknesses

Group Relative Policy Optimization (GRPO) addresses these issues through a simple insight: relative comparisons are more informative than absolute scores....

September 18, 2022 · 4 min · 674 words · Zach Kelling
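The "relative comparisons" insight from the second post can be sketched in a few lines: instead of feeding raw reward-model scores to the policy update, sample a group of responses per prompt and standardize their rewards within the group. This is a minimal illustration, not code from either post; the function name and the mean/std normalization form are assumptions based on how group-relative advantages are commonly computed.

```python
import statistics

def group_relative_advantages(rewards):
    """Turn absolute reward-model scores for a group of responses to the
    same prompt into relative advantages by standardizing within the group.
    (Hypothetical helper for illustration, not from the posts.)"""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt, each scored by a reward model:
advantages = group_relative_advantages([2.0, 4.0, 6.0, 8.0])
print(advantages)  # zero-mean scores: above-average responses get positive advantage
```

Because the advantages are centered within each group, a miscalibrated or drifting reward scale cancels out, which is one way the approach sidesteps the "absolute reward dependence" problem listed above.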