GRPO: Group Relative Policy Optimization

Beyond PPO

Proximal Policy Optimization (PPO) has become the de facto algorithm for reinforcement learning from human feedback. Yet PPO has fundamental limitations when applied to language models:

- Absolute reward dependence: PPO optimizes absolute reward values, which are noisy and poorly calibrated
- KL divergence sensitivity: The KL penalty requires careful tuning to avoid collapse or divergence
- Sample inefficiency: Each prompt generates only one response for learning
- Reward hacking: Models exploit reward model weaknesses

Group Relative Policy Optimization (GRPO) addresses these issues through a simple insight: relative comparisons are more informative than absolute scores....
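The core of that insight can be sketched in a few lines: instead of using a response's raw reward, GRPO samples several responses per prompt and scores each one relative to its group. A minimal sketch (the helper name and example rewards are illustrative, not from the post):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Hypothetical helper: normalize each reward against its own group,
    so the advantage reflects relative quality, not absolute score."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored by a reward model:
advantages = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Because the advantages are centered within each group, miscalibration or noise in the reward model's absolute scale cancels out; only the ordering of responses to the same prompt matters.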

September 18, 2022 · 4 min · 674 words · Zach Kelling