GRPO: Group Relative Policy Optimization

Beyond PPO

Proximal Policy Optimization (PPO) has become the de facto algorithm for reinforcement learning from human feedback. Yet PPO has fundamental limitations when applied to language models:

- Absolute reward dependence: PPO optimizes absolute reward values, which are noisy and poorly calibrated
- KL divergence sensitivity: The KL penalty requires careful tuning to avoid collapse or divergence
- Sample inefficiency: Each prompt generates only one response for learning
- Reward hacking: Models exploit reward model weaknesses

Group Relative Policy Optimization (GRPO) addresses these issues through a simple insight: relative comparisons are more informative than absolute scores....
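The core of that insight can be sketched in a few lines: instead of using a response's raw reward, GRPO samples several responses per prompt and scores each one relative to its group. A minimal sketch (the helper name and example rewards are illustrative, not from the post):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Hypothetical helper: normalize each reward against its own group,
    so the advantage reflects relative quality, not absolute score."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored by a reward model:
advantages = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Because the advantages are centered within each group, miscalibration or noise in the reward model's absolute scale cancels out; only the ordering of responses to the same prompt matters.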

September 18, 2022 · 4 min · 674 words · Zach Kelling