GRPO: Group Relative Policy Optimization
Reinforcement learning from human feedback (RLHF) has become central to aligning language models with human preferences, but current methods such as PPO are sample-inefficient and unstable. Today we introduce Group Relative Policy Optimization (GRPO), a new approach that addresses these limitations.

The RLHF Challenge

Standard RLHF follows three steps:

1. Train a reward model on human preference data.
2. Use the reward model to provide a training signal.
3. Optimize the policy with reinforcement learning (typically PPO).

Step 3 is problematic....
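To make the "group relative" idea concrete, here is a minimal illustrative sketch, not code from this post: it assumes GRPO samples a group of responses per prompt, scores each with the reward model, and normalizes each reward against the group's mean and standard deviation to get a per-response advantage. The function name and the exact normalization are assumptions for illustration.

```python
# Illustrative sketch (assumed normalization): advantage of each sampled
# response = (reward - group mean) / group std, computed within one group
# of responses to the same prompt.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize per-response rewards within one sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled responses to one prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 0.5, 0.2, 0.9])
```

Because the baseline is the group mean rather than a learned value function, responses compete only against siblings from the same prompt, which is what removes the need for a separate critic in this style of method.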