Reinforcement learning from human feedback (RLHF) has become central to aligning language models with human preferences. But current methods like PPO are sample-inefficient and unstable. Today we introduce Group Relative Policy Optimization (GRPO), a new approach that addresses these limitations.
The RLHF Challenge
Standard RLHF follows three steps:
- Train a reward model on human preference data
- Use the reward model to provide training signal
- Optimize the policy with reinforcement learning (typically PPO)
Step 3 is problematic. PPO requires careful hyperparameter tuning, extensive sampling, and still often produces unstable training dynamics.
Key Insight: Relative Comparisons
Humans naturally make relative judgments. “Response A is better than B” comes more easily than “Response A scores 7.3.” Yet reward models output absolute scores that discard this relational structure.
GRPO preserves relative comparisons throughout training.
The GRPO Algorithm
Instead of optimizing absolute rewards, GRPO optimizes within groups of sampled responses.
Sampling
For each prompt $x$, sample a group of $k$ responses:
$$G = {y_1, y_2, …, y_k} \sim \pi_\theta(y|x)$$
Ranking
Rank responses within the group using the reward model:
$$r_i = R(x, y_i)$$ $$\text{rank}(y_i) = |{j : r_j > r_i}| + 1$$
Relative Advantage
Compute advantages relative to the group:
$$A_i = \frac{r_i - \mu_G}{\sigma_G}$$
where $\mu_G$ and $\sigma_G$ are the group mean and standard deviation.
Policy Update
Update the policy to increase probability of high-ranked responses:
$$\mathcal{L}(\theta) = -\mathbb{E}{y \sim G}\left[\min\left(\frac{\pi\theta(y|x)}{\pi_{\text{old}}(y|x)} A, \text{clip}\left(\frac{\pi_\theta(y|x)}{\pi_{\text{old}}(y|x)}, 1-\epsilon, 1+\epsilon\right) A\right)\right]$$
The clipping stabilizes training, similar to PPO, but the relative advantages provide better signal.
Why GRPO Works
Robust to Reward Scale
Absolute reward values vary across prompts and reward model calibration. Normalization within groups removes this sensitivity.
Better Gradient Signal
Groups provide multiple comparison points per prompt. This reduces variance and improves sample efficiency.
Natural Curriculum
Hard prompts where all responses score poorly still provide useful gradients. The best-in-group response gets positive advantage even if absolute rewards are low.
Reduced Reward Hacking
Optimizing relative rankings is harder to game than optimizing absolute scores. The model must genuinely improve, not find reward model exploits.
Experimental Results
We compared GRPO to PPO on alignment benchmarks:
| Method | Helpfulness | Harmlessness | Honesty | Samples Required |
|---|---|---|---|---|
| PPO | 78.2 | 82.1 | 75.4 | 100K |
| GRPO | 81.7 | 84.3 | 79.1 | 35K |
GRPO achieves better alignment with 3x fewer samples.
Training dynamics also improve significantly:
- Stability: GRPO loss curves show less variance
- Convergence: Reaches final performance 2x faster
- Robustness: Less sensitive to learning rate choice
Implementation Details
Key hyperparameters:
- Group size ($k$): 8-16 works well
- Clipping ($\epsilon$): 0.1-0.2
- KL penalty: Lower than PPO (0.01 vs 0.1)
The larger group size compared to PPO’s typical 2-response setup is essential. More comparisons mean better gradient estimates.
Code Release
We’re releasing our GRPO implementation integrated with the Zen training framework:
from zen.alignment import GRPOTrainer
trainer = GRPOTrainer(
model=model,
reward_model=reward_model,
group_size=12,
clip_epsilon=0.15,
)
trainer.train(prompts)
Full documentation and examples at github.com/zoo-labs/zen-align.
What’s Next
GRPO opens several research directions:
- Multi-objective GRPO: Separate groups for different alignment dimensions
- Online preference learning: Update reward model during training
- Constitutional GRPO: Use principles instead of learned rewards
Alignment is the central challenge of AI development. GRPO is one step toward making it more reliable.
Zach Kelling is a co-founder of Zoo Labs Foundation.