Reinforcement learning from human feedback (RLHF) has become central to aligning language models with human preferences. But current methods like PPO are sample-inefficient and unstable. Today we introduce Group Relative Policy Optimization (GRPO), a new approach that addresses these limitations.

The RLHF Challenge

Standard RLHF follows three steps:

  1. Train a reward model on human preference data
  2. Use the reward model to provide training signal
  3. Optimize the policy with reinforcement learning (typically PPO)

Step 3 is problematic. PPO requires careful hyperparameter tuning, extensive sampling, and still often produces unstable training dynamics.

Key Insight: Relative Comparisons

Humans naturally make relative judgments. “Response A is better than B” comes more easily than “Response A scores 7.3.” Yet reward models output absolute scores that discard this relational structure.

GRPO preserves relative comparisons throughout training.

The GRPO Algorithm

Instead of optimizing absolute rewards, GRPO optimizes within groups of sampled responses.

Sampling

For each prompt $x$, sample a group of $k$ responses:

$$G = {y_1, y_2, …, y_k} \sim \pi_\theta(y|x)$$

Ranking

Rank responses within the group using the reward model:

$$r_i = R(x, y_i)$$ $$\text{rank}(y_i) = |{j : r_j > r_i}| + 1$$

Relative Advantage

Compute advantages relative to the group:

$$A_i = \frac{r_i - \mu_G}{\sigma_G}$$

where $\mu_G$ and $\sigma_G$ are the group mean and standard deviation.

Policy Update

Update the policy to increase probability of high-ranked responses:

$$\mathcal{L}(\theta) = -\mathbb{E}{y \sim G}\left[\min\left(\frac{\pi\theta(y|x)}{\pi_{\text{old}}(y|x)} A, \text{clip}\left(\frac{\pi_\theta(y|x)}{\pi_{\text{old}}(y|x)}, 1-\epsilon, 1+\epsilon\right) A\right)\right]$$

The clipping stabilizes training, similar to PPO, but the relative advantages provide better signal.

Why GRPO Works

Robust to Reward Scale

Absolute reward values vary across prompts and reward model calibration. Normalization within groups removes this sensitivity.

Better Gradient Signal

Groups provide multiple comparison points per prompt. This reduces variance and improves sample efficiency.

Natural Curriculum

Hard prompts where all responses score poorly still provide useful gradients. The best-in-group response gets positive advantage even if absolute rewards are low.

Reduced Reward Hacking

Optimizing relative rankings is harder to game than optimizing absolute scores. The model must genuinely improve, not find reward model exploits.

Experimental Results

We compared GRPO to PPO on alignment benchmarks:

MethodHelpfulnessHarmlessnessHonestySamples Required
PPO78.282.175.4100K
GRPO81.784.379.135K

GRPO achieves better alignment with 3x fewer samples.

Training dynamics also improve significantly:

  • Stability: GRPO loss curves show less variance
  • Convergence: Reaches final performance 2x faster
  • Robustness: Less sensitive to learning rate choice

Implementation Details

Key hyperparameters:

  • Group size ($k$): 8-16 works well
  • Clipping ($\epsilon$): 0.1-0.2
  • KL penalty: Lower than PPO (0.01 vs 0.1)

The larger group size compared to PPO’s typical 2-response setup is essential. More comparisons mean better gradient estimates.

Code Release

We’re releasing our GRPO implementation integrated with the Zen training framework:

from zen.alignment import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_model=reward_model,
    group_size=12,
    clip_epsilon=0.15,
)

trainer.train(prompts)

Full documentation and examples at github.com/zoo-labs/zen-align.

What’s Next

GRPO opens several research directions:

  • Multi-objective GRPO: Separate groups for different alignment dimensions
  • Online preference learning: Update reward model during training
  • Constitutional GRPO: Use principles instead of learned rewards

Alignment is the central challenge of AI development. GRPO is one step toward making it more reliable.


Zach Kelling is a co-founder of Zoo Labs Foundation.