GRPO: Group Relative Policy Optimization

Reinforcement learning from human feedback (RLHF) has become central to aligning language models with human preferences. But current methods like PPO are sample-inefficient and unstable. Today we introduce Group Relative Policy Optimization (GRPO), a new approach that addresses these limitations.

The RLHF Challenge

Standard RLHF follows three steps:

Train a reward model on human preference data
Use the reward model to provide training signal
Optimize the policy with reinforcement learning (typically PPO)

Step 3 is problematic. PPO requires careful hyperparameter tuning, extensive sampling, and still often produces unstable training dynamics.

Key Insight: Relative Comparisons

Humans naturally make relative judgments. “Response A is better than B” comes more easily than “Response A scores 7.3.” Yet reward models output absolute scores that discard this relational structure.

GRPO preserves relative comparisons throughout training.

The GRPO Algorithm

Instead of optimizing absolute rewards, GRPO optimizes within groups of sampled responses.

Sampling

For each prompt $x$, sample a group of $k$ responses:

$$G = {y_1, y_2, …, y_k} \sim \pi_\theta(y|x)$$

Ranking

Rank responses within the group using the reward model:

$$r_i = R(x, y_i)$$ $$\text{rank}(y_i) = |{j : r_j > r_i}| + 1$$

Relative Advantage

Compute advantages relative to the group:

$$A_i = \frac{r_i - \mu_G}{\sigma_G}$$

where $\mu_G$ and $\sigma_G$ are the group mean and standard deviation.

Policy Update

Update the policy to increase probability of high-ranked responses:

$$\mathcal{L}(\theta) = -\mathbb{E}{y \sim G}\left[\min\left(\frac{\pi\theta(y|x)}{\pi_{\text{old}}(y|x)} A, \text{clip}\left(\frac{\pi_\theta(y|x)}{\pi_{\text{old}}(y|x)}, 1-\epsilon, 1+\epsilon\right) A\right)\right]$$

The clipping stabilizes training, similar to PPO, but the relative advantages provide better signal.

Why GRPO Works

Robust to Reward Scale

Absolute reward values vary across prompts and reward model calibration. Normalization within groups removes this sensitivity.

Better Gradient Signal

Groups provide multiple comparison points per prompt. This reduces variance and improves sample efficiency.

Natural Curriculum

Hard prompts where all responses score poorly still provide useful gradients. The best-in-group response gets positive advantage even if absolute rewards are low.

Reduced Reward Hacking

Optimizing relative rankings is harder to game than optimizing absolute scores. The model must genuinely improve, not find reward model exploits.

Experimental Results

We compared GRPO to PPO on alignment benchmarks:

Method	Helpfulness	Harmlessness	Honesty	Samples Required
PPO	78.2	82.1	75.4	100K
GRPO	81.7	84.3	79.1	35K

GRPO achieves better alignment with 3x fewer samples.

Training dynamics also improve significantly:

Stability: GRPO loss curves show less variance
Convergence: Reaches final performance 2x faster
Robustness: Less sensitive to learning rate choice

Implementation Details

Key hyperparameters:

Group size ($k$): 8-16 works well
Clipping ($\epsilon$): 0.1-0.2
KL penalty: Lower than PPO (0.01 vs 0.1)

The larger group size compared to PPO’s typical 2-response setup is essential. More comparisons mean better gradient estimates.

Code Release

We’re releasing our GRPO implementation integrated with the Zen training framework:

from zen.alignment import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_model=reward_model,
    group_size=12,
    clip_epsilon=0.15,
)

trainer.train(prompts)

Full documentation and examples at github.com/zoo-labs/zen-align.

What’s Next

GRPO opens several research directions:

Multi-objective GRPO: Separate groups for different alignment dimensions
Online preference learning: Update reward model during training
Constitutional GRPO: Use principles instead of learned rewards

Alignment is the central challenge of AI development. GRPO is one step toward making it more reliable.

Zach Kelling is a co-founder of Zoo Labs Foundation.

GRPO: Group Relative Policy Optimization

The RLHF Challenge#

Key Insight: Relative Comparisons#

The GRPO Algorithm#

Sampling#

Ranking#

Relative Advantage#

Policy Update#

Why GRPO Works#

Robust to Reward Scale#

Better Gradient Signal#

Natural Curriculum#

Reduced Reward Hacking#

Experimental Results#

Implementation Details#

Code Release#

What’s Next#