Beyond PPO

Proximal Policy Optimization (PPO) has become the de facto algorithm for reinforcement learning from human feedback. Yet PPO has fundamental limitations when applied to language models:

Absolute reward dependence: PPO optimizes absolute reward values, which are noisy and poorly calibrated
KL divergence sensitivity: The KL penalty requires careful tuning to avoid collapse or divergence
Sample inefficiency: Each prompt generates one response for learning
Reward hacking: Models exploit reward model weaknesses

Group Relative Policy Optimization (GRPO) addresses these issues through a simple insight: relative comparisons are more informative than absolute scores.

The GRPO Algorithm

Instead of scoring each response in isolation, GRPO samples a group of $K$ responses per prompt and learns from how each response scores relative to the rest of its group.

Response Generation

For each prompt $x$, sample $K$ responses from the current policy:

$$y_1, y_2, \ldots, y_K \sim \pi_\theta(\cdot | x)$$

Reward Computation

Score all responses with the reward model:

$$r_i = R(x, y_i) \quad \text{for } i = 1, \ldots, K$$

Advantage Estimation

Compute group-relative advantages:

$$A_i = \frac{r_i - \mu_r}{\sigma_r}$$

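Where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of the rewards within the group. When every response in a group receives the same reward, $\sigma_r = 0$, so in practice a small constant is added to the denominator, as in the reference implementation at the end of this post.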
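As a minimal sketch of this step, the function below z-scores a list of rewards using only the standard library. The name `group_advantages` and the `eps` guard are illustrative assumptions rather than part of the method description; `eps` simply keeps the division defined when all rewards in a group coincide.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Z-score the rewards of one group (hypothetical helper)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for four responses to the same prompt (made-up values)
advantages = group_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 3) for a in advantages])
```

The advantages sum to roughly zero within the group, so above-average responses are pushed up and below-average responses are pushed down.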

Policy Update

Update the policy to increase probability of high-advantage responses:

$$\mathcal{L}_{GRPO} = -\mathbb{E}_{x, y \sim \pi_\theta}\left[\frac{\pi_\theta(y|x)}{\pi_{old}(y|x)} \cdot A(x, y) \cdot \mathbb{1}_{clip}\right]$$

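Where $\mathbb{1}_{clip}$ is shorthand for PPO-style clipping of the importance ratio: the ratio is clipped to $[1 - \epsilon, 1 + \epsilon]$ and the smaller of the clipped and unclipped terms is kept.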
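A minimal sketch of the clipped term for a single response, assuming the standard PPO form. The function name `clipped_surrogate` is illustrative, and the default `clip_eps=0.2` mirrors the value used in the reference implementation later in this post.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one response (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum keeps the more pessimistic of the two estimates
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the benefit of raising the ratio is capped at 1 + clip_eps
print(clipped_surrogate(1.5, 1.0))
# Negative advantage: the unclipped, more negative term is kept
print(clipped_surrogate(1.5, -1.0))
```

The loss is the negative of this quantity, averaged over all responses in all groups.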

Why Group-Relative?

Noise Robustness

Reward models are noisy. A response scored 0.7 versus 0.6 may not be meaningfully better. But within a group of responses to the same prompt, relative ordering is more reliable:

Metric	Absolute Score	Relative Rank
Inter-annotator agreement	0.61	0.83
Test-retest reliability	0.54	0.79
Reward model calibration	Poor	N/A

Natural Normalization

Group-relative advantages automatically adapt to reward scale and prompt difficulty:

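Easy prompts: When every response earns the same high score, all advantages are zero and the prompt contributes no update
Hard prompts: Rewards vary widely across the group, giving a clear signal for improvement
Reward drift: Subtracting the group mean cancels shifts in the reward baseline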
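These properties can be checked numerically. The sketch below is an illustration with made-up reward values and a hypothetical helper `z_scores`; it assumes the same z-score normalization as the advantage formula above, plus a small `eps` in the denominator.

```python
import statistics

def z_scores(rewards, eps=1e-8):
    """Group-relative advantages for one prompt (hypothetical helper)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

raw = [0.1, 0.5, 0.9]
rescaled = [10.0 * r + 3.0 for r in raw]  # same ordering, different scale and baseline
saturated = [0.95, 0.95, 0.95]            # an "easy" prompt: all responses score alike

print(z_scores(raw))
print(z_scores(rescaled))   # matches the line above up to rounding
print(z_scores(saturated))  # effectively all zeros
```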

Sample Efficiency

Generating $K$ responses per prompt and comparing them yields $\binom{K}{2}$ implicit pairwise comparisons. For $K=8$, that’s 28 comparisons per prompt, versus a single scored response for standard PPO.
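The count is easy to verify by enumerating the pairs directly; this short check uses only the standard library.

```python
from itertools import combinations
from math import comb

K = 8
pairs = list(combinations(range(K), 2))  # every unordered pair of response indices
print(len(pairs), comb(K, 2))  # 28 28
```

Note that the group-relative advantage condenses these comparisons into $K$ scalars, one per response, so the pairs are not independent signals.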

Implementation Details

Group Size Selection

We find $K=8$ provides a good tradeoff:

K	Compute	Signal Quality	Best Accuracy
2	2x	Low	71.2%
4	4x	Medium	74.8%
8	8x	High	77.3%
16	16x	Marginal gain	77.9%
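The diminishing returns can be made explicit with a little arithmetic. The sketch below uses only the numbers in the table and assumes, as the Compute column suggests, that cost grows linearly with $K$.

```python
# (K, best accuracy in %) copied from the table above
results = [(2, 71.2), (4, 74.8), (8, 77.3), (16, 77.9)]

gains = []
for (k_prev, acc_prev), (k_next, acc_next) in zip(results, results[1:]):
    gain = round(acc_next - acc_prev, 1)  # accuracy points gained by doubling K
    extra_compute = k_next - k_prev       # additional compute units spent
    gains.append(gain)
    print(f"K={k_prev}->{k_next}: +{gain} points for {extra_compute} extra units of compute")
```

Each doubling of $K$ buys fewer points than the one before, while the extra compute it costs doubles.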

Temperature Schedule

Higher temperature during response generation increases group diversity:

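def sample_group(prompt, policy, K=8):
    """Sample K responses, ramping the temperature linearly from 0.7 to 1.0."""
    responses = []
    for i in range(K):
        # Divide by K - 1 so the last sample actually reaches 1.0
        temp = 0.7 + 0.3 * i / max(K - 1, 1)
        response = policy.sample(prompt, temperature=temp)
        responses.append(response)
    return responses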

KL Regularization

GRPO still benefits from KL regularization, but with reduced sensitivity:

$$\mathcal{L} = \mathcal{L}_{GRPO} + \beta \cdot D_{KL}(\pi_\theta || \pi_{ref})$$

We find that a single value, $\beta = 0.01$, works across tasks, whereas PPO typically needs $\beta$ tuned per task somewhere in the range $[0.001, 0.1]$.
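One way to compute the penalty, sketched here under stated assumptions: the KL term is estimated by Monte Carlo as the average of $\log \pi_\theta(y|x) - \log \pi_{ref}(y|x)$ over responses sampled from $\pi_\theta$. The function names and the log-probability values are hypothetical, and the post does not say which estimator the authors used.

```python
def kl_estimate(logp_policy, logp_ref):
    """Monte Carlo estimate of KL(pi_theta || pi_ref).

    Both arguments hold sequence log-probabilities of the same responses,
    which are assumed to have been sampled from pi_theta.
    """
    diffs = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(diffs) / len(diffs)

def regularized_loss(policy_loss, logp_policy, logp_ref, beta=0.01):
    """Add the beta-weighted KL penalty to the policy loss."""
    return policy_loss + beta * kl_estimate(logp_policy, logp_ref)

# Hypothetical log-probabilities for three sampled responses
logp_policy = [-12.0, -15.5, -9.0]
logp_ref = [-12.5, -16.0, -10.0]
print(regularized_loss(-0.30, logp_policy, logp_ref))
```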

Experimental Results

On Anthropic’s HH-RLHF benchmark:

Method	Helpfulness	Harmlessness	Compute
SFT	3.2/5	3.8/5	1x
PPO	3.9/5	4.1/5	10x
GRPO	4.2/5	4.3/5	8x

GRPO scores higher than PPO on both helpfulness and harmlessness while using less compute (8x versus 10x in the table above), because it makes efficient use of every generated sample.

Reward Hacking Resistance

GRPO is naturally resistant to reward hacking because:

Relative comparison: Hacked responses must beat other responses, not just achieve high absolute score
Diverse sampling: Temperature variation produces varied response styles
Group normalization: Exploits that boost all responses equally provide no gradient

We observe significantly less length gaming and repetition compared to PPO.

Code

Reference implementation:

import torch

def grpo_loss(policy, prompts, reward_model, K=8, clip_eps=0.2):
    losses = []
    for prompt in prompts:
        # Generate response group
        responses = sample_group(prompt, policy, K)

        # Compute rewards and group-relative advantages (constants, no gradient)
        rewards = torch.tensor([float(reward_model(prompt, r)) for r in responses])
        advantages = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)

        # Clipped policy loss. policy.prob and policy.prob_old are assumed to
        # return the probability of the response given the prompt as a tensor.
        for response, advantage in zip(responses, advantages):
            ratio = policy.prob(prompt, response) / policy.prob_old(prompt, response)
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
            loss = -torch.min(ratio * advantage, clipped * advantage)
            losses.append(loss)

    return torch.mean(torch.stack(losses))

Conclusion

GRPO offers a simple improvement to RLHF: generate multiple responses, compare them relatively, update toward the best. This approach is more robust, more sample-efficient, and more resistant to reward hacking than standard PPO.

The algorithm is simple enough to implement in an afternoon. The gains are substantial enough to matter.

Full details in “Group Relative Policy Optimization for Language Model Alignment” (2022). Code at github.com/zen-ai/grpo.
