Beyond PPO

Proximal Policy Optimization (PPO) has become the de facto algorithm for reinforcement learning from human feedback. Yet PPO has fundamental limitations when applied to language models:

  1. Absolute reward dependence: PPO optimizes absolute reward values, which are noisy and poorly calibrated
  2. KL divergence sensitivity: The KL penalty requires careful tuning to avoid collapse or divergence
  3. Sample inefficiency: Each prompt generates one response for learning
  4. Reward hacking: Models exploit reward model weaknesses

Group Relative Policy Optimization (GRPO) addresses these issues through a simple insight: relative comparisons are more informative than absolute scores.

The GRPO Algorithm

Instead of scoring individual responses, GRPO generates a group of $K$ responses per prompt and learns from their relative rankings.

Response Generation

For each prompt $x$, sample $K$ responses from the current policy:

$$y_1, y_2, \ldots, y_K \sim \pi_\theta(\cdot | x)$$

Reward Computation

Score all responses with the reward model:

$$r_i = R(x, y_i) \quad \text{for } i = 1, \ldots, K$$

Advantage Estimation

Compute group-relative advantages:

$$A_i = \frac{r_i - \mu_r}{\sigma_r}$$

Where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of rewards within the group.

Policy Update

Update the policy to increase probability of high-advantage responses:

$$\mathcal{L}{GRPO} = -\mathbb{E}{x, y \sim \pi_\theta}\left[\frac{\pi_\theta(y|x)}{\pi_{old}(y|x)} \cdot A(x, y) \cdot \mathbb{1}_{clip}\right]$$

Where $\mathbb{1}_{clip}$ applies PPO-style clipping to the importance ratio.

Why Group-Relative?

Noise Robustness

Reward models are noisy. A response scored 0.7 versus 0.6 may not be meaningfully better. But within a group of responses to the same prompt, relative ordering is more reliable:

MetricAbsolute ScoreRelative Rank
Inter-annotator agreement0.610.83
Test-retest reliability0.540.79
Reward model calibrationPoorN/A

Natural Normalization

Group-relative advantages automatically adapt to reward scale and prompt difficulty:

  • Easy prompts: All responses score high, advantages near zero
  • Hard prompts: Large variance, clear signal for improvement
  • Reward drift: Normalization handles changing baselines

Sample Efficiency

Generating $K$ responses per prompt and comparing them provides $\binom{K}{2}$ pairwise comparisons. For $K=8$, that’s 28 learning signals per prompt versus 1 for standard PPO.

Implementation Details

Group Size Selection

We find $K=8$ provides a good tradeoff:

KComputeSignal QualityBest Accuracy
22xLow71.2%
44xMedium74.8%
88xHigh77.3%
1616xMarginal gain77.9%

Temperature Schedule

Higher temperature during response generation increases group diversity:

def sample_group(prompt, policy, K=8):
    responses = []
    for i in range(K):
        temp = 0.7 + 0.3 * (i / K)  # 0.7 to 1.0
        response = policy.sample(prompt, temperature=temp)
        responses.append(response)
    return responses

KL Regularization

GRPO still benefits from KL regularization, but with reduced sensitivity:

$$\mathcal{L} = \mathcal{L}{GRPO} + \beta \cdot D{KL}(\pi_\theta || \pi_{ref})$$

We find $\beta = 0.01$ works across tasks, compared to PPO’s typical $\beta \in [0.001, 0.1]$ sensitivity.

Experimental Results

On Anthropic’s HH-RLHF benchmark:

MethodHelpfulnessHarmlessnessCompute
SFT3.2/53.8/51x
PPO3.9/54.1/510x
GRPO4.2/54.3/58x

GRPO achieves better alignment with less compute through efficient use of generated samples.

Reward Hacking Resistance

GRPO is naturally resistant to reward hacking because:

  1. Relative comparison: Hacked responses must beat other responses, not just achieve high absolute score
  2. Diverse sampling: Temperature variation produces varied response styles
  3. Group normalization: Exploits that boost all responses equally provide no gradient

We observe significantly less length gaming and repetition compared to PPO.

Code

Reference implementation:

def grpo_loss(policy, prompts, reward_model, K=8, clip_eps=0.2):
    losses = []
    for prompt in prompts:
        # Generate response group
        responses = sample_group(prompt, policy, K)

        # Compute rewards and advantages
        rewards = [reward_model(prompt, r) for r in responses]
        advantages = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-8)

        # Policy loss
        for response, advantage in zip(responses, advantages):
            ratio = policy.prob(response) / policy.prob_old(response)
            clipped = torch.clamp(ratio, 1-clip_eps, 1+clip_eps)
            loss = -torch.min(ratio * advantage, clipped * advantage)
            losses.append(loss)

    return torch.mean(torch.stack(losses))

Conclusion

GRPO offers a simple improvement to RLHF: generate multiple responses, compare them relatively, update toward the best. This approach is more robust, more sample-efficient, and more resistant to reward hacking than standard PPO.

The algorithm is simple enough to implement in an afternoon. The gains are substantial enough to matter.


Full details in “Group Relative Policy Optimization for Language Model Alignment” (2022). Code at github.com/zen-ai/grpo.