Research

GRPO: Group Relative Policy Optimization

Beyond PPO

Proximal Policy Optimization (PPO) has become the de facto algorithm for reinforcement learning from human feedback. Yet PPO has fundamental limitations when applied to language models:

- Absolute reward dependence: PPO optimizes absolute reward values, which are noisy and poorly calibrated
- KL divergence sensitivity: The KL penalty requires careful tuning to avoid collapse or divergence
- Sample inefficiency: Each prompt generates one response for learning
- Reward hacking: Models exploit reward model weaknesses

Group Relative Policy Optimization (GRPO) addresses these issues through a simple insight: relative comparisons are more informative than absolute scores....
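The relative-comparison idea can be sketched as follows: sample a group of responses for the same prompt, then score each response against the group's own statistics rather than against an absolute reward scale. This is an illustrative sketch only, not the post's exact formulation; the function name and the epsilon stabilizer are my own choices.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Turn a group of rewards for one prompt into relative advantages.

    Each response's advantage is its reward minus the group mean,
    scaled by the group's standard deviation, so only the relative
    ranking within the group matters, not the absolute reward values.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantages are centered within each group, a miscalibrated reward model that shifts all scores up or down leaves the learning signal unchanged.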

September 18, 2022 · 4 min · 674 words · Zach Kelling

Federated Learning Without Compromise

The Privacy-Utility Tradeoff

Federated learning promises to train models on distributed data without centralizing sensitive information. In practice, existing approaches force uncomfortable tradeoffs:

- Differential privacy adds noise that degrades model quality
- Secure aggregation increases communication costs
- Data heterogeneity causes convergence problems
- Byzantine participants can poison the model

We present techniques that mitigate these tradeoffs.

Our Approach

Adaptive Clipping

Standard gradient clipping uses a fixed threshold $C$:

$$g_i^{\text{clipped}} = g_i \cdot \min\left(1, \frac{C}{\|g_i\|}\right)$$...
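The fixed-threshold clipping formula above can be written directly in code: scale the gradient so its L2 norm never exceeds $C$. This is a minimal sketch of the baseline the post improves on (the adaptive variant is not reproduced here); the function name is mine.

```python
import math

def clip_gradient(g, C):
    """Standard gradient clipping with fixed threshold C:
    g_clipped = g * min(1, C / ||g||), where ||g|| is the L2 norm."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [x * scale for x in g]
```

Gradients already inside the norm ball pass through unchanged; larger ones are rescaled onto its boundary.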

May 30, 2022 · 3 min · 438 words · Zach Kelling

Federated Learning for Open AI

Training large language models requires vast amounts of data. That data often contains sensitive information. Federated learning offers a path to train on distributed, private data without centralizing it.

The Centralization Problem

Traditional ML training follows a simple pattern: collect data, aggregate it centrally, train models. This creates problems:

- Privacy risk: Sensitive data leaves user control
- Legal barriers: Regulations prevent data movement across jurisdictions
- Trust requirements: Data holders must trust the training party
- Single points of failure: Central aggregation creates vulnerabilities

Federated Learning Basics

Federated learning inverts the pattern....
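The inverted pattern is commonly realized with federated averaging: clients train locally and the server only aggregates their parameters, weighted by local dataset size. The excerpt does not specify the aggregation rule, so this is a hypothetical sketch of the canonical FedAvg step, with names of my own choosing.

```python
def federated_average(client_weights, client_sizes):
    """One FedAvg aggregation step: average client model parameters,
    weighting each client by the size of its local dataset.
    client_weights: list of per-client parameter vectors (same length).
    client_sizes: number of local examples at each client."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * s for w, s in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]
```

Only parameter vectors cross the network; raw training examples never leave their holders.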

May 9, 2022 · 3 min · 510 words · Zach Kelling

Experience Ledgers: Persistent Memory for AI Agents

AI agents today suffer from amnesia. Each conversation starts fresh. Each session forgets the last. This isn’t just an inconvenience; it’s a fundamental limitation on what agents can become. Today we introduce experience ledgers, a framework for persistent, verifiable agent memory.

The Memory Problem

Current language models operate in bounded context windows. Information from past interactions must be explicitly retrieved or summarized. This creates several challenges:

- Context limits: Models can only attend to finite token sequences
- Retrieval failures: Important context gets lost or incorrectly recalled
- No learning: Agents don’t improve from experience within deployment
- Trust gap: Users can’t verify what the agent “remembers”

Experience Ledgers

An experience ledger is an append-only log of agent experiences with cryptographic attestation....
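One way to make an append-only log verifiable is to chain entries with hashes: each entry commits to its predecessor, so altering any past experience breaks every subsequent link. The excerpt does not describe the attestation scheme, so this SHA-256 hash chain is an illustrative sketch under that assumption, with a class name of my own invention.

```python
import hashlib
import json

class ExperienceLedger:
    """Append-only log where each entry commits to the previous entry's
    hash, so past experiences cannot be silently altered or reordered."""

    GENESIS = "0" * 64  # sentinel hash for the first entry

    def __init__(self):
        self.entries = []

    def append(self, experience: dict) -> str:
        """Record an experience, chained to the previous entry's hash."""
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        payload = json.dumps({"prev": prev, "data": experience}, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.entries.append({"prev": prev, "data": experience, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any tampered entry invalidates the ledger."""
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps({"prev": prev, "data": e["data"]}, sort_keys=True)
            if e["prev"] != prev or hashlib.sha256(payload.encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

A user (or third party) holding only the latest hash can audit everything the agent claims to remember, which is the trust property the post is after.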

February 14, 2022 · 3 min · 469 words · Zach Kelling

Training LLMs on Collective Intelligence

Language models are trained on text. That text represents the accumulated knowledge, reasoning, and creativity of countless individuals. Yet the curation process that selects training data receives surprisingly little attention.

The Data Problem

Most large language models are trained on web scrapes filtered by simple heuristics. This approach has several issues:

- Quality variance: Web content ranges from expert research to spam
- Hidden biases: Filtering decisions embed value judgments
- Provenance opacity: It’s unclear what’s included or excluded
- Legal ambiguity: Copyright and consent questions remain unresolved

Our Approach: Transparent Curation

At Zen, we’re taking a different path....

July 22, 2021 · 2 min · 332 words · Zach Kelling