Architecture

Drop-Upcycling and the Birth of Zen MoDE Architecture

Mixture of Experts (MoE) is the architecture that makes trillion-parameter models economically viable. By routing each token through a small subset of expert networks rather than the full parameter set, MoE achieves large-model quality at dense-model inference cost. The problem: training an MoE from scratch is expensive; you pay for both the scale and the specialization overhead. Drop-Upcycling is a technique that converts a trained dense checkpoint into an MoE at roughly 1/4 the training cost of building the MoE from scratch....

February 28, 2026 · 7 min · 1443 words · Zen LM Team
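The routing idea in the excerpt above can be made concrete with a small sketch. This is not Zen's or Drop-Upcycling's actual implementation; it is a minimal NumPy illustration of top-k expert routing, with made-up dimensions, showing why per-token compute scales with the number of *active* experts k rather than the total expert count.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes (assumptions, not Zen's real configuration).
d_model, d_ff, n_experts, k = 16, 64, 8, 2
tokens = rng.normal(size=(5, d_model))

# Gate: score each token against every expert, keep only the top-k.
W_gate = rng.normal(size=(d_model, n_experts))
scores = tokens @ W_gate
top_idx = np.argsort(scores, axis=-1)[:, -k:]             # (tokens, k)
top_w = softmax(np.take_along_axis(scores, top_idx, -1))  # renormalized gate weights

# Experts: simple two-layer ReLU FFNs.
W1 = rng.normal(size=(n_experts, d_model, d_ff))
W2 = rng.normal(size=(n_experts, d_ff, d_model))

out = np.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for slot in range(k):
        e = top_idx[t, slot]
        h = np.maximum(tokens[t] @ W1[e], 0.0)  # run only expert e's FFN
        out[t] += top_w[t, slot] * (h @ W2[e])

# Each token touched k expert FFNs, not all n_experts of them.
total_ffn = n_experts * (d_model * d_ff + d_ff * d_model)
active_ffn = k * (d_model * d_ff + d_ff * d_model)
print(total_ffn, active_ffn)  # 16384 vs 4096 weights per token
```

With 8 experts but k = 2 active, the layer holds 8x the FFN parameters of a dense block while each token pays for only 2 of them, which is the economics the post refers to.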

Zen MoDE: Mixture of Distilled Experts

All Zen models are built on Zen MoDE: Mixture of Distilled Experts. This post explains the architecture, why we chose it, and how distillation and expert routing interact to deliver frontier capability at practical inference cost. The core problem is a fundamental tension in large model design: more parameters mean better capability, but also higher inference cost. Dense scaling laws are well established: doubling parameters roughly halves perplexity (with sufficient data), but doubles inference FLOPs....

January 16, 2026 · 4 min · 814 words · Zen LM Team