Drop-Upcycling and the Birth of Zen MoDE Architecture
DROP-UPCYCLING PAPER ZEN MODELS ZEN CODE Mixture of Experts (MoE) is the architecture that makes trillion-parameter models economically viable. By routing each token through a small subset of expert networks rather than the full parameter set, MoE achieves large-model quality at dense-model inference cost. The problem: training an MoE from scratch is expensive. You are paying for both the scale and the specialization overhead. Drop-Upcycling is a technique that converts a trained dense checkpoint into an MoE at roughly 1/4 the training cost of building the MoE from scratch....