The Zen model family has a deployment problem that is not immediately obvious from the outside. We publish 14+ distinct model variants — from zen-nano at 0.6B parameters to zen4-ultra at 1.04T. Each variant carries fine-tuned behavioral characteristics: different personas, different task specializations, different safety postures. In a naive serving architecture, each variant is a separate set of weights. Loading all of them onto a GPU cluster is economically impossible.
BitDelta is how we solve this. It compresses the behavioral delta between a base model and a fine-tuned variant down to 1-bit precision, reducing the per-variant memory cost by 16-32x while retaining 99.3% of full-precision behavioral accuracy.
The Multi-Variant Deployment Problem
Consider the economics concretely. A single zen4-ultra shard (1.04T parameters, bfloat16) requires roughly 2TB of GPU memory. Even a single full-precision variant of zen-max (72B) requires ~144GB. Across the variants in our model catalog:
| Tier | Parameters | Full Precision (BF16) | Variants | Total |
|---|---|---|---|---|
| nano | 0.6B | 1.2 GB | 4 | 4.8 GB |
| eco / coder-4b | 4B | 8 GB | 3 | 24 GB |
| zen4-max | 30B | 60 GB | 3 | 180 GB |
| zen-max | 72B | 144 GB | 2 | 288 GB |
| zen4-ultra | 1.04T | ~2 TB | 1 | ~2 TB |
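Each cell follows directly from parameters × 2 bytes for BF16: zen-max, for example, is 72B × 2 bytes ≈ 144 GB per variant, or 288 GB for its two variants.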
Keeping all of these “hot” simultaneously is not feasible. Cold-loading from object storage introduces latency spikes that make the service unusable. We need a different architecture.
The key observation: most variants share an identical base model. The behavioral differences — the fine-tuned identity, the task specialization, the adjusted refusal boundaries — live in the delta between fine-tuned weights and base weights. If we can compress that delta aggressively, we can keep only the base model fully loaded and reconstruct any variant on the fly.
BitDelta Theory
Paper: arXiv:2402.10193
BitDelta decomposes a fine-tuned weight matrix as:
W_ft = W_base + Δ
and approximates the delta with 1-bit quantization:
Δ ≈ α · sign(Δ)
where the scale factor α is the mean absolute value of the delta entries:
α = (1/n) Σ |Δ_ij|
This is a single scalar per weight matrix. The sign matrix is 1 bit per element. Total storage for the delta: n bits + 1 float32. For a 4096×4096 weight matrix, that is ~33.5 MB of bfloat16 delta → ~2.1 MB, a 16x reduction. For the full zen-max 72B delta, the storage requirement drops from ~144GB to ~9GB.
Why does 1-bit sign quantization work? The delta values in fine-tuned LLMs follow a near-Laplace distribution centered at zero. The signs carry the directional information; the scale α captures the magnitude. The residual error:
ε = Δ - α · sign(Δ)
has bounded expected squared norm:
E[||ε||²] ≤ (1 - 2/π) · ||Δ||² ≈ 0.36 · ||Δ||²
In practice (and this is the empirical surprise), the effective error on model outputs is far smaller than this bound suggests, because the residuals are uncorrelated with the task-relevant signal directions. The model’s behavioral accuracy degrades gracefully rather than catastrophically.
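As a sanity check on that bound, a few lines of PyTorch reproduce the ≈0.36 factor on a synthetic zero-mean Gaussian delta (an illustrative sketch, not a measurement on real Zen checkpoints):

```python
import torch

# Synthetic Gaussian delta for illustration; real fine-tuning deltas are
# described above as near-Laplace, but the Gaussian case matches the 1 - 2/pi factor.
delta = torch.randn(4096, 4096) * 1e-3

alpha = delta.abs().mean()               # per-matrix scale, as defined above
residual = delta - alpha * delta.sign()  # epsilon = delta - alpha * sign(delta)

ratio = (residual.pow(2).sum() / delta.pow(2).sum()).item()
print(f"||eps||^2 / ||delta||^2 = {ratio:.3f}")  # ~0.363 = 1 - 2/pi
```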
Implementation: Fused CUDA Kernel
The critical implementation detail is efficiency. Reconstructing W_ft = W_base + α · sign(Δ) at inference time must not add meaningful latency. Our CUDA kernel fuses three operations:
- Load sign bits from compressed storage (1-bit tensor, integer packing)
- Unpack and scale: `delta_row = alpha * (sign_bits.float() * 2 - 1)`
- Add the scaled delta to the base weight tile in shared memory before the GEMM
The result: delta reconstruction adds less than 1ms overhead per forward pass on an A100. In practice the overhead is dominated by memory bandwidth to load the sign bits, which at 1/16th the size of the base weight tensor is negligible.
import torch


def compress_delta(W_ft: torch.Tensor, W_base: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Compress fine-tuned weight delta to 1-bit + scale."""
    delta = W_ft - W_base
    alpha = delta.abs().mean()
    sign_bits = (delta > 0).to(torch.uint8)  # 1 = positive, 0 = negative
    return sign_bits, alpha


def reconstruct_weight(W_base: torch.Tensor, sign_bits: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Reconstruct fine-tuned weight from base + compressed delta."""
    signs = sign_bits.float() * 2 - 1  # map {0,1} → {-1,+1}
    return W_base + alpha * signs


def memory_savings(d_out: int, d_in: int) -> dict:
    """Compare memory usage: full delta vs BitDelta."""
    full_bytes = d_out * d_in * 2  # bfloat16
    bitdelta_bytes = d_out * d_in // 8 + 4  # 1-bit signs + float32 scale
    return {
        'full_delta_mb': full_bytes / 1e6,
        'bitdelta_mb': bitdelta_bytes / 1e6,
        'compression_ratio': full_bytes / bitdelta_bytes,
    }
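The reference code above stores one sign per uint8 byte for clarity; the serving path consumes bit-packed signs (the "integer packing" mentioned earlier). A minimal host-side sketch of that packing using NumPy's packbits (the fused CUDA unpacking itself is not shown):

```python
import numpy as np
import torch

def pack_signs(sign_bits: torch.Tensor) -> bytes:
    """Pack a uint8 {0,1} sign tensor into 1 bit per element (row-major)."""
    return np.packbits(sign_bits.cpu().numpy().reshape(-1)).tobytes()

def unpack_signs(data: bytes, shape: tuple[int, int]) -> torch.Tensor:
    """Inverse of pack_signs; `count` trims the padding bits in the final byte."""
    n = shape[0] * shape[1]
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8), count=n)
    return torch.from_numpy(bits.copy()).reshape(shape)
```

The 8x packing is what brings the 72B delta down to ~9 GB; on the GPU, the fused kernel performs the equivalent unpacking inside the GEMM tile loop instead of materializing a full sign tensor.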
Quality Results
We evaluated BitDelta across five Zen variants against their full-precision counterparts:
| Model | Task | Full Precision | BitDelta | Retention |
|---|---|---|---|---|
| zen-nano | MMLU | 61.3 | 60.8 | 99.2% |
| zen4-max | HumanEval | 74.1 | 73.5 | 99.1% |
| zen4-pro | GSM8K | 88.4 | 87.9 | 99.4% |
| zen-max | GPQA | 71.2 | 70.6 | 99.2% |
| zen4-ultra | AIME 2024 | 94.7 | 93.6 | 98.8% |
Average behavioral retention: 99.3%. The 0.7% average degradation is below the noise floor of our human preference evaluations — users cannot reliably distinguish BitDelta variants from full-precision variants in blind A/B tests.
MonoSoup: SVD Fallback for Weak Checkpoints
Paper: arXiv:2602.09689
BitDelta works well when the delta is well-behaved (small, distributed, near-Laplace). Some fine-tuned checkpoints — particularly those from aggressive few-shot fine-tuning or noisy datasets — produce deltas that are large and spiky. In these cases, 1-bit quantization introduces perceptible degradation.
MonoSoup provides a complementary approach: instead of compressing the delta, decompose the full fine-tuned weight via SVD and keep only the top-k singular triplets:
W_ft ≈ U_k Σ_k V_k^T
where k is chosen to keep 95% of the Frobenius norm. This is not a delta compression technique — it operates on the single fine-tuned checkpoint directly. But for weak checkpoints where BitDelta degrades, MonoSoup recovers up to 8% of the lost behavioral accuracy at comparable memory cost.
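A minimal sketch of that rank selection, assuming the 95% Frobenius-norm criterion above (the function names are ours for illustration, not the paper's API):

```python
import torch

def monosoup_truncate(W_ft: torch.Tensor, norm_fraction: float = 0.95):
    """Keep the smallest k with ||W_k||_F >= norm_fraction * ||W_ft||_F."""
    U, S, Vh = torch.linalg.svd(W_ft, full_matrices=False)
    # Cumulative fraction of squared Frobenius norm captured by the top-i triplets.
    energy = S.pow(2).cumsum(0) / S.pow(2).sum()
    k = int((energy < norm_fraction ** 2).sum().item()) + 1
    return U[:, :k], S[:k], Vh[:k, :]

def reconstruct_lowrank(U_k: torch.Tensor, S_k: torch.Tensor, Vh_k: torch.Tensor) -> torch.Tensor:
    """W_ft ~ U_k diag(S_k) V_k^T."""
    return (U_k * S_k) @ Vh_k  # broadcasting scales the columns of U_k
```

Because ||W_k||_F squared is the sum of the kept squared singular values, keeping 95% of the norm corresponds to keeping roughly 90% of the squared-norm "energy", which is what the threshold above encodes.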
In our pipeline we try BitDelta first. If behavioral retention falls below 98.5% on our internal benchmark suite, we fall back to MonoSoup, with k calibrated to the available memory budget.
K-Merge: Edge Adapter Management
Paper: arXiv:2510.13537
The cloud serving stack above does not address edge deployment. A local user running zen-nano on a 16GB laptop cannot afford a delta cache for 14 variants — even at 1-bit compression, storing all nano variants would consume significant RAM.
K-Merge addresses this with an online LoRA adapter pool under fixed storage budget. The algorithm maintains a priority queue of adapters scored by utility:
utility(adapter_i) = request_frequency(i) × behavioral_gain(i) / storage_cost(i)
When the budget is exceeded, the lowest-utility adapter is evicted. Utility scores are updated online using exponential decay, so recently used adapters are preferred over historical ones.
For a 16GB laptop with 4GB allocated to the adapter pool, K-Merge keeps 6-8 zen-nano variants hot simultaneously, with eviction latency of ~200ms to load a new adapter from local disk.
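A sketch of that eviction policy, with assumed names (AdapterPool, a half-life decay constant) standing in for whatever the reference implementation actually uses:

```python
import time
from dataclasses import dataclass, field

@dataclass
class AdapterStats:
    freq: float = 0.0        # exponentially decayed request count
    gain: float = 1.0        # measured behavioral gain over the base model
    cost_mb: float = 1.0     # storage cost in the pool
    last_seen: float = field(default_factory=time.time)

class AdapterPool:
    """Fixed-budget adapter pool with utility-based eviction (illustrative)."""

    def __init__(self, budget_mb: float, half_life_s: float = 3600.0):
        self.budget_mb = budget_mb
        self.half_life_s = half_life_s
        self.adapters: dict[str, AdapterStats] = {}
        self.used_mb = 0.0

    def _utility(self, name: str) -> float:
        # utility = decayed request frequency x behavioral gain / storage cost
        s = self.adapters[name]
        decayed = s.freq * 0.5 ** ((time.time() - s.last_seen) / self.half_life_s)
        return decayed * s.gain / s.cost_mb

    def record_request(self, name: str) -> None:
        s = self.adapters[name]
        s.freq = s.freq * 0.5 ** ((time.time() - s.last_seen) / self.half_life_s) + 1.0
        s.last_seen = time.time()

    def admit(self, name: str, gain: float, cost_mb: float) -> list[str]:
        """Load a new adapter; evict lowest-utility adapters until within budget."""
        self.adapters[name] = AdapterStats(freq=1.0, gain=gain, cost_mb=cost_mb)
        self.used_mb += cost_mb
        evicted: list[str] = []
        while self.used_mb > self.budget_mb and len(self.adapters) > 1:
            victim = min((n for n in self.adapters if n != name), key=self._utility)
            self.used_mb -= self.adapters.pop(victim).cost_mb
            evicted.append(victim)
        return evicted
```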
Full Zen Serving Stack
┌──────────────────────────────┐
│ Request Router │
│ (model ID → variant key) │
└────────────┬─────────────────┘
│
┌────────────▼─────────────────┐
│ Delta Cache (Redis) │
│ sign_bits + alpha per layer │
│ ~9 GB per 72B variant │
└────────────┬─────────────────┘
│ cache hit
┌────────────▼─────────────────┐
│ Shared Base Model (BF16) │
│ zen-max 72B: 144 GB on A100s │
│ zen-nano 0.6B: 1.2 GB │
└────────────┬─────────────────┘
│
┌────────────▼─────────────────┐
│ Fused Reconstruction Kernel │
│ W_ft = W_base + α·sign(Δ) │
│ < 1ms overhead per layer │
└────────────┬─────────────────┘
│
┌────────────▼─────────────────┐
│ Inference Engine │
│ (vLLM with continuous batch) │
└──────────────────────────────┘
The architecture keeps one base model loaded per GPU cluster. All variants share it. The delta cache fits in Redis (NVMe-backed), loading on demand in under 50ms. In practice, our top-5 variants stay hot in Redis memory; the remaining variants load from NVMe on first request.
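For concreteness, the cache interaction looks roughly like this, assuming a standard redis-py client; the delta:{variant}:layer{n} key scheme is made up for illustration, and pack_signs/unpack_signs come from the packing sketch earlier:

```python
import redis
import torch

r = redis.Redis()  # the NVMe-backed instance from the diagram

def put_layer_delta(variant: str, layer: int, sign_bits: torch.Tensor, alpha: torch.Tensor) -> None:
    """Store packed sign bits plus the float32 scale for one layer."""
    key = f"delta:{variant}:layer{layer}"
    r.set(key + ":signs", pack_signs(sign_bits))  # pack_signs: see packing sketch above
    r.set(key + ":alpha", float(alpha))

def get_layer_delta(variant: str, layer: int, shape: tuple[int, int]):
    """Fetch one layer's delta; returns (sign tensor in {0,1}, alpha)."""
    key = f"delta:{variant}:layer{layer}"
    signs = unpack_signs(r.get(key + ":signs"), shape)  # unpack_signs: see above
    alpha = float(r.get(key + ":alpha"))
    return signs, alpha
```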
GPU Memory Reduction
| Scenario | Without BitDelta | With BitDelta | Savings |
|---|---|---|---|
| 3× zen-nano variants | 3.6 GB | 1.4 GB | 61% |
| 5× zen4-max variants | 300 GB | 192 GB | 36% |
| 2× zen-max variants | 288 GB | 162 GB | 44% |
| Full 14-model catalog | ~2.8 TB | ~2.2 TB | 21% |
The savings are most dramatic at the smaller scales where we have many more behavioral variants. For the ultra-scale models (zen4-ultra 1T+), a single checkpoint dominates, and BitDelta’s contribution is smaller — but MonoSoup and K-Merge become more relevant for edge quantization.
The combination of BitDelta for cloud serving, MonoSoup for quality recovery, and K-Merge for edge devices gives us a coherent three-tier compression story across the full Zen catalog.
Zen LM is a joint initiative of Hanzo AI Inc. (Techstars ‘17) and Zoo Labs Foundation (501c3).