
Zen4 Ultra is the most capable model in the Zen4 family. It is a Mixture of Distilled Experts model with 480B total parameters and 35B active parameters per forward pass. The native context window is 256K tokens, extending to 1M tokens with YaRN extrapolation.

Architecture

Property                       Value
Total parameters               480B
Active parameters per token    35B
Experts per layer              128
Top-k routing                  8
Context window (native)        256K
Context window (YaRN)          1M
Vocabulary size                151,936
Attention heads                64
KV heads (GQA)                 8
Layers                         94
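
The 35B active-parameter figure follows from the routing above: each token is dispatched to only 8 of the 128 experts per layer, so most weights sit idle on any single forward pass. A minimal sketch of that top-k gating step (pure Python, illustrative only, not the production router):

```python
import math
import random

def route(logits, k=8):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]  # softmax over the selected k
    s = sum(exps)
    return [(i, e / s) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]  # one token's router scores
assignment = route(logits, k=8)
print(len(assignment))  # 8 experts active for this token
```

The gate weights sum to 1 and scale each selected expert's output before the results are combined.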

Benchmark Results

General Reasoning

Benchmark        Zen4 Ultra   Zen Max 72B
MMLU             89.4         87.1
MMLU-Pro         75.2         71.8
ARC-Challenge    72.1         68.4
HellaSwag        92.3         90.1
Winogrande       87.6         85.2

Mathematics

Benchmark    Zen4 Ultra   Zen Max 72B
MATH         81.4         73.2
GSM8K        95.3         92.1
AMC 2023     62.4         54.7
AIME 2024    48.2         37.6

Code

Benchmark            Zen4 Ultra   Zen Max 72B
HumanEval            91.2         82.4
MBPP                 87.6         81.3
LiveCodeBench        52.4         44.1
SWE-bench Verified   45.7         38.2

Long Context

Task               Score at 32K   Score at 128K   Score at 512K
NIAH recall        99.1%          98.4%           94.7%
Summarization      48.2           46.9            43.1
QA over long doc   74.3           71.2            64.8

Long-context performance remains strong through 512K tokens, with graceful degradation thereafter.
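
The NIAH (needle-in-a-haystack) recall numbers above come from probes that bury a single retrievable fact at varying depths of filler context. A rough sketch of how such a probe is built; the filler text, needle wording, and question are arbitrary placeholders, not the actual evaluation harness:

```python
def build_niah_prompt(needle: str, depth: float, n_filler: int = 2000) -> str:
    """Bury `needle` at a relative depth (0.0 = start, 1.0 = end) of filler text."""
    filler = ["The sky was clear and the market opened without incident."] * n_filler
    filler.insert(int(depth * len(filler)), needle)
    context = " ".join(filler)
    return context + "\n\nWhat is the secret number mentioned above?"

# Place the needle at the midpoint of the context.
prompt = build_niah_prompt("The secret number is 7421.", depth=0.5)
```

Recall is then scored by checking whether the model's answer contains the needle's value, swept over depths and context lengths.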

Multilingual

Evaluated on 30 languages across MMMLU:

Language Group                 Score
Latin script (high-resource)   86.4
Latin script (low-resource)    72.1
CJK                            81.3
Arabic/Hebrew                  76.8
Other non-Latin                68.2

Use Cases

Complex Research and Analysis

Zen4 Ultra excels at tasks requiring synthesis across long documents:

  • Analyzing regulatory filings spanning hundreds of pages
  • Cross-referencing scientific literature for systematic reviews
  • Multi-document legal analysis with citation tracking
  • Financial model analysis with full spreadsheet context

The 1M token context allows loading entire codebases, large document sets, or extended conversation histories without truncation.
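
Reaching the 1M-token window typically means enabling YaRN-style rope scaling at load time rather than relying on the native 256K configuration. The snippet below is a config fragment only; the key names follow the Transformers rope-scaling convention, and the factor and original_max_position_embeddings values are illustrative assumptions, not published values for this model. Check the model's config.json for the real settings.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("hanzoai/zen4-ultra")
config.rope_scaling = {
    "rope_type": "yarn",   # older Transformers versions use the key "type"
    "factor": 4.0,         # assumed: 256K native -> ~1M extended
    "original_max_position_embeddings": 262144,  # assumed native window
}
# Pass `config=config` to AutoModelForCausalLM.from_pretrained to apply it.
```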

Multi-Step Reasoning

For problems requiring planning and backtracking, such as competitive math, logic puzzles, and complex software architecture decisions, Ultra's depth provides a measurable advantage over smaller models.

Agentic Workflows

Ultra’s function calling reliability is critical for long-running agent tasks:

  • SWE-bench Verified: 45.7% (full-repo software engineering tasks)
  • Tool selection accuracy: 94.2% on held-out tool-use evaluation
  • Multi-turn instruction adherence: 91.8%
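
The wire format Zen4 uses for tool calls is not specified here, so the JSON shape below is an assumption; the dispatch pattern itself (execute the model-issued call, append a tool-role message, continue the conversation) is the standard agent loop those metrics exercise:

```python
import json

def dispatch(tool_call, registry):
    """Execute one model-issued tool call and wrap the result as a tool message."""
    fn = registry[tool_call["name"]]
    result = fn(**tool_call["arguments"])
    return {"role": "tool", "name": tool_call["name"], "content": json.dumps(result)}

# Hypothetical tool registry; get_weather is a stand-in, not a Zen4 built-in.
registry = {"get_weather": lambda city: {"city": city, "temp_c": 21}}

call = {"name": "get_weather", "arguments": {"city": "Kyoto"}}
msg = dispatch(call, registry)
print(msg["role"])  # tool
```

In a real agent loop, `msg` is appended to the conversation and the model is invoked again to interpret the tool result.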

Code Generation

Zen4 Ultra delivers near-human performance on competitive programming tasks and generates complete, working implementations of complex algorithms across major programming languages.

Running Zen4 Ultra

vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="hanzoai/zen4-ultra",
    tensor_parallel_size=8,   # 8x H100 80GB
    max_model_len=131072,
)

outputs = llm.generate(
    ["Explain the Zen MoDE architecture in detail."],
    SamplingParams(temperature=0.7, max_tokens=2048),
)

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "hanzoai/zen4-ultra",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("hanzoai/zen4-ultra")

messages = [{"role": "user", "content": "Solve: find all integer solutions to x^3 + y^3 = z^3."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Hardware Requirements

Configuration    VRAM      Throughput
8x H100 80GB     640GB     ~2,400 tok/s
16x A100 80GB    1,280GB   ~1,100 tok/s
32x A100 40GB    1,280GB   ~600 tok/s

For cost-sensitive production use cases, Zen Max 72B delivers most of Ultra’s capability at a fraction of the compute.

License

Apache-2.0. Commercial use permitted. No royalty or usage fees.


Zen4 Ultra is available now on Hugging Face. For API access, see hanzo.ai.