Zen Agentic Dataset
8.47 Billion Tokens of Real-World Agentic Programming
A comprehensive training dataset combining agentic programming interactions with full git history from 1,400+ repositories spanning 15 years of professional development across AI, Web3, cryptography, and modern software engineering.
Quick Stats
- Total Tokens: 8.47B
- Training Samples
- Validation Samples
- Total Size
- Repositories: 1,400+
- Time Span (2010-2025): 15 years
Data Composition
| Component | Tokens | Percentage |
|---|---|---|
| Git History | 4.03B | 48% |
| Agentic Debug Sessions | 2.42B | 29% |
| Architecture Discussions | 1.14B | 13% |
| Code Review Sessions | 0.86B | 10% |
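The percentages in the table can be reproduced from the token counts. The sketch below assumes each percentage is taken relative to the 8.47B headline total and rounded to a whole number; the names `COMPOSITION_B`, `HEADLINE_TOTAL_B`, and `shares` are illustrative and not part of any published tooling. Note that the rounded component counts add up to 8.45B, slightly under the headline figure, which is presumably a rounding effect in the per-component numbers.

```python
# Token counts in billions, copied from the composition table above.
COMPOSITION_B = {
    "Git History": 4.03,
    "Agentic Debug Sessions": 2.42,
    "Architecture Discussions": 1.14,
    "Code Review Sessions": 0.86,
}

# Headline total from the title (assumed to be the denominator for the table).
HEADLINE_TOTAL_B = 8.47


def shares(components, total):
    """Return each component's share of `total` as a whole-number percentage."""
    return {name: round(100 * tokens / total) for name, tokens in components.items()}


# Sum of the per-component counts, rounded to the table's two decimals.
component_sum = round(sum(COMPOSITION_B.values()), 2)
percentages = shares(COMPOSITION_B, HEADLINE_TOTAL_B)

print(f"Sum of components: {component_sum}B")
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The same shares could serve as starting sampling weights when mixing the four components during training, although the dataset card does not say how the components are actually mixed.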
Domain Coverage
Agentic AI & LLM Infrastructure
- Model Context Protocol (MCP) - 260+ tools
- Multi-agent orchestration
- Agent frameworks - planning, memory, reflection
- LLM Gateway - proxy for 100+ providers
Web3 & Blockchain
- Smart contracts - Solidity, Vyper
- Consensus engines - Snow family, BFT, DAG
- Cross-chain bridges
- DeFi protocols - AMMs, lending, staking
Cryptography & Security
- Post-quantum - Kyber, Dilithium, SPHINCS+
- Threshold cryptography - MPC, DKG
- Zero-knowledge proofs
- Key management - HD wallets
Modern Development
- Full-stack TypeScript - Next.js, React
- Systems - Rust, Go, Python, C/C++
- DevOps - Docker, Kubernetes, CI/CD
- Real-time systems - Event sourcing, CQRS
Languages
| Tier 1 (Core) | Tier 2 (Infrastructure) | Tier 3 (Specialized) |
|---|---|---|
| Python | SQL | Solidity |
| TypeScript | Bash/Shell | C/C++ |
| JavaScript | YAML/TOML | Protobuf |
| Rust | Dockerfile | GraphQL |
| Go | Makefile | Move |
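The tier table can be expressed as a small lookup, which may be convenient for filtering or reweighting samples by language. This is a minimal sketch: it assumes samples carry some language label, which this card does not document, and `LANGUAGE_TIERS` and `tier_of` are hypothetical names introduced for illustration.

```python
# Language tiers copied from the table above (1 = Core, 2 = Infrastructure, 3 = Specialized).
LANGUAGE_TIERS = {
    1: ["Python", "TypeScript", "JavaScript", "Rust", "Go"],
    2: ["SQL", "Bash/Shell", "YAML/TOML", "Dockerfile", "Makefile"],
    3: ["Solidity", "C/C++", "Protobuf", "GraphQL", "Move"],
}


def tier_of(language):
    """Return the tier number for a language name, or None if it is not listed.

    Matching is case-insensitive. Combined entries such as "C/C++" match
    either the full entry or one of its halves ("C" or "C++").
    """
    wanted = language.strip().lower()
    for tier, names in LANGUAGE_TIERS.items():
        for name in names:
            entry = name.lower()
            if wanted == entry or wanted in entry.split("/"):
                return tier
    return None


print(tier_of("Rust"))
print(tier_of("C++"))
print(tier_of("toml"))
print(tier_of("Haskell"))
```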
Models Trained on This Dataset
| Model | Size | Architecture | Status |
|---|---|---|---|
| Zen Coder 4B | 4B | Qwen3 | Trained |
| Zen Coder 24B | 24B | Devstral Small 2 | Trained |
| Zen Coder 123B | 123B | Devstral 2 | Training |
| Zen Coder Max | 358B | GLM-4.7 (MoE) | Planned |
| Zen Coder Ultra | 1T | Kimi K2 (MoE) | Planned |
What Makes This Dataset Unique
Real Agentic Programming
Unlike synthetic datasets, this dataset contains actual agentic programming sessions that show real debugging workflows, multi-file refactoring decisions, architecture discussions, tool-use patterns, and error recovery.
Production Code Quality
Code that shipped to production systems. Security-audited smart contracts. Performance-optimized infrastructure. Battle-tested patterns from real deployments.
Access & Licensing
This dataset is available for research and commercial licensing.
For Developers & Researchers
We award grants to individuals and teams building:
- Models for specific blockchain ecosystems
- Open-source AI tools using OpenAI-compatible protocols
- Research advancing agentic AI capabilities
- Infrastructure for decentralized AI training
Request Access
Email: z@hanzo.ai
Please include:
- Intended use case (training, research, evaluation)
- Organization/affiliation
- Target ecosystem (if applicable)
- Licensing requirements
Training Framework
Use zen-trainer for fine-tuning:
```python
from zen_trainer import ZenTrainer

trainer = ZenTrainer(
    model_key="qwen3-4b",
    dataset_path="hanzoai/zen-agentic-dataset-private",  # Requires access
    output_dir="./output/my-model",
)
trainer.train()
```
Supported Organizations
| Organization | Focus | Role |
|---|---|---|
| Hanzo AI | AI infrastructure | Primary maintainer |
| Zen LM | Open model research | Model training |
| Zoo Labs | Decentralized AI | Research grants |
| Lux Network | AI compute settlement | Infrastructure |