Zen Agentic Dataset

8.47 Billion Tokens of Real-World Agentic Programming

A training dataset that combines agentic programming sessions with the full git history of 1,452 repositories, covering 15 years (2010-2025) of professional development in AI, Web3, cryptography, and modern software engineering.

Quick Stats

  • Total Tokens - 8.47B
  • Training Samples - 3.35M
  • Validation Samples - 100K
  • Total Size - 27 GB
  • Repositories - 1,452
  • Time Span - 15 years (2010-2025)
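
As a rough sanity check on these figures, the sketch below derives the average tokens per sample and bytes per token. It assumes that the 8.47B token total and the 27 GB size both cover the training and validation splits combined, and that 1 GB means 10^9 bytes. Neither is stated here, so treat the results as order-of-magnitude estimates only.

```python
# Figures copied from the Quick Stats.
TOTAL_TOKENS = 8.47e9
TRAIN_SAMPLES = 3.35e6
VAL_SAMPLES = 100e3
TOTAL_BYTES = 27e9  # assumption: 1 GB = 10**9 bytes


def derived_stats():
    """Return rough per-sample and per-token averages for the dataset."""
    # Assumption: token and byte totals span both splits.
    total_samples = TRAIN_SAMPLES + VAL_SAMPLES
    return {
        "tokens_per_sample": TOTAL_TOKENS / total_samples,
        "bytes_per_token": TOTAL_BYTES / TOTAL_TOKENS,
        "validation_fraction": VAL_SAMPLES / total_samples,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the averages come out to roughly 2,500 tokens per sample and a little over 3 bytes per token, with validation making up about 3% of all samples.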

Data Composition

Component                  Tokens   Percentage
Git History                4.03B    48%
Agentic Debug Sessions     2.42B    29%
Architecture Discussions   1.14B    13%
Code Review Sessions       0.86B    10%
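
The percentages can be reproduced from the token counts. The short sketch below recomputes each component's share, rounded to a whole percent. Note that the four component counts sum to 8.45B, slightly below the 8.47B headline figure, presumably because each count is rounded to two decimals.

```python
# Token counts in billions, copied from the Data Composition table.
COMPONENT_TOKENS_B = {
    "Git History": 4.03,
    "Agentic Debug Sessions": 2.42,
    "Architecture Discussions": 1.14,
    "Code Review Sessions": 0.86,
}


def composition_shares(tokens_by_component):
    """Return each component's share of the summed tokens as a whole percent."""
    total = sum(tokens_by_component.values())
    return {
        name: round(100 * count / total)
        for name, count in tokens_by_component.items()
    }


if __name__ == "__main__":
    for name, share in composition_shares(COMPONENT_TOKENS_B).items():
        print(f"{name}: {share}%")
```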

Domain Coverage

🤖 Agentic AI & LLM Infrastructure

  • Model Context Protocol (MCP) - 260+ tools
  • Multi-agent orchestration
  • Agent frameworks - planning, memory, reflection
  • LLM Gateway - 100+ provider proxy

⛓️ Web3 & Blockchain

  • Smart contracts - Solidity, Vyper
  • Consensus engines - Snow family, BFT, DAG
  • Cross-chain bridges
  • DeFi protocols - AMMs, lending, staking

🔐 Cryptography & Security

  • Post-quantum - Kyber, Dilithium, SPHINCS+
  • Threshold cryptography - MPC, DKG
  • Zero-knowledge proofs
  • Key management - HD wallets

💻 Modern Development

  • Full-stack TypeScript - Next.js, React
  • Systems - Rust, Go, Python, C/C++
  • DevOps - Docker, Kubernetes, CI/CD
  • Real-time systems - Event sourcing, CQRS

Languages

Tier 1 (Core)   Tier 2 (Infrastructure)   Tier 3 (Specialized)
Python          SQL                       Solidity
TypeScript      Bash/Shell                C/C++
JavaScript      YAML/TOML                 Protobuf
Rust            Dockerfile                GraphQL
Go              Makefile                  Move

Models Trained on This Dataset

Model             Size   Architecture       Status
Zen Coder 4B      4B     Qwen3              Trained
Zen Coder 24B     24B    Devstral Small 2   Trained
Zen Coder 123B    123B   Devstral 2         Training
Zen Coder Max     358B   GLM-4.7 (MoE)      Planned
Zen Coder Ultra   1T     Kimi K2 (MoE)      Planned

What Makes This Dataset Unique

Real Agentic Programming

Unlike synthetic datasets, this one contains recorded agentic programming sessions: debugging workflows, multi-file refactoring decisions, architecture discussions, tool-use patterns, and error recovery.

Production Code Quality

The code in this dataset shipped to production systems: security-audited smart contracts, performance-optimized infrastructure, and patterns proven in real deployments.

Access & Licensing

This dataset is available for research and commercial licensing.

For Developers & Researchers

We award grants to individuals and teams working on:

  • Models for specific blockchain ecosystems
  • Open-source AI tools using OpenAI-compatible protocols
  • Research advancing agentic AI capabilities
  • Infrastructure for decentralized AI training

Request Access

Email: z@hanzo.ai

Please include:

  • Intended use case (training, research, evaluation)
  • Organization/affiliation
  • Target ecosystem (if applicable)
  • Licensing requirements

Training Framework

Use zen-trainer for fine-tuning:

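from zen_trainer import ZenTrainer

# Configure a fine-tuning run.
trainer = ZenTrainer(
    model_key="qwen3-4b",  # Base model to fine-tune
    dataset_path="hanzoai/zen-agentic-dataset-private",  # Requires access
    output_dir="./output/my-model",  # Directory for training outputs
)
trainer.train()  # Start fine-tuning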
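
Before launching a run, it can help to estimate how many optimizer steps one pass over the dataset takes. The sketch below is independent of zen-trainer, whose configuration options are not documented here. The sequence length, the number of sequences per step, and the assumption that samples are packed into full sequences are all hypothetical values chosen for illustration.

```python
import math

# Total from this page; assumes every token is used for training.
DATASET_TOKENS = 8.47e9


def steps_per_epoch(total_tokens, seq_len, sequences_per_step):
    """Optimizer steps for one pass, assuming samples are packed into full sequences."""
    tokens_per_step = seq_len * sequences_per_step
    return math.ceil(total_tokens / tokens_per_step)


if __name__ == "__main__":
    # Hypothetical settings: 8,192-token sequences, 64 sequences per optimizer step.
    print(steps_per_epoch(DATASET_TOKENS, seq_len=8192, sequences_per_step=64))
```

With unpacked or padded samples the step count would be higher, so this is a lower bound under the stated assumptions.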

Supported Organizations

Organization   Focus                   Role
Hanzo AI       AI infrastructure       Primary maintainer
Zen LM         Open model research     Model training
Zoo Labs       Decentralized AI        Research grants
Lux Network    AI compute settlement   Infrastructure