Zen Agentic Dataset

8.47 billion tokens of real-world agentic programming data.

Quick Stats

| Metric | Value |
| --- | --- |
| Total Tokens | 8.47 billion |
| Training Samples | 3.35 million |
| Validation Samples | 100,000 |
| Total Size | 27 GB |
| Repositories | 1,452 |
| Time Span | 15 years (2010-2025) |

Data Composition

| Component | Tokens | Percentage |
| --- | --- | --- |
| Git History | 4.03B | 48% |
| Agentic Debug Sessions | 2.42B | 29% |
| Architecture Discussions | 1.14B | 13% |
| Code Review Sessions | 0.86B | 10% |
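
The percentages above can be reproduced from the token counts. The following is a minimal sketch, assuming each share is computed against the 8.47B headline total from Quick Stats and rounded to the nearest whole percent; the variable names are illustrative only. Note that the four component figures sum to 8.45B, and the 0.02B gap to the headline total is presumably a rounding artifact of the per-component values (an assumption, not something the page states).

```python
# Token counts in billions, copied from the Data Composition table.
components = {
    "Git History": 4.03,
    "Agentic Debug Sessions": 2.42,
    "Architecture Discussions": 1.14,
    "Code Review Sessions": 0.86,
}
HEADLINE_TOTAL = 8.47  # billions of tokens, from Quick Stats

# Share of each component, rounded to the nearest whole percent.
shares = {
    name: round(100 * tokens / HEADLINE_TOTAL)
    for name, tokens in components.items()
}

# Sum of the (already rounded) component counts, for comparison with the headline.
component_total = round(sum(components.values()), 2)

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Sum of components: {component_total}B (headline: {HEADLINE_TOTAL}B)")
```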

Domain Coverage

Agentic AI & LLM Infrastructure

  • Model Context Protocol (MCP) - 260+ tool implementations
  • Multi-agent orchestration
  • Agent frameworks - planning, memory, reflection
  • LLM gateway - proxy for 100+ providers

Web3 & Blockchain

  • Smart contracts - Solidity, Vyper
  • Consensus engines - Snow family, BFT, DAG
  • Cross-chain bridges
  • DeFi protocols - AMMs, lending, staking

Cryptography & Security

  • Post-quantum - Kyber, Dilithium, SPHINCS+
  • Threshold cryptography - MPC, DKG
  • Zero-knowledge proofs
  • Key management - HD wallets

Modern Development

  • Full-stack TypeScript - Next.js, React
  • Systems - Rust, Go, Python, C/C++
  • DevOps - Docker, Kubernetes, CI/CD
  • Real-time systems - Event sourcing, CQRS

Languages

| Tier 1 (Core) | Tier 2 (Infrastructure) | Tier 3 (Specialized) |
| --- | --- | --- |
| Python | SQL | Solidity |
| TypeScript | Bash/Shell | C/C++ |
| JavaScript | YAML/TOML | Protobuf |
| Rust | Dockerfile | GraphQL |
| Go | Makefile | Move |

What Makes This Unique

Real Agentic Programming

Unlike synthetic datasets, this dataset contains real agentic programming sessions that show:

  • Real debugging workflows
  • Multi-file refactoring decisions
  • Architecture discussions
  • Tool use patterns
  • Error recovery

Production Code Quality

  • Code that shipped to production systems
  • Security-audited smart contracts
  • Performance-optimized infrastructure
  • Battle-tested patterns from real deployments

Access & Licensing

This dataset is available for research and commercial licensing.

Request Access

Email: z@hanzo.ai

Please include:

  • Intended use case (training, research, evaluation)
  • Organization/affiliation
  • Target ecosystem (if applicable)
  • Licensing requirements

HuggingFace

View the public preview: hanzoai/zen-agentic-dataset
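
If the public preview is published as a standard Hugging Face dataset repository, it may be possible to peek at a few records without downloading everything. The sketch below assumes that `hanzoai/zen-agentic-dataset` is a dataset repo ID, that a `train` split exists, and that the repo is readable without extra authentication; none of this is confirmed by the page. `preview_records` and its `loader` parameter are hypothetical names introduced here for illustration.

```python
from itertools import islice

# Assumed to be a Hugging Face dataset repo ID; the page only names it as the public preview.
REPO_ID = "hanzoai/zen-agentic-dataset"


def preview_records(n=3, split="train", loader=None):
    """Return the first n records of the preview as a list of dicts.

    `split="train"` is an assumption; adjust it to whatever the repo exposes.
    `loader` exists only so the function can be exercised without network access.
    """
    if loader is None:
        # Requires `pip install datasets`.
        from datasets import load_dataset
        loader = load_dataset
    # streaming=True iterates over records lazily instead of downloading all files.
    stream = loader(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `preview_records()` and printing `sorted(record.keys())` for each result is a quick way to discover the actual field names, which this page does not document.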
