Training large AI models requires significant compute resources. These resources are concentrated in a few hyperscalers, creating bottlenecks and single points of control. Today we announce the Zoo Compute Network, a decentralized alternative.
## The Compute Concentration Problem
Current AI training is dominated by:
- Cloud providers: AWS, GCP, and Azure control most AI-grade compute
- Hardware scarcity: H100s have year-long waitlists
- High costs: Training GPT-4-class models costs $100M+
- Geographic concentration: Most clusters are in a few regions
This concentration creates risks:
- Access barriers: Only well-funded organizations can train frontier models
- Single points of failure: Outages affect entire research programs
- Regulatory exposure: One jurisdiction can impact global AI development
- Vendor lock-in: Switching costs are enormous
## The Zoo Compute Network
The Zoo Compute Network aggregates distributed GPU resources into a unified training substrate. Anyone with suitable hardware can contribute. Anyone can access the aggregated compute.
### Architecture
```
+------------------+    +------------------+    +------------------+
|  Compute Node 1  |    |  Compute Node 2  |    |  Compute Node N  |
|    (8x H100)     |    |    (4x A100)     |    |    (16x H100)    |
+--------+---------+    +--------+---------+    +--------+---------+
         |                       |                       |
         v                       v                       v
+----------------------------------------------------------------+
|                       Coordination Layer                       |
|  - Task scheduling                                             |
|  - Gradient aggregation                                        |
|  - Checkpoint management                                       |
|  - Payment settlement                                          |
+----------------------------------------------------------------+
                                |
                                v
+----------------------------------------------------------------+
|                        Client Interface                        |
|  - Job submission                                              |
|  - Progress monitoring                                         |
|  - Result retrieval                                            |
+----------------------------------------------------------------+
```
### Node Requirements
Compute nodes must meet minimum specifications:
| Tier | GPU | Memory | Network | Uptime SLA |
|---|---|---|---|---|
| Bronze | 4x A100 40GB | 256GB | 100 Gbps | 95% |
| Silver | 8x A100 80GB | 512GB | 200 Gbps | 99% |
| Gold | 8x H100 80GB | 1TB | 400 Gbps | 99.5% |
Nodes are verified through proof-of-work benchmarks before joining the network.
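To illustrate what such a benchmark might look like, here is a minimal throughput check. It assumes PyTorch on CUDA hardware; the actual verification suite is not specified in this post.

```python
# Hypothetical benchmark sketch (assumes PyTorch + CUDA); the real
# verification suite is not described in this post.
import time
import torch

def matmul_tflops(size: int = 8192, iters: int = 20) -> float:
    """Measure sustained bf16 matmul throughput on one GPU."""
    a = torch.randn(size, size, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(size, size, device="cuda", dtype=torch.bfloat16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * size**3 * iters / elapsed / 1e12  # 2*N^3 FLOPs per matmul

# A node would be admitted only if measured throughput matches its
# claimed hardware tier within some tolerance.
print(f"{matmul_tflops():.1f} TFLOPS")
```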
### Coordination Protocol
The coordination layer handles distributed training logistics:
#### Task Scheduling
Jobs are divided into tasks and assigned to available nodes:
```python
# Job submission
job = ComputeJob(
    model_config=model_config,
    data_config=data_config,
    training_config=training_config,
    budget_max=10000,  # ZEN tokens
)
job_id = await network.submit(job)
```
The scheduler optimizes for several factors (sketched in code after this list):
- Data locality (minimize transfers)
- Network topology (co-locate communicating nodes)
- Cost efficiency (use cheapest suitable nodes)
- Reliability (distribute across failure domains)
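As a purely illustrative sketch of how these criteria might be combined, consider a weighted node score. The `NodeInfo` fields, `score_node` function, and weights below are assumptions, not the actual scheduler.

```python
# Hypothetical scoring heuristic for illustration only; the actual
# scheduler and its weights are not specified in this post.
from dataclasses import dataclass

@dataclass
class NodeInfo:
    price_per_gpu_hour: float   # ZEN
    data_locality: float        # 0-1, fraction of training data cached locally
    topology_score: float       # 0-1, proximity to already-assigned nodes
    reliability: float          # 0-1, historical uptime

def score_node(node: NodeInfo, max_price: float) -> float:
    """Combine the four scheduling criteria into a single score."""
    if node.price_per_gpu_hour > max_price:
        return float("-inf")  # over budget, never selected
    cost_efficiency = 1.0 - node.price_per_gpu_hour / max_price
    return (
        0.3 * node.data_locality      # minimize transfers
        + 0.3 * node.topology_score   # co-locate communicating nodes
        + 0.2 * cost_efficiency       # prefer cheaper suitable nodes
        + 0.2 * node.reliability      # distribute across failure domains
    )

# Pick the highest-scoring candidate for a task
candidates = [NodeInfo(2.5, 0.8, 0.6, 0.99), NodeInfo(4.0, 0.2, 0.9, 0.95)]
best = max(candidates, key=lambda n: score_node(n, max_price=5.0))
```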
#### Gradient Aggregation
Distributed training requires gradient synchronization. The network supports:
- All-reduce for data-parallel training
- Point-to-point for pipeline/tensor parallelism
- Asynchronous updates for fault tolerance
Aggregation happens through a tree topology that minimizes bandwidth usage.
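Here is a minimal sketch of the idea, with in-memory arrays standing in for network transfers; the production protocol handles real node-to-node communication.

```python
# Minimal sketch of tree-structured gradient aggregation (pure Python,
# no network I/O); the production protocol is more involved.
import numpy as np

def tree_reduce(gradients: list[np.ndarray]) -> np.ndarray:
    """Pairwise-sum gradients in ~log2(n) rounds instead of n-1 sequential adds.

    Each round halves the number of participants, so no single node has to
    receive all n gradient streams at once.
    """
    buf = list(gradients)
    while len(buf) > 1:
        next_round = []
        for i in range(0, len(buf) - 1, 2):
            next_round.append(buf[i] + buf[i + 1])  # pairwise merge
        if len(buf) % 2 == 1:
            next_round.append(buf[-1])  # odd node passes through this round
        buf = next_round
    return buf[0] / len(gradients)  # mean gradient

grads = [np.random.randn(1024) for _ in range(24)]
averaged = tree_reduce(grads)
```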
#### Checkpoint Management
Training state is continuously checkpointed:
```python
# Automatic checkpointing
checkpoint_config = CheckpointConfig(
    interval=1000,  # steps
    storage="ipfs",
    redundancy=3,
)
```
Checkpoints are content-addressed and distributed. Training can resume from any node.
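As a sketch of what content addressing buys here: the checkpoint's identifier is a hash of its bytes, so any replica holding the same bytes can serve a resume request and be verified before use. IPFS uses multihash CIDs; plain SHA-256 is shown below for brevity.

```python
# Sketch of content addressing; IPFS CIDs are more elaborate than a raw
# SHA-256 digest, but the integrity property is the same.
import hashlib

def content_address(checkpoint_bytes: bytes) -> str:
    return hashlib.sha256(checkpoint_bytes).hexdigest()

state = b"...serialized model and optimizer state..."
cid = content_address(state)

# Any replica can verify integrity before resuming training:
assert content_address(state) == cid
```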
## Economics
### For Compute Providers
Providers stake ZEN tokens as collateral and earn rewards for:
- Uptime (base reward)
- Computation completed (work reward)
- Network contribution (bandwidth bonus)
Slashing occurs for:
- Downtime during committed periods
- Incorrect computation (detected via verification)
- Bandwidth violations
Expected returns: 15-25% APY on staked tokens plus hardware depreciation coverage.
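As a rough, hypothetical back-of-envelope for a Silver-tier provider: the on-chain reward formula is not published in this post, the stake size is an example figure, and whether providers or users bear the 5% treasury fee is an assumption.

```python
# Illustrative numbers only; the actual reward formula is not published
# in this post, and fee incidence on providers is assumed.
stake = 50_000                    # ZEN staked as collateral (example figure)
apy = 0.20                        # midpoint of the quoted 15-25% range
gpu_hours = 8 * 24 * 365 * 0.99   # 8 GPUs at the Silver 99% uptime SLA
price = 4.0                       # ZEN/GPU-hour, Silver tier

staking_reward = stake * apy      # 10,000 ZEN/year
work_revenue = gpu_hours * price  # ~277,517 ZEN/year at full utilization
payout = work_revenue * 0.95      # assuming the 5% fee is deducted here
print(f"{staking_reward:,.0f} + {payout:,.0f} ZEN/year")
```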
### For Users
Users pay per compute-hour in ZEN tokens:
| Tier | Price (ZEN/GPU-hour) | Approx. USD/GPU-hour |
|---|---|---|
| Bronze | 2.5 | $5 |
| Silver | 4.0 | $8 |
| Gold | 8.0 | $16 |
Prices are market-determined through ongoing auctions. Users can specify a maximum price and wait for availability, as sketched below.
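For example, a price-capped submission might look like the following; the `max_price` parameter is hypothetical and shown only to illustrate the auction mechanic.

```python
# `max_price` is a hypothetical parameter, not a documented API field.
job = ComputeJob(
    model_config=model_config,
    data_config=data_config,
    training_config=training_config,
    budget_max=10000,   # total spend cap in ZEN
    max_price=3.0,      # ZEN/GPU-hour: queue until nodes clear at or below this
)
job_id = await network.submit(job)
```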
### Network Fee
5% of payments go to the network treasury, funding:
- Protocol development
- Security audits
- Community grants
## Verification
How do we know compute was done correctly?
### Sampling-Based Verification
Random subsets of computation are re-run by verifiers. Discrepancies trigger investigation:
```
P(detection) = 1 - (1 - sample_rate)^n
```
With 1% sampling and 100 steps, detection probability is 63%. With 5% sampling, it’s 99.4%.
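The quoted figures follow directly from the formula, assuming each step is re-verified independently and a faulty node produces incorrect results at every step:

```python
def p_detect(sample_rate: float, n_steps: int) -> float:
    """Probability that at least one sampled re-run catches a bad step."""
    return 1 - (1 - sample_rate) ** n_steps

print(f"{p_detect(0.01, 100):.1%}")  # 63.4%
print(f"{p_detect(0.05, 100):.1%}")  # 99.4%
```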
### Gradient Consistency
Aggregated gradients are checked for statistical anomalies. Fabricated gradients have detectable patterns.
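As one illustrative check (the actual detectors are not described here), a verifier can flag a submitted gradient whose norm is a statistical outlier relative to its peers:

```python
# Illustrative anomaly check only; the network's real detectors are
# not specified in this post.
import numpy as np

def is_anomalous(grad: np.ndarray, peer_grads: list[np.ndarray],
                 z_max: float = 4.0) -> bool:
    """Flag a gradient whose norm is a z-score outlier among peers."""
    norms = np.array([np.linalg.norm(g) for g in peer_grads])
    z = (np.linalg.norm(grad) - norms.mean()) / (norms.std() + 1e-12)
    return abs(z) > z_max

honest = [np.random.randn(512) * 0.01 for _ in range(16)]
fabricated = np.zeros(512)  # lazily returning zeros is detectable
print(is_anomalous(fabricated, honest))  # True
```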
### Trusted Execution (Optional)
For high-value jobs, nodes can run inside TEE enclaves (Intel SGX, TDX), which provide cryptographic attestation of correct execution.
## Real-World Performance
We’ve run several training jobs on the network:
### Zen-2-7B Training
- Duration: 3 weeks
- Nodes used: 24 (rotating pool of 40)
- Total compute: 8,400 GPU-hours
- Cost: 21,000 ZEN (~$42,000)
- Efficiency: 89% of centralized equivalent
### Embedding Model Training
- Duration: 5 days
- Nodes used: 8
- Total compute: 960 GPU-hours
- Cost: 2,400 ZEN (~$4,800)
- Efficiency: 94% of centralized equivalent
Efficiency losses come from coordination overhead and network heterogeneity. Ongoing optimizations are closing the gap.
## Joining the Network
### As a Compute Provider
1. Hardware check: Verify your setup meets tier requirements
2. Software install: Run the Zoo Compute daemon
3. Stake: Lock ZEN tokens as collateral
4. Benchmark: Complete verification benchmarks
5. Operate: Maintain uptime and connectivity
Documentation: docs.zoo.ngo/compute/providers
### As a User
1. Install the client:

   ```bash
   pip install zoo-compute
   ```

2. Fund your account: Acquire ZEN tokens
3. Submit jobs: Use the API or CLI
```python
from zoo_compute import Client

client = Client(api_key="...")
job = client.train(
    config="./training_config.yaml",
    max_budget=5000,
)
await job.wait()
```
Documentation: docs.zoo.ngo/compute/users
## Roadmap
- Q3 2024: Public beta with 100+ nodes
- Q4 2024: Production launch, verification improvements
- Q1 2025: Cross-chain bridging for payments
- Q2 2025: TEE support for all tiers
## Conclusion
Decentralized compute is essential for decentralized AI. The Zoo Compute Network provides a permissionless, efficient, and verifiable substrate for training large models. As the network grows, it becomes more resilient and more accessible.
Compute should be a utility, not a moat.
Zach Kelling is a co-founder of Zoo Labs Foundation.