
Decentralized Compute for AI Training

How we're building a decentralized compute network for training large AI models.

By Zach Kelling
Infrastructure · Training · Decentralization

Training large AI models requires significant compute resources. These resources are concentrated in a few hyperscalers, creating bottlenecks and single points of control. Today we announce the Zoo Compute Network, a decentralized alternative.

The Compute Concentration Problem

Current AI training is dominated by a handful of hyperscale cloud providers. This concentration creates risks:

  1. Access barriers: Only well-funded organizations can train frontier models
  2. Single points of failure: Outages affect entire research programs
  3. Regulatory exposure: One jurisdiction can impact global AI development
  4. Vendor lock-in: Switching costs are enormous

The Zoo Compute Network

The Zoo Compute Network aggregates distributed GPU resources into a unified training substrate. Anyone with suitable hardware can contribute. Anyone can access the aggregated compute.

Architecture


    +------------------+     +------------------+     +------------------+
    |  Compute Node 1  |     |  Compute Node 2  |     |  Compute Node N  |
    |  (8x H100)       |     |  (4x A100)       |     |  (16x H100)      |
    +--------+---------+     +--------+---------+     +--------+---------+
             |                        |                        |
             v                        v                        v
    +-----------------------------------------------------------------------+
    |                         Coordination Layer                             |
    |  - Task scheduling                                                     |
    |  - Gradient aggregation                                                |
    |  - Checkpoint management                                               |
    |  - Payment settlement                                                  |
    +-----------------------------------------------------------------------+
                                      |
                                      v
    +-----------------------------------------------------------------------+
    |                           Client Interface                             |
    |  - Job submission                                                      |
    |  - Progress monitoring                                                 |
    |  - Result retrieval                                                    |
    +-----------------------------------------------------------------------+
    

Node Requirements

Compute nodes must meet minimum specifications:

Tier    GPU           Memory  Network   Uptime SLA
Bronze  4x A100 40GB  256GB   100 Gbps  95%
Silver  8x A100 80GB  512GB   200 Gbps  99%
Gold    8x H100 80GB  1TB     400 Gbps  99.5%

Nodes are verified through proof-of-work benchmarks before joining the network.
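The tier requirements above amount to a simple eligibility check. The sketch below illustrates the idea; the `TIER_REQUIREMENTS` table and `qualifies` helper are hypothetical names, not part of the actual daemon:

```python
# Minimum specs per tier (memory in GB, network in Gbps, uptime as a fraction)
TIER_REQUIREMENTS = {
    "bronze": {"gpus": 4, "mem_gb": 256, "net_gbps": 100, "uptime": 0.95},
    "silver": {"gpus": 8, "mem_gb": 512, "net_gbps": 200, "uptime": 0.99},
    "gold":   {"gpus": 8, "mem_gb": 1024, "net_gbps": 400, "uptime": 0.995},
}

def qualifies(node: dict, tier: str) -> bool:
    """True if every measured node attribute meets the tier's minimum."""
    req = TIER_REQUIREMENTS[tier]
    return all(node.get(key, 0) >= minimum for key, minimum in req.items())
```

A node that clears the benchmark but later falls below its tier's uptime SLA would be re-evaluated against the same table.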

Coordination Protocol

The coordination layer handles distributed training logistics:

Task Scheduling

Jobs are divided into tasks and assigned to available nodes:


    # Job submission
    job = ComputeJob(
        model_config=model_config,
        data_config=data_config,
        training_config=training_config,
        budget_max=10000,  # ZEN tokens
    )
    
    job_id = await network.submit(job)
    

The scheduler assigns tasks to nodes based on hardware tier, availability, and the job's budget constraints.

Gradient Aggregation

Distributed training requires gradient synchronization across nodes. Aggregation happens through a tree topology that minimizes bandwidth usage.
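A tree reduction sums gradients pairwise up the levels of a tree, so each node exchanges data with only a few peers instead of all of them. The sketch below illustrates the idea; `tree_aggregate` and its `fan_in` parameter are hypothetical, not the network's wire protocol:

```python
def tree_aggregate(gradients, fan_in=2):
    """Reduce a list of gradient vectors by summing them up a tree.

    At each level, groups of `fan_in` vectors are summed elementwise,
    halving (for fan_in=2) the number of messages per round versus
    all-to-all exchange.
    """
    level = [list(g) for g in gradients]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            # Elementwise sum within the group
            nxt.append([sum(vals) for vals in zip(*group)])
        level = nxt
    return level[0]
```

In practice each node would hold one vector and the reduction would run over the network, but the arithmetic is the same.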

Checkpoint Management

Training state is continuously checkpointed:


    # Automatic checkpointing
    checkpoint_config = CheckpointConfig(
        interval=1000,  # steps
        storage="ipfs",
        redundancy=3,
    )
    

Checkpoints are content-addressed and distributed. Training can resume from any node.
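Content addressing means a checkpoint's identifier is derived from its bytes, so any node can verify that what it fetched matches what was stored. A minimal sketch using a SHA-256 digest over a canonical serialization (`content_address` is a hypothetical helper; IPFS uses its own CID format):

```python
import hashlib
import json

def content_address(state: dict) -> str:
    """Derive a deterministic content ID for a checkpoint payload.

    Serializing with sorted keys makes the digest independent of
    dict insertion order, so identical state always yields the same ID.
    """
    blob = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Because the ID is a pure function of the content, a resuming node can fetch a checkpoint from any peer and verify it locally before trusting it.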

Economics

For Compute Providers

Providers stake ZEN tokens as collateral and earn rewards for completed, verified training work and for meeting their tier's uptime SLA. Stake is slashed for missed uptime, failed verification, or other protocol violations.

Expected returns: 15-25% APY on staked tokens plus hardware depreciation coverage.

For Users

Users pay per compute-hour in ZEN tokens:

Tier    Price (ZEN/GPU-hour)  Approx. USD
Bronze  2.5                   $5
Silver  4.0                   $8
Gold    8.0                   $16

Prices are market-determined through ongoing auctions. Users can specify maximum price and wait for availability.
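At the posted tier prices, a job's cost is simply price × GPUs × hours. A hypothetical estimator (`job_cost` is not a real API, and actual prices float with the auctions):

```python
# Posted per-tier prices in ZEN per GPU-hour (auction prices may differ)
TIER_PRICE_ZEN = {"bronze": 2.5, "silver": 4.0, "gold": 8.0}

def job_cost(tier: str, gpus: int, hours: float) -> float:
    """Estimated job cost in ZEN at the posted tier price."""
    return TIER_PRICE_ZEN[tier] * gpus * hours
```

For example, ten hours on a Gold node's eight H100s at the posted price would run 8.0 × 8 × 10 = 640 ZEN.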

Network Fee

5% of payments go to the network treasury, which funds ongoing development and operation of the network.

Verification

How do we know compute was done correctly?

Sampling-Based Verification

Random subsets of computation are re-run by verifiers. Discrepancies trigger investigation:


    P(detection) = 1 - (1 - sample_rate)^n
    

With 1% sampling and 100 steps, detection probability is 63%. With 5% sampling, it’s 99.4%.
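The formula above is easy to check directly. A small sketch (`detection_probability` is an illustrative helper, not network code):

```python
def detection_probability(sample_rate: float, n_steps: int) -> float:
    """Probability that at least one of n_steps sampled re-runs
    catches a node that cheats on every step."""
    return 1 - (1 - sample_rate) ** n_steps
```

With `detection_probability(0.01, 100)` the result is about 0.634, matching the 63% figure above; raising the sample rate to 5% pushes it past 99%.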

Gradient Consistency

Aggregated gradients are checked for statistical anomalies. Fabricated gradients have detectable patterns.
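One simple screen of this kind compares each node's gradient norm against the cohort median and flags norms far outside the pack for re-verification. This is an illustrative sketch using a robust median-absolute-deviation statistic, not the network's actual detector:

```python
import statistics

def flag_anomalous_norms(grad_norms, threshold=5.0):
    """Return indices of gradient norms far from the cohort median.

    Uses median absolute deviation (MAD), which, unlike a mean/stdev
    z-score, is not dragged toward a single extreme outlier.
    """
    med = statistics.median(grad_norms)
    mad = statistics.median(abs(g - med) for g in grad_norms)
    if mad == 0:
        return []  # all norms (nearly) identical; nothing stands out
    return [i for i, g in enumerate(grad_norms)
            if abs(g - med) / mad > threshold]
```

A node repeatedly flagged by such a screen would be escalated to full sampling-based re-verification.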

Trusted Execution (Optional)

For high-value jobs, nodes can run inside TEE enclaves (Intel SGX or TDX), which provide cryptographic attestation of correct execution.

Real-World Performance

We’ve run several training jobs on the network:

Zen-2-7B Training

Embedding Model Training

Efficiency losses come from coordination overhead and network heterogeneity. Ongoing optimizations are closing the gap.

Joining the Network

As a Compute Provider

  1. Hardware check: Verify your setup meets tier requirements
  2. Software install: Run the Zoo Compute daemon
  3. Stake: Lock ZEN tokens as collateral
  4. Benchmark: Complete verification benchmarks
  5. Operate: Maintain uptime and connectivity

Documentation: docs.zoo.ngo/compute/providers

As a User

  1. Install client: pip install zoo-compute
  2. Fund account: Acquire ZEN tokens
  3. Submit jobs: Use the API or CLI

    from zoo_compute import Client

    client = Client(api_key="...")

    # Submit a training job with a spending cap (in ZEN tokens)
    job = client.train(
        config="./training_config.yaml",
        max_budget=5000,
    )

    # Block until the job completes (call from within an async context)
    await job.wait()
    

Documentation: docs.zoo.ngo/compute/users

Roadmap

Q3 2024: Public beta with 100+ nodes
Q4 2024: Production launch, verification improvements
Q1 2025: Cross-chain bridging for payments
Q2 2025: TEE support for all tiers

Conclusion

Decentralized compute is essential for decentralized AI. The Zoo Compute Network provides a permissionless, efficient, and verifiable substrate for training large models. As the network grows, it becomes more resilient and more accessible.

Compute should be a utility, not a moat.


Zach Kelling is a co-founder of Zoo Labs Foundation.