
Training Gym: A Platform for Open Model Development

Announcing Training Gym, our open platform for collaborative large model training.

By Zach Kelling
Infrastructure, Training, Open Source

Training large language models requires more than algorithms. It requires infrastructure: distributed training frameworks, data pipelines, experiment tracking, and evaluation harnesses. Today we open source Training Gym, our complete platform for model development.

Why Training Gym?

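Open AI development faces an infrastructure gap. Publishing model weights is valuable, but it is not enough. Researchers also need the infrastructure that produced them: distributed training frameworks, data pipelines, experiment tracking, and evaluation harnesses.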

Training Gym provides all of this in an integrated, open source package.

Architecture


    +------------------+     +------------------+     +------------------+
    |   Data Pipeline  | --> |  Training Loop   | --> |   Evaluation     |
    +------------------+     +------------------+     +------------------+
             |                       |                        |
             v                       v                        v
    +------------------+     +------------------+     +------------------+
    |   Data Registry  |     | Checkpoint Store |     |   Metrics DB     |
    +------------------+     +------------------+     +------------------+
                                    |
                                    v
                        +------------------+
                        |   Experiment     |
                        |   Tracker        |
                        +------------------+
    

Data Pipeline

The data pipeline handles weighted mixing of data sources and tokenization to a fixed sequence length:


    from training_gym.data import DataPipeline, MixedDataset
    
    pipeline = DataPipeline(
        sources=[
            ("s3://data/books", 0.3),
            ("s3://data/web", 0.5),
            ("s3://data/code", 0.2),
        ],
        tokenizer="zoo-labs/zen-tokenizer",
        sequence_length=2048,
    )
    
    dataset = pipeline.build()
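
How `DataPipeline` mixes its sources internally is not shown here. As a rough sketch of what weighted mixing can look like, the snippet below reuses the paths and weights from the example and picks a source for each sample with probability equal to its weight. The `sample_sources` helper and its `seed` parameter are hypothetical and not part of Training Gym.

```python
import random
from collections import Counter

# Paths and weights copied from the DataPipeline example; the weights sum to 1.0.
SOURCES = [
    ("s3://data/books", 0.3),
    ("s3://data/web", 0.5),
    ("s3://data/code", 0.2),
]


def sample_sources(sources, num_samples, seed=0):
    """Draw `num_samples` source names, each chosen with probability equal to its weight.

    Hypothetical helper for illustration only, not a Training Gym API.
    """
    rng = random.Random(seed)  # local RNG so the draw is repeatable
    names = [name for name, _ in sources]
    weights = [weight for _, weight in sources]
    return rng.choices(names, weights=weights, k=num_samples)


draws = sample_sources(SOURCES, num_samples=10_000)
# Observed share of each source; these approach the configured weights as the sample grows.
fractions = {name: count / len(draws) for name, count in Counter(draws).items()}
```

Using a seeded local generator rather than the global `random` state keeps the mixture repeatable across runs, which matters once the same data order has to be reproduced.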
    

Distributed Training

Training Gym supports multiple distributed training strategies:

Configuration is declarative:


    distributed:
      strategy: fsdp
      world_size: 64
      sharding_strategy: full_shard
      mixed_precision: bf16
      gradient_checkpointing: true
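
To make `world_size: 64` concrete, here is a small worked example of how a global batch splits across ranks. It assumes the `batch_size` of 2048 used in this post's training examples counts sequences per optimizer step, and it introduces a hypothetical `micro_batch_size` (the number of sequences one GPU processes per forward pass) that does not appear in the config above.

```python
def gradient_accumulation_steps(global_batch_size, world_size, micro_batch_size):
    """Return how many micro-batches each rank accumulates per optimizer step."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world_size")
    per_rank = global_batch_size // world_size
    if per_rank % micro_batch_size != 0:
        raise ValueError("per-rank batch must be divisible by micro_batch_size")
    return per_rank // micro_batch_size


GLOBAL_BATCH_SIZE = 2048   # sequences per optimizer step (assumed meaning of batch_size)
WORLD_SIZE = 64            # from the distributed config
MICRO_BATCH_SIZE = 4       # ASSUMPTION: hypothetical value chosen for illustration
SEQUENCE_LENGTH = 2048     # from the data pipeline example

per_rank_batch = GLOBAL_BATCH_SIZE // WORLD_SIZE            # 32 sequences per rank
accumulation = gradient_accumulation_steps(
    GLOBAL_BATCH_SIZE, WORLD_SIZE, MICRO_BATCH_SIZE
)                                                           # 8 micro-batches per step
tokens_per_step = GLOBAL_BATCH_SIZE * SEQUENCE_LENGTH       # 4,194,304 tokens per step
```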
    

Training Loop

The training loop is modular and extensible:


    from training_gym import Trainer, TrainingConfig
    
    config = TrainingConfig(
        model="zen-7b",
        optimizer="adamw",
        learning_rate=1e-4,
        batch_size=2048,
        max_steps=100000,
        warmup_steps=2000,
        weight_decay=0.1,
    )
    
    trainer = Trainer(config)
    trainer.fit(dataset)
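
The config names `warmup_steps` and `max_steps` but does not say which schedule `Trainer` applies. One common choice is linear warmup followed by cosine decay; the sketch below shows that schedule with the values from the example. The cosine decay and the `min_lr` parameter are assumptions, not documented Training Gym behavior.

```python
import math


def learning_rate(step, peak_lr=1e-4, warmup_steps=2000, max_steps=100_000, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr` at `max_steps`.

    ASSUMPTION: the post-warmup shape is a guess; Training Gym's schedule is not stated.
    """
    if step < warmup_steps:
        # Ramp up linearly so the final warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past max_steps.
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, the end of warmup, the midpoint of decay, and the end.
schedule = {step: learning_rate(step) for step in (0, 1999, 51_000, 100_000)}
```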
    

Built-in features:

Evaluation

Standardized evaluation across common benchmarks:


    from training_gym.eval import Evaluator
    
    evaluator = Evaluator(
        benchmarks=["mmlu", "hellaswag", "winogrande", "arc"],
        model=model,
    )
    
    results = evaluator.run()
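
The four benchmarks in this example are multiple-choice tasks that are usually reported as accuracy. How `Evaluator` scores them is not described here; a minimal sketch of the usual approach is to score every answer choice (for instance by log-likelihood) and count a question as correct when the top-scoring choice matches the gold answer. The function name and the toy scores below are made up for illustration.

```python
def multiple_choice_accuracy(choice_scores, gold_indices):
    """Fraction of questions where the highest-scoring choice is the gold answer."""
    if len(choice_scores) != len(gold_indices):
        raise ValueError("need one gold index per question")
    if not choice_scores:
        raise ValueError("need at least one question")
    correct = 0
    for scores, gold in zip(choice_scores, gold_indices):
        # Index of the best-scoring choice for this question.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == gold:
            correct += 1
    return correct / len(gold_indices)


# Toy per-choice log-likelihoods for three four-way questions (made-up numbers).
scores = [
    [-2.1, -0.4, -3.0, -1.7],  # best choice is index 1
    [-0.9, -1.2, -0.3, -2.5],  # best choice is index 2
    [-1.5, -1.1, -2.2, -0.8],  # best choice is index 3
]
gold = [1, 2, 0]
accuracy = multiple_choice_accuracy(scores, gold)  # two of three questions correct
```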
    

Supported benchmarks:

Experiment Tracking

Every training run is tracked:


    from training_gym import Experiment
    
    with Experiment("zen-7b-v2") as exp:
        exp.log_config(config)
        trainer.fit(dataset)
        exp.log_metrics(results)
        exp.log_artifacts(["model.pt", "tokenizer/"])
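
To show why a context manager suits this job, here is a minimal stand-in that collects a run record and writes it as JSON when the block exits, whether or not training raised. `MiniExperiment`, its record layout, and the `toy_metric` value are hypothetical; they only mirror the `log_config` and `log_metrics` calls above and are not the real `Experiment` implementation.

```python
import json
import tempfile
import time
from pathlib import Path


class MiniExperiment:
    """Hypothetical tracker: gathers a run record and saves it as JSON on exit."""

    def __init__(self, name, root):
        self.path = Path(root) / f"{name}.json"
        self.record = {"name": name, "config": {}, "metrics": {}, "status": "running"}

    def __enter__(self):
        self.record["started_at"] = time.time()
        return self

    def log_config(self, config):
        self.record["config"].update(config)

    def log_metrics(self, metrics):
        self.record["metrics"].update(metrics)

    def __exit__(self, exc_type, exc, tb):
        # Save the outcome even if the body raised, then let any exception propagate.
        self.record["status"] = "failed" if exc_type else "completed"
        self.record["finished_at"] = time.time()
        self.path.write_text(json.dumps(self.record, indent=2))
        return False


with tempfile.TemporaryDirectory() as root:
    with MiniExperiment("zen-7b-v2", root) as exp:
        exp.log_config({"learning_rate": 3e-4, "batch_size": 2048})
        exp.log_metrics({"toy_metric": 0.5})  # placeholder value, not a real result
    saved = json.loads((Path(root) / "zen-7b-v2.json").read_text())
```

Because `__exit__` runs on both success and failure, a crashed run still leaves a record marked `failed`, which is usually what you want when comparing experiments later.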
    

The experiment tracker records:

Reproducibility

Training Gym emphasizes reproducibility:

Deterministic Training


    reproducibility:
      seed: 42
      deterministic_algorithms: true
      cublas_workspace_config: ":4096:8"
    

With the same seed on the same hardware and software stack, repeated runs produce the same results (within floating point precision).
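
In PyTorch terms, settings like these typically correspond to seeding the random number generators, setting the `CUBLAS_WORKSPACE_CONFIG` environment variable, and calling `torch.use_deterministic_algorithms(True)`. The sketch below is an assumption about what such a config maps to, not Training Gym's actual code. The `seed_everything` helper is hypothetical, and it skips the PyTorch calls when `torch` is not installed.

```python
import os
import random


def seed_everything(seed, cublas_workspace_config=":4096:8"):
    """Seed the RNGs and request deterministic kernels. Returns True if torch was configured.

    Hypothetical helper for illustration only, not a Training Gym API.
    """
    # cuBLAS reads this variable when it initializes, so set it before any CUDA work.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = cublas_workspace_config
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return False  # torch is optional in this sketch
    torch.manual_seed(seed)                   # seeds the CPU and CUDA generators
    torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic ops
    return True


torch_configured = seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()  # same value as `first`, because the seed was reset
```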

Environment Capture

Every experiment records:

Configuration as Code

All configs are versioned YAML:


    # experiments/zen-7b-v2.yaml
    model:
      architecture: llama
      hidden_size: 4096
      num_layers: 32
      num_heads: 32
      vocab_size: 32000
    
    training:
      batch_size: 2048
      learning_rate: 3e-4
      max_steps: 150000
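
As a sanity check that these dimensions describe a model of roughly 7 billion parameters, the sketch below counts weights for a LLaMA-style transformer. The YAML does not give the feed-forward width, so `intermediate_size=11008` (the value used by the original 7B LLaMA) is an assumption, as are untied input and output embeddings, bias-free projections, and standard multi-head attention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    hidden_size: int = 4096
    num_layers: int = 32
    num_heads: int = 32
    vocab_size: int = 32000
    intermediate_size: int = 11008  # ASSUMPTION: not in the YAML above


def count_parameters(cfg):
    """Count weights in a LLaMA-style decoder under the assumptions stated above."""
    embeddings = cfg.vocab_size * cfg.hidden_size      # input token embeddings
    lm_head = cfg.vocab_size * cfg.hidden_size         # untied output projection
    attention = 4 * cfg.hidden_size ** 2               # Q, K, V, and output projections
    mlp = 3 * cfg.hidden_size * cfg.intermediate_size  # gate, up, and down projections
    norms = 2 * cfg.hidden_size                        # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = cfg.hidden_size
    return embeddings + lm_head + cfg.num_layers * per_layer + final_norm


total = count_parameters(ModelConfig())  # 6,738,415,616, about 6.7 billion
```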
    

Community Features

Training Gym includes tools for collaborative development:

Model Registry

Share and discover models:


    from training_gym.registry import ModelRegistry
    
    registry = ModelRegistry()
    
    # Publish a model
    registry.push("my-org/my-model", model, config)
    
    # Load a model
    model = registry.pull("zoo-labs/zen-7b")
    

Leaderboards

Automatic benchmark submission:


    evaluator.submit_to_leaderboard(
        model_name="zen-7b-v2",
        organization="zoo-labs",
    )
    

Dataset Sharing


    from training_gym.data import DatasetRegistry
    
    # Share processed datasets
    DatasetRegistry.push("my-corpus", dataset, license="cc-by-4.0")
    
    # Load shared datasets
    dataset = DatasetRegistry.pull("zoo-labs/zen-pretrain-v1")
    

Getting Started

Installation


    pip install training-gym
    
    # For distributed training (the quotes stop shells such as zsh from expanding the brackets)
    pip install "training-gym[distributed]"
    
    # For evaluation suite
    pip install "training-gym[eval]"
    

Quick Start


    from training_gym import quickstart
    
    # Train a small model to verify setup
    quickstart.train_tiny_model()
    
    # Run evaluation suite
    quickstart.evaluate_model("my-model")
    

Documentation

Full documentation is available at docs.training-gym.ai.

Roadmap

Q4 2023: Multi-modal training support

Q1 2024: Reinforcement learning from human feedback (RLHF) integration

Q2 2024: Federated training support

Q3 2024: Automated hyperparameter optimization

Conclusion

Open AI development needs open infrastructure. Training Gym provides the tools to train, evaluate, and share models. Join us in building the future of open AI.

Repository: github.com/zoo-labs/training-gym


Zach Kelling is a co-founder of Zoo Labs Foundation.