# Training

## Cloud Training (8x H200)

Full-scale training on Nebius or similar cloud providers.
Requirements
- 8x NVIDIA H200 (141GB each)
- SLURM cluster or Docker environment
- ~8 hours training time
- ~$288 estimated cost
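The ~$288 estimate is consistent with 8 GPUs running for ~8 hours at roughly $4.50 per GPU-hour. A quick sanity check (the hourly rate is an assumption inferred from the figures above, not a quoted provider price):

```python
# Assumed rate of ~$4.50/GPU-hour, inferred from the $288 / 8 GPUs / 8 h figures above
gpu_hourly_rate = 4.50
num_gpus = 8
training_hours = 8

estimated_cost = gpu_hourly_rate * num_gpus * training_hours
print(f"${estimated_cost:.2f}")  # → $288.00
```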
### Configuration

Training config at `training/configs/8xh200.yaml`:

```yaml
# Model
model_name: zenlm/zen-coder-flash
output_dir: ./zen-coder-flash-lora

# Hardware
num_gpus: 8
gpu_type: h200
total_batch_size: 128
per_device_batch_size: 2
gradient_accumulation_steps: 8

# LoRA
lora_rank: 64
lora_alpha: 128
lora_dropout: 0.05
target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

# Training
learning_rate: 1e-4
num_train_epochs: 3
warmup_ratio: 0.03
max_seq_length: 8192

# Dataset
dataset: hanzoai/zen-agentic-dataset-private
```
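The effective (total) batch size in the config is the product of the per-device batch size, the gradient accumulation steps, and the GPU count. A quick check that the numbers above are consistent:

```python
# Values from training/configs/8xh200.yaml
num_gpus = 8
per_device_batch_size = 2
gradient_accumulation_steps = 8

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # → 128, matching total_batch_size in the config
```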
### Launch Training

```bash
# Clone the repo
git clone https://github.com/zenlm/zen-coder-flash
cd zen-coder-flash

# Dry run
python training/launch_training.py --dry-run

# Launch on Nebius
python training/launch_training.py --config training/configs/8xh200.yaml

# Local Docker (for testing)
python training/launch_training.py --local
```
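The launcher accepts `--config`, `--dry-run`, and `--local` as shown above. A minimal sketch of how such flag handling might look; the internals of the real `training/launch_training.py` are not shown in this document and may differ:

```python
import argparse

# Hypothetical sketch of the launcher's CLI; only the flag names
# (--config, --dry-run, --local) come from the commands above.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Launch zen-coder-flash training")
    parser.add_argument("--config", default="training/configs/8xh200.yaml",
                        help="Path to the training config YAML")
    parser.add_argument("--dry-run", action="store_true",
                        help="Validate the config and print the job plan without submitting")
    parser.add_argument("--local", action="store_true",
                        help="Run in a local Docker container instead of SLURM")
    return parser.parse_args(argv)

args = parse_args(["--dry-run"])
print(args.config, args.dry_run, args.local)
```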
### SLURM Job

The launcher generates a SLURM job script:
```bash
#!/bin/bash
#SBATCH --job-name=zen-coder-flash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00

srun python training/scripts/train.py
```
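Generating the script above from config values can be sketched as follows; this is a hypothetical template, and the real launcher's generation logic may differ:

```python
# Hypothetical sketch: render the SLURM job script from config values.
# The directive names mirror the generated script shown above.
def make_slurm_script(job_name: str, gpus_per_node: int, time_limit: str) -> str:
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        "srun python training/scripts/train.py",
    ]
    return "\n".join(lines) + "\n"

script = make_slurm_script("zen-coder-flash", 8, "24:00:00")
print(script)
```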
## HuggingFace Spaces Alternative

For smaller-scale training, use HuggingFace Spaces:

- Create a new HF Space with a GPU (T4/A10G/A100)
- Upload `training/hf_space/app.py` and `requirements.txt`
- Train via the Gradio UI

Cost: ~$0.60/hr for a T4 GPU.