CUDA Training (Local GPU)

Train zen-coder-flash on NVIDIA GPUs with QLoRA.

Requirements

  • NVIDIA GPU with 24GB+ VRAM
  • CUDA 12.1+
  • Python 3.10+

Installation

pip install torch transformers accelerate peft bitsandbytes datasets

Single GPU Training

# Clone the repo
git clone https://github.com/zenlm/zen-coder-flash
cd zen-coder-flash

# Train with QLoRA
python training/train_cuda.py

Multi-GPU Training

# 4 GPUs
torchrun --nproc_per_node 4 training/train_cuda.py
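With the default per-device batch size of 2, the 4-GPU launch above trains on 8 sequences per optimizer step (assuming no gradient accumulation; check `train_cuda.py` for an accumulation setting):

```python
# Effective batch size for the 4-GPU torchrun launch above.
num_gpus = 4
per_device_batch = 2   # --batch-size default
grad_accum_steps = 1   # assumption: no gradient accumulation in train_cuda.py

effective_batch = num_gpus * per_device_batch * grad_accum_steps
print(effective_batch)  # → 8
```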

Options

Option        Default   Description
------        -------   -----------
--epochs      3         Training epochs
--batch-size  2         Per-device batch size
--lr          1e-4      Learning rate
--lora-rank   64        LoRA rank
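For example, to override the defaults (flag names are taken from the table above; the values here are illustrative):

```shell
# Longer run with a smaller, cheaper adapter
python training/train_cuda.py \
  --epochs 5 \
  --batch-size 1 \
  --lr 5e-5 \
  --lora-rank 32
```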

Configuration

The training uses 4-bit quantization (QLoRA):

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

LoRA Configuration

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,              # adapter rank
    lora_alpha=128,    # scaling factor (alpha / r = 2)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    bias="none",       # leave bias terms frozen
    task_type="CAUSAL_LM",
)

Expected Time

~2 hours on an RTX 4090 (24GB VRAM).
