CUDA Training (Local GPU)

Train zen-coder-flash on NVIDIA GPUs with QLoRA.

Requirements

  • NVIDIA GPU with 24GB+ VRAM
  • CUDA 12.1+
  • Python 3.10+

Installation

pip install torch transformers accelerate peft bitsandbytes datasets

Single GPU Training

# Clone the repo
git clone https://github.com/zenlm/zen-coder-flash
cd zen-coder-flash

# Train with QLoRA
python training/train_cuda.py

Multi-GPU Training

# 4 GPUs
torchrun --nproc_per_node 4 training/train_cuda.py
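With the default per-device batch size of 2, the 4-GPU launch above trains on 8 sequences per optimizer step (assuming no gradient accumulation; check `train_cuda.py` for an accumulation setting):

```python
# Effective batch size for the 4-GPU torchrun launch above.
num_gpus = 4
per_device_batch = 2   # --batch-size default
grad_accum_steps = 1   # assumption: no gradient accumulation in train_cuda.py

effective_batch = num_gpus * per_device_batch * grad_accum_steps
print(effective_batch)  # → 8
```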

Options

Option        Default   Description
------        -------   -----------
--epochs      3         Training epochs
--batch-size  2         Per-device batch size
--lr          1e-4      Learning rate
--lora-rank   64        LoRA rank
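For example, to override the defaults (flag names are taken from the table above; the values here are illustrative):

```shell
# Longer run with a smaller, cheaper adapter
python training/train_cuda.py \
  --epochs 5 \
  --batch-size 1 \
  --lr 5e-5 \
  --lora-rank 32
```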

Configuration

The training uses 4-bit quantization (QLoRA):

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

LoRA Configuration

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,              # adapter rank
    lora_alpha=128,    # scaling factor (alpha / r = 2)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    bias="none",       # leave bias terms frozen
    task_type="CAUSAL_LM",
)

Expected Time

~2 hours on an RTX 4090 (24GB VRAM).
