# CUDA Training (Local GPU)
Train zen-coder-flash on NVIDIA GPUs with QLoRA.
## Requirements
- NVIDIA GPU with 24GB+ VRAM
- CUDA 12.1+
- Python 3.10+
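A quick way to sanity-check these requirements before launching a run — a minimal sketch, tolerant of PyTorch not being installed yet (the 24 GB threshold mirrors the list above):

```python
import sys

def check_requirements(min_python=(3, 10), min_vram_gb=24):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, "
                        f"found {sys.version_info[0]}.{sys.version_info[1]}")
    try:
        import torch
        if not torch.cuda.is_available():
            problems.append("No CUDA device visible to PyTorch")
        else:
            vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
            if vram_gb < min_vram_gb:
                problems.append(f"GPU has {vram_gb:.0f} GB VRAM; "
                                f"{min_vram_gb} GB+ recommended")
    except ImportError:
        problems.append("PyTorch not installed")
    return problems

for problem in check_requirements():
    print("WARNING:", problem)
```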
## Installation
```bash
pip install torch transformers accelerate peft bitsandbytes datasets
```

## Single GPU Training
```bash
# Clone the repo
git clone https://github.com/zenlm/zen-coder-flash
cd zen-coder-flash

# Train with QLoRA
python training/train_cuda.py
```

## Multi-GPU Training
```bash
# Launch on 4 GPUs
torchrun --nproc_per_node 4 training/train_cuda.py
```

## Options
| Option | Default | Description |
|---|---|---|
| `--epochs` | 3 | Training epochs |
| `--batch-size` | 2 | Per-device batch size |
| `--lr` | 1e-4 | Learning rate |
| `--lora-rank` | 64 | LoRA rank |
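Note that `--batch-size` is per device: the effective (optimizer-step) batch size scales with the number of GPUs and any gradient accumulation. A small illustration (a gradient-accumulation factor of 1 is an assumption here, not a documented default):

```python
def effective_batch_size(per_device: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Samples consumed per optimizer step across all devices."""
    return per_device * num_gpus * grad_accum

# Default --batch-size 2 on the 4-GPU torchrun launch:
print(effective_batch_size(per_device=2, num_gpus=4))  # → 8
```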
## Configuration
The training uses 4-bit quantization (QLoRA):
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

### LoRA Configuration
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```

## Expected Time
~2 hours on an RTX 4090 (24GB VRAM).
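For intuition about adapter size, the number of trainable parameters added by the LoRA configuration above can be estimated. This sketch assumes square attention projections (not exact for models using grouped-query attention, where `k_proj`/`v_proj` are smaller) and hypothetical model dimensions:

```python
def lora_param_count(hidden_size: int, num_layers: int, rank: int,
                     num_targets: int = 4) -> int:
    """Trainable params for LoRA applied to square d x d projections.

    Each adapted projection gains two low-rank factors,
    A (rank x d) and B (d x rank), i.e. rank * 2d parameters.
    """
    per_module = rank * 2 * hidden_size
    return per_module * num_targets * num_layers

# Hypothetical dimensions: hidden size 4096, 32 layers, rank 64 as configured
print(lora_param_count(4096, 32, 64))  # → 67108864 (~67M trainable params)
```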