Fine-Tuning Strategy: QLoRA
Full fine-tuning of a 2-billion-parameter model would require far more VRAM than a single consumer GPU provides. We adopted QLoRA (Quantized Low-Rank Adaptation), a highly memory-efficient fine-tuning technique.
- 4-bit Quantization: The base model is loaded with its weights quantized to 4-bit precision (`load_in_4bit=True`). This reduces the memory footprint of the static, non-trainable part of the model by a factor of 4.
- Low-Rank Adapters (LoRA): Instead of training the entire model, we only train small, "low-rank" matrices that are injected into the attention and feed-forward layers (the `target_modules` in `config.py`).
- Paged Optimizers: We use the `paged_adamw_8bit` optimizer, which pages optimizer states to CPU RAM when GPU VRAM is exhausted, allowing us to train with larger batch sizes than would otherwise be possible. A sketch of how these three pieces fit together follows this list.
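For illustration only, here is how these three pieces map onto the Hugging Face `transformers`/`peft`/`bitsandbytes` stack. The model name, LoRA rank, and `target_modules` below are assumed placeholders, not the actual values from `config.py`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 1. 4-bit quantization: frozen base weights stored in NF4, computation in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",          # placeholder 2B base model, not the project's actual checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# 2. Low-rank adapters: only these small injected matrices receive gradients.
lora_config = LoraConfig(
    r=16,                          # illustrative rank and alpha, not the values in config.py
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 2B total

# 3. Paged optimizer: 8-bit AdamW whose states can spill to CPU RAM under VRAM pressure.
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    bf16=True,
)
```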
This approach, streamlined by Unsloth's `FastModel` class, allows fine-tuning on a single consumer-grade GPU.
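As a minimal sketch of that Unsloth path, assuming `FastModel` exposes the same `from_pretrained`/`get_peft_model` interface as Unsloth's `FastLanguageModel` (the model name, sequence length, and LoRA hyperparameters are illustrative assumptions):

```python
from unsloth import FastModel

# Load the base model with 4-bit quantized weights (model name is a placeholder).
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-2-2b-it",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank, alpha, and target modules are illustrative,
# not the values defined in config.py.
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

The paged optimizer is still typically selected through the trainer's `optim` argument as in the earlier sketch; Unsloth's contribution is to streamline model loading and adapter injection.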