Fine-Tuning Strategy: QLoRA
Full fine-tuning of a 2-billion-parameter model would require far more VRAM than a single consumer GPU provides. We adopted QLoRA (Quantized Low-Rank Adaptation), a highly memory-efficient fine-tuning technique.
- 4-bit Quantization: The base model is loaded with its weights quantized to 4-bit precision (`load_in_4bit=True`). This reduces the memory footprint of the static, non-trainable part of the model by a factor of 4.
- Low-Rank Adapters (LoRA): Instead of training the entire model, we only train small, "low-rank" matrices that are injected into the attention and feed-forward layers (the `target_modules` in `config.py`).
- Paged Optimizers: We use the `paged_adamw_8bit` optimizer, which pages optimizer states to CPU RAM when GPU VRAM is exhausted, allowing us to train with larger batch sizes than would otherwise be possible. A sketch of how these three pieces fit together follows this list.
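For illustration only, here is how these three pieces map onto the Hugging Face `transformers`/`peft`/`bitsandbytes` stack. The model name, LoRA rank, and `target_modules` below are assumed placeholders, not the actual values from `config.py`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 1. 4-bit quantization: frozen base weights stored in NF4, computation in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",          # placeholder 2B base model, not the project's actual checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# 2. Low-rank adapters: only these small injected matrices receive gradients.
lora_config = LoraConfig(
    r=16,                          # illustrative rank and alpha, not the values in config.py
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 2B total

# 3. Paged optimizer: 8-bit AdamW whose states can spill to CPU RAM under VRAM pressure.
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    bf16=True,
)
```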
This approach, streamlined by Unsloth's `FastModel` class, allows fine-tuning on a single consumer-grade GPU.
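As a minimal sketch of that Unsloth path, assuming `FastModel` exposes the same `from_pretrained`/`get_peft_model` interface as Unsloth's `FastLanguageModel` (the model name, sequence length, and LoRA hyperparameters are illustrative assumptions):

```python
from unsloth import FastModel

# Load the base model with 4-bit quantized weights (model name is a placeholder).
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-2-2b-it",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank, alpha, and target modules are illustrative,
# not the values defined in config.py.
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

The paged optimizer is still typically selected through the trainer's `optim` argument as in the earlier sketch; Unsloth's contribution is to streamline model loading and adapter injection.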