This section walks through the LoRA (Low-Rank Adaptation) fine-tuning process, a widely used Parameter-Efficient Fine-Tuning (PEFT) method for adapting large language models.
1. Background – Why LoRA?
- Challenge: Modern LLMs (GPT, LLaMA, etc.) have billions of parameters, and even BERT-scale models have hundreds of millions.
- Full fine-tuning → expensive: it requires large GPU memory and a full copy of the model for every task.
- Updating all weights is impractical for most enterprises and researchers.
- Solution: LoRA (Low-Rank Adaptation)
- Proposed by researchers at Microsoft (Hu et al., 2021).
- Instead of updating full weight matrices, LoRA adds small, trainable low-rank matrices to specific layers.
- This drastically reduces trainable parameters while keeping model performance strong.
2. Core Idea of LoRA
- In transformer models, large matrices (e.g., Wq, Wk, Wv, Wo in attention) are updated during fine-tuning.
- LoRA freezes the original weights and adds:
- Two small matrices A and B (low-rank decomposition).
- Update rule:
W′ = W + BA
where:
- W = frozen pre-trained weights
- A, B = trainable low-rank matrices (rank << dimension of W)
- Only A and B are trained → reduces trainable parameters drastically (a minimal code sketch follows below).
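A minimal PyTorch sketch of this idea, using a hypothetical LoRALinear module (not any library's actual implementation): the pre-trained weight W is stored frozen, and only the low-rank factors A and B are trainable.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA layer: y = Wx + BAx, with W frozen."""

    def __init__(self, in_features: int, out_features: int, r: int = 8):
        super().__init__()
        # W stands in for a pre-trained weight; it is frozen (requires_grad=False).
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) * 0.02, requires_grad=False
        )
        # Low-rank factors: A starts small-random, B starts at zero, so BA = 0 initially.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path Wx plus the trainable low-rank correction (BA)x.
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(in_features=4096, out_features=4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")  # 65,536 / 16,842,752
```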
3. LoRA Fine-Tuning Workflow
Step 1: Choose Target Layers
- LoRA is usually applied to attention layers (query/key/value projections).
- Example: Apply LoRA to Wq and Wv in each transformer block (a configuration sketch follows below).
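A configuration sketch using the Hugging Face peft library; the module names "q_proj" and "v_proj" follow LLaMA's naming and differ between model families, so treat them as an assumption for this example.

```python
from peft import LoraConfig

# Sketch with the Hugging Face peft library; "q_proj"/"v_proj" are LLaMA's
# projection-module names and differ between architectures.
lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,                         # scaling applied to the BA update
    target_modules=["q_proj", "v_proj"],   # apply LoRA only to query/value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```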
Step 2: Insert Low-Rank Decomposition
- For each target weight W ∈ R^{d×k}, add:
- Matrix B ∈ R^{d×r}, initialized to zero.
- Matrix A ∈ R^{r×k}, initialized with small random (Gaussian) values.
- With this convention BA has the same shape as W, and BA = 0 at the start of training, so fine-tuning begins exactly from the pre-trained model (a shape sketch follows below).
- Rank r is small (e.g., 4, 8, 16).
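A small NumPy sketch of this convention with toy shapes: BA has the same shape as W, starts at zero, and only a tiny fraction of W's parameter count is trainable.

```python
import numpy as np

# Toy shapes for one target weight W in R^{d x k}.
d, k, r = 4096, 1024, 8
A = np.random.randn(r, k) * 0.01   # trainable, random init
B = np.zeros((d, r))               # trainable, zero init -> BA = 0 at the start
delta_W = B @ A                    # same shape as W, rank at most r
print(delta_W.shape)               # (4096, 1024)
print(np.abs(delta_W).max())       # 0.0: training starts from the unchanged model
print((d * r + r * k) / (d * k))   # ~0.0098, i.e. ~1% of W's parameters are trainable
```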
Step 3: Freeze Original Model Parameters
- The large pre-trained model weights are kept fixed.
- Only the LoRA parameters (A and B) are trained (a peft-based sketch follows below).
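In practice the freezing is handled by the fine-tuning library. A sketch with peft, reusing the LoraConfig from Step 1; the checkpoint name is a placeholder and any causal-LM checkpoint works the same way.

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model

# Base checkpoint name is a placeholder.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# get_peft_model wraps the base model: it freezes every original weight and
# injects trainable LoRA matrices into the modules named in lora_config (Step 1).
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```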
Step 4: Training with LoRA
- Input passes through the transformer as usual.
- Instead of computing Wx, the model computes:
y = Wx + BAx
- Backpropagation updates only A and B (a toy training-loop sketch follows below).
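A toy training-loop sketch with synthetic data, reusing the LoRALinear module from Section 2, to make the point concrete: the optimizer only ever sees A and B.

```python
import torch
import torch.nn as nn

# Reuses the LoRALinear sketch from Section 2; data here is synthetic.
layer = LoRALinear(in_features=64, out_features=64, r=4)

# Only parameters with requires_grad=True (lora_A and lora_B) go to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in layer.parameters() if p.requires_grad], lr=1e-3
)

x = torch.randn(32, 64)        # toy input batch
target = torch.randn(32, 64)   # toy targets
for _ in range(100):
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()            # gradients reach only lora_A and lora_B
    optimizer.step()
    optimizer.zero_grad()
```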
Step 5: Save and Deploy
- At inference, the final weight is:
W′ = W + BA
- Storage-efficient: only the LoRA adapters (the A and B matrices) are saved, a few MB instead of many GB.
- The adapters can be merged into the base model or loaded alongside it at runtime (a deployment sketch follows below).
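A deployment sketch with the peft API; paths and checkpoint names are placeholders, and "model" is the LoRA-wrapped model from the training step above.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Save only the adapter: a small file holding the A/B matrices.
model.save_pretrained("out/lora-adapter")

# Option 1: attach the adapter to a freshly loaded base model at inference time.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
inference_model = PeftModel.from_pretrained(base, "out/lora-adapter")

# Option 2: fold BA into W once (W′ = W + BA) so inference adds no extra latency.
merged = inference_model.merge_and_unload()
merged.save_pretrained("out/merged-model")  # full-size checkpoint, no adapter needed
```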
4. Advantages of LoRA
✅ Massive efficiency → orders of magnitude fewer trainable parameters (the LoRA paper reports reductions of up to ~10,000× for GPT-3).
✅ No need to store full fine-tuned models → just adapters.
✅ Composable → multiple LoRA adapters can be loaded on one base model (e.g., medical + legal; a sketch follows this list).
✅ Performance → nearly matches full fine-tuning.
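A sketch of the composability point using peft adapter management; the adapter paths and names here are made up for illustration.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach one adapter at load time and register a second alongside it
# (adapter paths and names are hypothetical).
model = PeftModel.from_pretrained(base, "adapters/medical", adapter_name="medical")
model.load_adapter("adapters/legal", adapter_name="legal")

model.set_adapter("medical")   # route requests through the medical adapter
# ... run medical-domain generation ...
model.set_adapter("legal")     # switch domains without reloading the base model
```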
5. Example: Training LLaMA with LoRA
- Base model: LLaMA-7B (7 billion parameters).
- Full fine-tuning: the fp16 weights alone occupy ~14 GB, and gradients plus optimizer states push memory needs well beyond a single consumer GPU.
- With LoRA (rank = 8) on the attention projections: only a few million trainable parameters, saved as an adapter of a few to tens of MB (size depends on which modules are adapted).
- Outcome: roughly 95–98% of full fine-tuning performance at a fraction of the cost (a back-of-the-envelope parameter count follows below).
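A back-of-the-envelope calculation behind these numbers, assuming rank 8 applied only to the query and value projections; actual counts depend on which modules are adapted.

```python
# Approximate LLaMA-7B attention dimensions, rank 8 on q_proj and v_proj only.
n_layers, d_model, r = 32, 4096, 8
per_matrix = d_model * r + r * d_model   # one B (d x r) plus one A (r x d)
total = n_layers * 2 * per_matrix        # two adapted projections per block
print(total)                             # 4,194,304 trainable parameters (~0.06% of 7B)
print(total * 2 / 1e6, "MB in fp16")     # ~8.4 MB of adapter weights at 2 bytes each
```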
6. Limitations of LoRA
❌ Results are best documented for transformer language models; effectiveness can vary for other architectures and modalities.
❌ Rank selection (r) governs the trade-off between performance and efficiency and usually needs tuning.
❌ Can underperform full fine-tuning under extreme domain shift.
7. Use Cases
- Chatbot customization (finance, healthcare, education).
- Domain adaptation (legal, biomedical NLP).
- Multi-lingual tuning (adapters per language).
- Low-resource environments (deployable on consumer GPUs).
✅ Summary:
LoRA fine-tuning = freeze big model + inject small trainable adapters (A, B matrices) into key layers.
This gives efficient, modular, and cost-effective fine-tuning without retraining billions of parameters.