Overview
Fine-tuning adapts a pre-trained language model to your specific use case. Instead of training from scratch (which costs millions), you leverage existing knowledge and specialize it with your data.
This guide covers three approaches: LoRA (most popular), QLoRA (for limited VRAM), and Full Fine-tuning (best quality).
When to Fine-tune
- Domain adaptation: Medical, legal, finance terminology
- Style/tone: Match your brand voice
- Task specialization: Specific format outputs
- Knowledge injection: Company-specific information
Fine-tuning Methods
Choose based on your hardware, budget, and quality requirements:
- LoRA (Low-Rank Adaptation): Train small adapter layers instead of full model weights. Most popular method.
- QLoRA (Quantized LoRA): LoRA with 4-bit quantization. Train 70B models on consumer GPUs.
- Full Fine-tuning (All Parameters): Update all model weights. Best quality but requires significant resources.
LoRA Fine-tuning
LoRA (Low-Rank Adaptation) trains small adapter matrices instead of the full model. This reduces trainable parameters by 99%+ while maintaining quality.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Configure LoRA
lora_config = LoraConfig(
r=16, # Rank (8-64 typical)
lora_alpha=32, # Scaling factor
target_modules=[ # Which layers to adapt
"q_proj", "k_proj",
"v_proj", "o_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484
Advantages
- Train on consumer GPUs (16-24GB)
- Fast training (hours, not days)
- Small adapter files (~50MB)
- Easy to switch between adapters (see the loading sketch below)
Limitations
- Slightly lower quality than full FT
- May not capture all knowledge
- Requires careful rank selection
- Not ideal for major behavior changes
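Because adapters are only tens of megabytes, you can keep several specialized adapters and attach whichever one you need at inference time. Here is a minimal sketch using PEFT's PeftModel; the adapter paths and names are placeholders, not files produced earlier in this guide:
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load the frozen base model once
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach a saved LoRA adapter (path and name are placeholders)
model = PeftModel.from_pretrained(base, "./lora-adapter-support", adapter_name="support")

# Load a second adapter and switch the active one
model.load_adapter("./lora-adapter-legal", adapter_name="legal")
model.set_adapter("legal")

# Optionally fold the active adapter into the base weights for deployment
model = model.merge_and_unload()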
QLoRA (4-bit Quantization)
QLoRA combines LoRA with 4-bit quantization. Train 70B parameter models on a single 24GB GPU — something that normally requires 140GB+ VRAM.
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# Prepare the quantized model for training, then apply LoRA on top
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
# 70B model fine-tuning on 24GB VRAM!
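To stay within a 24GB budget, QLoRA runs are usually paired with gradient checkpointing and a paged 8-bit optimizer. A sketch of the relevant TrainingArguments; the values shown are illustrative starting points, not tuned settings:
from transformers import TrainingArguments

# Memory-oriented settings commonly used alongside QLoRA (illustrative values)
qlora_args = TrainingArguments(
    output_dir="./qlora-results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # keeps the effective batch size at 16
    gradient_checkpointing=True,      # trade extra compute for activation memory
    optim="paged_adamw_8bit",         # paged 8-bit optimizer from bitsandbytes
    learning_rate=2e-4,
    fp16=True,
)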
Full Fine-tuning
Full fine-tuning updates all model parameters. It delivers the best quality but requires significant resources, typically an A100 80GB or a multi-GPU setup.
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("your_dataset")
# Training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,  # good default for LoRA; full fine-tuning typically uses 1e-5 to 5e-5
fp16=True,
logging_steps=10,
save_strategy="epoch",
warmup_ratio=0.03,
lr_scheduler_type="cosine",
)
# Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],  # dataset must be tokenized before training
tokenizer=tokenizer,
)
# Start training
trainer.train()
# Save the fine-tuned model (for a LoRA run this saves only the adapter)
model.save_pretrained("./finetuned-model")
When to Use Full Fine-tuning
- Production models requiring maximum quality
- Significant domain shift from the base model
- Large training datasets (100K+ examples)
- Budget allows for A100/H100 compute
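After training finishes, it is worth reloading the saved weights for a quick generation check. A minimal sketch, assuming a full fine-tune was saved to ./finetuned-model as above (a LoRA-only run would instead load the adapter with PeftModel, as shown earlier):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Reload the fine-tuned weights (path matches the save_pretrained call above)
model = AutoModelForCausalLM.from_pretrained(
    "./finetuned-model",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quick smoke test
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))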
Data Preparation
Data quality is the biggest factor in fine-tuning success. Here's how to prepare your dataset:
# Instruction format (Alpaca style)
{
"instruction": "Summarize the following text",
"input": "The quick brown fox jumps over the lazy dog...",
"output": "A fox jumps over a dog."
}
# Chat format (ChatML style)
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
]
}
# Simple completion format
{
"text": "<s>[INST] What is AI? [/INST] AI stands for..."
}
- Quality > Quantity: 1,000 high-quality examples often beat 100,000 noisy ones.
- Match Your Use Case: Train on data similar to your production inputs.
- Diverse Examples: Include edge cases and variations in your training data.
- Consistent Format: Use the same prompt template for training and inference (see the sketch below).
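If you use the chat format, the easiest way to keep training and inference prompts identical is to render both through the tokenizer's chat template. A sketch assuming a local train.jsonl file of chat-format records and a tokenizer that ships a chat template (the Llama-2 chat checkpoints do; a raw base checkpoint may not):
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Each record looks like {"messages": [{"role": ..., "content": ...}, ...]}
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def to_text(example):
    # Render the message list with the model's own chat template so the
    # training text matches what the model will see at inference time
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(to_text, remove_columns=["messages"])
print(dataset[0]["text"][:200])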
Recommended Hyperparameters
| Parameter | Typical Value | Notes |
|---|---|---|
| Learning Rate | 1e-4 to 3e-4 | Lower for larger models |
| LoRA Rank (r) | 8-64 | Higher = more capacity, more VRAM |
| LoRA Alpha | 16-64 | Usually 2x rank |
| Batch Size | 4-32 | Use gradient accumulation if limited |
| Epochs | 1-5 | Watch for overfitting |
| Warmup | 3-10% | Of total steps |
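The batch size and warmup rows interact, so it helps to do the arithmetic before launching a run. A quick worked example with made-up dataset numbers:
# Back-of-the-envelope schedule math for the values in the table above
num_examples = 10_000
per_device_batch = 4
grad_accum = 8
num_gpus = 1
epochs = 3

effective_batch = per_device_batch * grad_accum * num_gpus   # 32
steps_per_epoch = num_examples // effective_batch            # 312
total_steps = steps_per_epoch * epochs                       # 936
warmup_steps = int(0.03 * total_steps)                       # 28 (3% warmup)
print(effective_batch, total_steps, warmup_steps)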
Training on Griddly
Griddly provides affordable GPU compute for LLM fine-tuning, with A100 80GB instances at 70% less than AWS.
# Install Griddly CLI
pip install griddly-cli
# Login
griddly login
# Create training job
griddly train create \
--name "llama-finetune" \
--gpu "A100-80GB" \
--script "train.py" \
--data "./dataset" \
--output "./checkpoints"
# Monitor training
griddly train logs llama-finetune
# Download results
griddly train download llama-finetune ./results

| Option | Price | Notes |
|---|---|---|
| A100 80GB | $0.80/hr | 70% less than AWS |
| RTX 4090 | $0.30/hr | Great for LoRA |
| Free Credits | $50 | To get started |