Tutorial
18 min read

How to Fine-tune LLMs: Complete Guide with Code Examples

Learn to fine-tune large language models like Llama, Mistral, and GPT. Covers LoRA, QLoRA, and full fine-tuning with practical code and Griddly deployment.

At a glance: 3 methods (LoRA, QLoRA, full fine-tuning) · 8GB minimum VRAM (with QLoRA) · $3-500 estimated cost depending on method · 5+ code examples included
Griddly Team
Updated December 2025

Overview

Fine-tuning adapts a pre-trained language model to your specific use case. Instead of training from scratch (which costs millions), you leverage existing knowledge and specialize it with your data.

This guide covers three approaches: LoRA (most popular), QLoRA (for limited VRAM), and Full Fine-tuning (best quality).

When to Fine-tune

  • Domain adaptation: Medical, legal, finance terminology
  • Style/tone: Match your brand voice
  • Task specialization: Specific format outputs
  • Knowledge injection: Company-specific information

Fine-tuning Methods

Choose based on your hardware, budget, and quality requirements:

LoRA

Low-Rank Adaptation · Training speed: Fast

Train small adapter layers instead of full model weights. Most popular method.

VRAM: 16-24GB
Quality: Good
Est. Cost: $5-20
Best For: Most use cases, limited hardware

QLoRA

Quantized LoRA · Training speed: Medium

LoRA on top of a 4-bit quantized base model. Fine-tune very large models on modest GPUs.

VRAM: 8-16GB
Quality: Good
Est. Cost: $3-15
Best For: Large models on limited VRAM

Full Fine-tuning

All Parameters · Training speed: Slow

Update all model weights. Best quality but requires significant resources.

VRAM: 80GB+ (A100)
Quality: Best
Est. Cost: $50-500
Best For: Production models, maximum quality

LoRA Fine-tuning

LoRA (Low-Rank Adaptation) trains small adapter matrices instead of the full model. This reduces trainable parameters by 99%+ while maintaining quality.

LoRA Configuration
Recommended
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                    # Rank (8-64 typical)
    lora_alpha=32,           # Scaling factor
    target_modules=[         # Which layers to adapt
        "q_proj", "k_proj",
        "v_proj", "o_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output for this config (r=16, four attention projections):
# trainable params: 16,777,216 || all params: 6,755,192,832
# || trainable%: 0.2484
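
Because the adapter is tiny and the base model stays frozen, you can keep one copy of the base weights and attach a different adapter per task. A minimal sketch using peft's PeftModel (the adapter path matches the save call used later in this guide):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the frozen base model
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach a trained adapter (a directory written by model.save_pretrained)
model = PeftModel.from_pretrained(base, "./lora-adapter")

# Optional: fold the adapter into the base weights for adapter-free inference
model = model.merge_and_unload()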

Advantages

  • Train on consumer GPUs (16-24GB)
  • Fast training (hours, not days)
  • Small adapter files (~50MB)
  • Easy to switch between adapters

Limitations

  • Slightly lower quality than full FT
  • May not capture all knowledge
  • Requires careful rank selection
  • Not ideal for major behavior changes

QLoRA (4-bit Quantization)

QLoRA combines LoRA with 4-bit quantization, letting you fine-tune 70B-parameter models on a single 48GB GPU instead of the 140GB+ of VRAM they would otherwise require.

QLoRA Setup
Low VRAM
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the quantized model for training, then apply LoRA on top
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)  # lora_config from the LoRA section
# 70B fine-tuning now fits in roughly 36-48GB of VRAM

VRAM Requirements with QLoRA

Model    VRAM (QLoRA)
7B       ~6GB
13B      ~10GB
33B      ~18GB
70B      ~36GB
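
These figures are dominated by the quantized weights themselves: at 4 bits, weights cost roughly half a byte per parameter, with adapters, activations, and CUDA overhead adding the remainder. A quick sanity check (the helper below is illustrative, not a library function):

# 4-bit weight footprint: ~0.5 bytes per parameter
def qlora_weight_gb(params: float) -> float:
    return params * 0.5 / 1e9

print(f"{qlora_weight_gb(70e9):.0f}GB")  # 35GB of weights for a 70B model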

Full Fine-tuning

Full fine-tuning updates all model parameters. Best quality but requires significant resources — typically A100 80GB or multi-GPU setups.

Training Loop
from transformers import TrainingArguments, Trainer, AutoTokenizer
from datasets import load_dataset

# Load tokenizer and your dataset (already tokenized with this tokenizer)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("your_dataset")

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16
    learning_rate=2e-4,             # LoRA-scale; use ~1e-5 for full fine-tuning
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
)

# Start training
trainer.train()

# Save the result (for LoRA/QLoRA runs this writes just the adapter)
model.save_pretrained("./lora-adapter")
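
Before deploying, run a quick smoke test on a held-out prompt. A minimal generation sketch, assuming the model and tokenizer from the training script and a Llama-2-style prompt format (match whatever format you trained on):

# Generate from the fine-tuned model
prompt = "[INST] Summarize: The quick brown fox jumps over the lazy dog. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))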

When to Use Full Fine-tuning

  • Production models requiring maximum quality
  • Significant domain shift from base model
  • Large training datasets (100K+ examples)
  • Budget allows for A100/H100 compute

Data Preparation

Data quality is the biggest factor in fine-tuning success. Here's how to prepare your dataset:

Data Formats
# Instruction format (Alpaca style)
{
    "instruction": "Summarize the following text",
    "input": "The quick brown fox jumps over the lazy dog...",
    "output": "A fox jumps over a dog."
}

# Chat format (ChatML style)
{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Machine learning is..."}
    ]
}

# Simple completion format
{
    "text": "<s>[INST] What is AI? [/INST] AI stands for..."
}
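
If you use the chat format, recent transformers tokenizers can render a messages list into the model's native prompt string via apply_chat_template, which keeps formatting consistent between training and inference. A sketch, assuming a checkpoint that ships a chat template (e.g. the -chat variants):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
]

# Render the messages into the model's native prompt string
text = tokenizer.apply_chat_template(messages, tokenize=False)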

Quality > Quantity

1,000 high-quality examples often beat 100,000 noisy ones.

Match Your Use Case

Train on data similar to your production inputs.

Diverse Examples

Include edge cases and variations in your training data.

Consistent Format

Use the same prompt template for training and inference.
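
The simplest way to enforce this is to define the template once and import it from both your dataset builder and your inference code. A trivial sketch (the template shown is the Llama-2 instruction style; substitute your own):

# One template shared by training and inference
PROMPT_TEMPLATE = "<s>[INST] {instruction} [/INST] "

def build_prompt(instruction: str) -> str:
    return PROMPT_TEMPLATE.format(instruction=instruction)

train_text = build_prompt("What is AI?") + "AI stands for..."
infer_text = build_prompt("What is AI?")  # guaranteed to match training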

Recommended Hyperparameters

Parameter        Typical Value    Notes
Learning Rate    1e-4 to 3e-4     Lower for larger models
LoRA Rank (r)    8-64             Higher = more capacity, more VRAM
LoRA Alpha       16-64            Usually 2x rank
Batch Size       4-32             Use gradient accumulation if limited
Epochs           1-5              Watch for overfitting
Warmup           3-10%            Of total steps

Training on Griddly

Griddly provides affordable GPU compute for LLM fine-tuning. Access A100 80GB at 70% less than AWS.

Griddly CLI
# Install Griddly CLI
pip install griddly-cli

# Login
griddly login

# Create training job
griddly train create \
    --name "llama-finetune" \
    --gpu "A100-80GB" \
    --script "train.py" \
    --data "./dataset" \
    --output "./checkpoints"

# Monitor training
griddly train logs llama-finetune

# Download results
griddly train download llama-finetune ./results

  • A100 80GB: $0.80/hr (70% less than AWS)
  • RTX 4090: $0.30/hr (great for LoRA)
  • Free credits: $50 to get started

Best Practices & Common Mistakes

Common Mistakes to Avoid

  • Overfitting: Use a validation set, early stopping, and fewer epochs (see the sketch after this list).
  • Wrong learning rate: Start with 2e-4 for LoRA, 1e-5 for full fine-tuning.
  • Inconsistent prompts: Use the exact same format for training and inference.
  • Too little data: Aim for at least 100-500 examples for LoRA, more for full fine-tuning.
  • Ignoring the base model: Choose a base model that is already good at your task type.
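
To guard against overfitting concretely, wire a validation split into the Trainer and stop when eval loss stalls. A sketch building on the earlier training setup, assuming your dataset has a validation split (the argument is eval_strategy on recent transformers, evaluation_strategy on older releases):

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Evaluate every epoch, keep the best checkpoint, stop if eval loss stalls
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)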

Ready to Fine-tune?

Start fine-tuning your LLM on Griddly Cloud. A100 80GB from $0.80/hr with $50 free credits.