Overview
Fine-tuning adapts a pre-trained language model to your specific use case. Instead of training from scratch (which costs millions), you leverage existing knowledge and specialize it with your data.
This guide covers three approaches: LoRA (most popular), QLoRA (for limited VRAM), and Full Fine-tuning (best quality).
When to Fine-tune
- Domain adaptation: Medical, legal, finance terminology
- Style/tone: Match your brand voice
- Task specialization: Specific format outputs
- Knowledge injection: Company-specific information
Fine-tuning Methods
Choose based on your hardware, budget, and quality requirements:
- LoRA (Low-Rank Adaptation): Train small adapter layers instead of full model weights. Most popular method.
- QLoRA (Quantized LoRA): LoRA with 4-bit quantization. Train 70B models on consumer GPUs.
- Full Fine-tuning (All Parameters): Update all model weights. Best quality but requires significant resources.
LoRA Fine-tuning
LoRA (Low-Rank Adaptation) trains small adapter matrices instead of the full model. This reduces trainable parameters by 99%+ while maintaining quality.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Configure LoRA
lora_config = LoraConfig(
r=16, # Rank (8-64 typical)
lora_alpha=32, # Scaling factor
target_modules=[ # Which layers to adapt
"q_proj", "k_proj",
"v_proj", "o_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484
Advantages
- Train on consumer GPUs (16-24GB)
- Fast training (hours, not days)
- Small adapter files (~50MB)
- Easy to switch between adapters (see the loading sketch below)
Limitations
- Slightly lower quality than full FT
- May not capture all knowledge
- Requires careful rank selection
- Not ideal for major behavior changes
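Because adapters are only tens of megabytes, you can keep several specialized adapters and attach whichever one you need at inference time. Here is a minimal sketch using PEFT's PeftModel; the adapter paths and names are placeholders, not files produced earlier in this guide:
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load the frozen base model once
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach a saved LoRA adapter (path and name are placeholders)
model = PeftModel.from_pretrained(base, "./lora-adapter-support", adapter_name="support")

# Load a second adapter and switch the active one
model.load_adapter("./lora-adapter-legal", adapter_name="legal")
model.set_adapter("legal")

# Optionally fold the active adapter into the base weights for deployment
model = model.merge_and_unload()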
QLoRA (4-bit Quantization)
QLoRA combines LoRA with 4-bit quantization. Train 70B parameter models on a single 24GB GPU — something that normally requires 140GB+ VRAM.
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# Prepare the quantized model for training, then apply LoRA on top
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
# 70B model fine-tuning on 24GB VRAM!
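To stay within a 24GB budget, QLoRA runs are usually paired with gradient checkpointing and a paged 8-bit optimizer. A sketch of the relevant TrainingArguments; the values shown are illustrative starting points, not tuned settings:
from transformers import TrainingArguments

# Memory-oriented settings commonly used alongside QLoRA (illustrative values)
qlora_args = TrainingArguments(
    output_dir="./qlora-results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # keeps the effective batch size at 16
    gradient_checkpointing=True,      # trade extra compute for activation memory
    optim="paged_adamw_8bit",         # paged 8-bit optimizer from bitsandbytes
    learning_rate=2e-4,
    fp16=True,
)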
Full Fine-tuning
Full fine-tuning updates all model parameters. It delivers the best quality but requires significant resources, typically an A100 80GB or a multi-GPU setup.
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("your_dataset")
# Training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,  # good default for LoRA; full fine-tuning typically uses 1e-5 to 5e-5
fp16=True,
logging_steps=10,
save_strategy="epoch",
warmup_ratio=0.03,
lr_scheduler_type="cosine",
)
# Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],  # dataset must be tokenized before training
tokenizer=tokenizer,
)
# Start training
trainer.train()
# Save the fine-tuned model (for a LoRA run this saves only the adapter)
model.save_pretrained("./finetuned-model")
When to Use Full Fine-tuning
- Production models requiring maximum quality
- Significant domain shift from the base model
- Large training datasets (100K+ examples)
- Budget allows for A100/H100 compute
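After training finishes, it is worth reloading the saved weights for a quick generation check. A minimal sketch, assuming a full fine-tune was saved to ./finetuned-model as above (a LoRA-only run would instead load the adapter with PeftModel, as shown earlier):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Reload the fine-tuned weights (path matches the save_pretrained call above)
model = AutoModelForCausalLM.from_pretrained(
    "./finetuned-model",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quick smoke test
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))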
Data Preparation
Data quality is the biggest factor in fine-tuning success. Here's how to prepare your dataset:
# Instruction format (Alpaca style)
{
"instruction": "Summarize the following text",
"input": "The quick brown fox jumps over the lazy dog...",
"output": "A fox jumps over a dog."
}
# Chat format (ChatML style)
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
]
}
# Simple completion format
{
"text": "<s>[INST] What is AI? [/INST] AI stands for..."
}
- Quality > Quantity: 1,000 high-quality examples often beat 100,000 noisy ones.
- Match Your Use Case: Train on data similar to your production inputs.
- Diverse Examples: Include edge cases and variations in your training data.
- Consistent Format: Use the same prompt template for training and inference (see the sketch below).
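If you use the chat format, the easiest way to keep training and inference prompts identical is to render both through the tokenizer's chat template. A sketch assuming a local train.jsonl file of chat-format records and a tokenizer that ships a chat template (the Llama-2 chat checkpoints do; a raw base checkpoint may not):
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Each record looks like {"messages": [{"role": ..., "content": ...}, ...]}
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def to_text(example):
    # Render the message list with the model's own chat template so the
    # training text matches what the model will see at inference time
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(to_text, remove_columns=["messages"])
print(dataset[0]["text"][:200])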
Recommended Hyperparameters
| Parameter | Typical Value | Notes |
|---|---|---|
| Learning Rate | 1e-4 to 3e-4 | Lower for larger models |
| LoRA Rank (r) | 8-64 | Higher = more capacity, more VRAM |
| LoRA Alpha | 16-64 | Usually 2x rank |
| Batch Size | 4-32 | Use gradient accumulation if limited |
| Epochs | 1-5 | Watch for overfitting |
| Warmup | 3-10% | Of total steps |
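The batch size and warmup rows interact, so it helps to do the arithmetic before launching a run. A quick worked example with made-up dataset numbers:
# Back-of-the-envelope schedule math for the values in the table above
num_examples = 10_000
per_device_batch = 4
grad_accum = 8
num_gpus = 1
epochs = 3

effective_batch = per_device_batch * grad_accum * num_gpus   # 32
steps_per_epoch = num_examples // effective_batch            # 312
total_steps = steps_per_epoch * epochs                       # 936
warmup_steps = int(0.03 * total_steps)                       # 28 (3% warmup)
print(effective_batch, total_steps, warmup_steps)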
Training on Griddly
Griddly provides affordable GPU compute for LLM fine-tuning, with A100 80GB instances at 70% less than AWS.
# Install Griddly CLI
pip install griddly-cli
# Login
griddly login
# Create training job
griddly train create \
--name "llama-finetune" \
--gpu "A100-80GB" \
--script "train.py" \
--data "./dataset" \
--output "./checkpoints"
# Monitor training
griddly train logs llama-finetune
# Download results
griddly train download llama-finetune ./results

| Option | Price | Notes |
|---|---|---|
| A100 80GB | $0.80/hr | 70% less than AWS |
| RTX 4090 | $0.30/hr | Great for LoRA |
| Free Credits | $50 | To get started |