What is Distributed Training?
Distributed training splits the work of training a neural network across multiple GPUs or machines. This enables training larger models and reduces training time from weeks to days.
Why Distributed Training?
- Train models too large for a single GPU's memory
- Reduce training time (8 GPUs ≈ 6-7x speedup)
- Use larger effective batch sizes across GPUs
- Scale to billion+ parameter models
Data Parallelism
Data parallelism is the simplest form of distributed training: each GPU holds a full copy of the model and processes a different batch of data, and gradients are synchronized (all-reduced) across GPUs after each step.
```python
# PyTorch DDP example (one process per GPU, typically launched with torchrun)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group (NCCL backend for GPU-to-GPU communication)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap the model with DDP; gradients are all-reduced across GPUs automatically
model = DDP(model.to(local_rank), device_ids=[local_rank])

# Training loop (same as single GPU); the model is assumed to return the loss
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch.to(local_rank))
    loss.backward()        # gradient synchronization happens here
    optimizer.step()
```
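In practice the script is launched with one process per GPU, for example `torchrun --nproc_per_node=8 train.py` (the script name is a placeholder); torchrun sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that `init_process_group` and the snippet above read.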
Pros
- Easy to implement
- Works with any model architecture
- Near-linear scaling up to 8-16 GPUs
Cons
- The full model must fit in a single GPU's memory
- Communication overhead grows with scale
- Memory redundancy: parameters, gradients, and optimizer state are replicated on every GPU
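To make the memory redundancy concrete, a back-of-envelope estimate using the accounting from the ZeRO paper: mixed-precision Adam keeps roughly 16 bytes of state per parameter, and plain data parallelism replicates all of it on every GPU.

```python
# Rough per-GPU state for data-parallel mixed-precision Adam training:
# 2 B fp16 params + 2 B fp16 grads + 12 B fp32 master weights, momentum, variance
bytes_per_param = 16
params = 7.5e9                                               # example model size
print(f"{params * bytes_per_param / 1e9:.0f} GB per GPU")    # ~120 GB, before activations
```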
Model Parallelism
When a model is too large for a single GPU, you can split the model itself across multiple GPUs. Several strategies exist, differing in what gets split:
- Data Parallelism: same model on each GPU, different data batches
- Model Parallelism: model split across GPUs, same data (a minimal sketch follows this list)
- Pipeline Parallelism: model layers split across GPUs and run as a pipeline
- Tensor Parallelism: individual layers split across GPUs
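To make the distinction concrete, here is a minimal sketch of the naive model-parallel case; `TwoGPUNet` and the layer sizes are illustrative, and two visible GPUs are assumed:

```python
# Naive model parallelism: the first half of the network lives on cuda:0,
# the second half on cuda:1, and activations are moved between devices in forward().
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))
```

In this naive form only one GPU is busy at a time; pipeline parallelism addresses that by splitting each batch into micro-batches so the stages overlap, while tensor parallelism instead splits the weight matrices inside individual layers.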
Frameworks & Tools
| Framework | Use Case | Difficulty |
|---|---|---|
| PyTorch DDP | Data parallelism | Easy |
| DeepSpeed | ZeRO optimization | Medium |
| FSDP | Fully sharded training | Medium |
| Megatron-LM | Large LLM training | Hard |
| Horovod | Multi-framework | Medium |
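Wrapping a model with PyTorch FSDP looks much like DDP from the training loop's point of view; a minimal sketch, assuming the process group is already initialized as in the DDP example above and `model` is your own module:

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer state are sharded
# across ranks instead of being replicated on every GPU.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(model.cuda())   # the training loop itself stays the same as with DDP
```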
DeepSpeed ZeRO Configuration
A ZeRO-3 configuration with parameter and optimizer offload to CPU:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu" },
    "offload_optimizer": { "device": "cpu" }
  },
  "fp16": { "enabled": true },
  "gradient_accumulation_steps": 4
}
```
Best Practices
Distributed Training on Griddly
Griddly Cloud provides multi-GPU instances with NVLink interconnects, perfect for distributed training. Scale from 1 to 8 GPUs instantly.