nanobot/ara

Fork 0

Files

T

nanobot b275af2b4d fix: dereference orchestra-skills submodule, add as plain files

2026-05-05 23:28:24 +02:00

11 KiB

Raw Blame History

Training Recipes

Complete hyperparameter configurations for LoRA, QLoRA, and full fine-tuning across different model sizes.

Overview

LitGPT provides optimized training configurations in config_hub/finetune/ for various model architectures and fine-tuning methods.

Key Configuration Files:

config_hub/finetune/*/lora.yaml - LoRA fine-tuning
config_hub/finetune/*/qlora.yaml - 4-bit quantized LoRA
config_hub/finetune/*/full.yaml - Full fine-tuning

LoRA Fine-tuning Recipes

TinyLlama 1.1B LoRA

Configuration:

global_batch_size: 8
micro_batch_size: 8
lr_warmup_steps: 10
epochs: 3
max_seq_length: 512

# LoRA specific
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05

Command:

litgpt finetune_lora TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
  --data JSON \
  --data.json_path data/alpaca_sample.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 8 \
  --train.lr_warmup_steps 10 \
  --train.epochs 3 \
  --train.max_seq_length 512 \
  --lora_r 8 \
  --lora_alpha 16

Memory: ~4GB VRAM Time: ~30 minutes on RTX 3090

Llama 2 7B LoRA

Configuration:

global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

# LoRA specific
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05

Command:

litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.lr_warmup_steps 10 \
  --train.epochs 4 \
  --lora_r 8 \
  --lora_alpha 16

Memory: ~16GB VRAM Gradient Accumulation: 4 steps (8 / 2) Time: ~6 hours on A100

Llama 3 8B LoRA

Configuration:

global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 10
epochs: 2
max_seq_length: 512

# LoRA specific
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05

Command:

litgpt finetune_lora meta-llama/Llama-3.2-8B \
  --data JSON \
  --data.json_path data/custom_dataset.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.lr_warmup_steps 10 \
  --train.epochs 2 \
  --lora_r 8

Memory: ~20GB VRAM Gradient Accumulation: 8 steps Time: ~8 hours on A100

Mistral 7B LoRA

Configuration:

global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16

Command:

litgpt finetune_lora mistralai/Mistral-7B-v0.1 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.epochs 4 \
  --lora_r 8

Memory: ~16GB VRAM

Phi-2 LoRA

Configuration:

global_batch_size: 8
micro_batch_size: 4
lr_warmup_steps: 10
epochs: 1
max_seq_length: 512

lora_r: 8
lora_alpha: 16

Command:

litgpt finetune_lora microsoft/phi-2 \
  --data JSON \
  --data.json_path data/alpaca_sample.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 4 \
  --train.epochs 1 \
  --lora_r 8

Memory: ~8GB VRAM Time: ~20 minutes on RTX 3090

Falcon 7B LoRA

Configuration:

global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16

Command:

litgpt finetune_lora tiiuae/falcon-7b \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.epochs 4 \
  --lora_r 8

Memory: ~18GB VRAM

Gemma 7B LoRA

Configuration:

global_batch_size: 6
micro_batch_size: 1
lr_warmup_steps: 200
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16

Command:

litgpt finetune_lora google/gemma-7b \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 6 \
  --train.micro_batch_size 1 \
  --train.lr_warmup_steps 200 \
  --train.epochs 2 \
  --lora_r 8

Memory: ~18GB VRAM Note: Longer warmup (200 steps) for stability

QLoRA Fine-tuning Recipes

QLoRA uses 4-bit quantization to reduce memory by ~75%.

TinyLlama 1.1B QLoRA

Configuration:

global_batch_size: 8
micro_batch_size: 8
lr_warmup_steps: 10
epochs: 3
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Command:

litgpt finetune_lora TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
  --quantize bnb.nf4 \
  --data JSON \
  --data.json_path data/alpaca_sample.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 8 \
  --train.epochs 3 \
  --lora_r 8

Memory: ~2GB VRAM (75% reduction)

Llama 2 7B QLoRA

Configuration:

global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512
min_lr: 6.0e-5

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Command:

litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --quantize bnb.nf4 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.epochs 4 \
  --lora_r 8

Memory: ~6GB VRAM (consumer GPU friendly)

Llama 3 8B QLoRA

Configuration:

global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Command:

litgpt finetune_lora meta-llama/Llama-3.2-8B \
  --quantize bnb.nf4 \
  --data JSON \
  --data.json_path data/custom_dataset.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.epochs 2 \
  --lora_r 8

Memory: ~8GB VRAM

Mistral 7B QLoRA

Configuration:

global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Memory: ~6GB VRAM

Phi-2 QLoRA

Configuration:

global_batch_size: 8
micro_batch_size: 4
lr_warmup_steps: 10
epochs: 1
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Memory: ~3GB VRAM

Falcon 7B QLoRA

Configuration:

global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Memory: ~6GB VRAM

Gemma 2B QLoRA

Configuration:

global_batch_size: 6
micro_batch_size: 2
lr_warmup_steps: 200
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Memory: ~3GB VRAM

Gemma 7B QLoRA

Configuration:

global_batch_size: 6
micro_batch_size: 1
lr_warmup_steps: 200
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"

Memory: ~6GB VRAM

Full Fine-tuning Recipes

Full fine-tuning updates all model parameters (requires more memory).

TinyLlama 1.1B Full

Configuration:

global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 100
epochs: 3
max_seq_length: 512
learning_rate: 5e-5

Command:

litgpt finetune_full TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.lr_warmup_steps 100 \
  --train.epochs 3 \
  --train.learning_rate 5e-5

Memory: ~12GB VRAM Time: ~4 hours on A100

Phi-2 Full

Configuration:

global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 100
epochs: 2
max_seq_length: 512
learning_rate: 3e-5

Command:

litgpt finetune_full microsoft/phi-2 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.epochs 2 \
  --train.learning_rate 3e-5

Memory: ~24GB VRAM

Common Hyperparameter Patterns

Learning Rates

Model Size	LoRA LR	Full Fine-tune LR
<2B	3e-4	5e-5
2-10B	1e-4	3e-5
10-70B	5e-5	1e-5

LoRA Rank (r)

r=8: Default, good balance (recommended)
r=16: More capacity, 2× trainable params
r=32: Maximum capacity, slower training
r=4: Minimal, fastest training

Rule of thumb: Start with r=8, increase if underfitting.

Batch Sizes

GPU VRAM	Micro Batch	Global Batch
8GB	1	8
16GB	2	8-16
40GB	4	16-32
80GB	8	32-64

Warmup Steps

Small models (<2B): 10-50 steps
Medium models (2-10B): 100-200 steps
Large models (>10B): 200-500 steps

Epochs

Instruction tuning: 1-3 epochs
Domain adaptation: 3-5 epochs
Small datasets (<10K): 5-10 epochs

Advanced Configurations

Custom Learning Rate Schedule

litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --train.learning_rate 3e-4 \
  --train.lr_warmup_steps 100 \
  --train.min_lr 3e-6 \
  --train.lr_decay_iters 10000

Gradient Accumulation

# Simulate global_batch_size=128 with 16GB GPU
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --train.global_batch_size 128 \
  --train.micro_batch_size 2
# Accumulates over 64 steps (128 / 2)

Mixed Precision

litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --precision bf16-mixed  # BF16 mixed precision
# or
  --precision 16-mixed  # FP16 mixed precision

Longer Context

litgpt finetune_lora meta-llama/Llama-3.1-8B \
  --train.max_seq_length 8192 \
  --train.micro_batch_size 1  # Reduce batch for memory

Memory Optimization

Out of Memory? Try These

Enable quantization:
```
--quantize bnb.nf4  # 4-bit QLoRA
```
Reduce batch size:
```
--train.micro_batch_size 1
```
Lower LoRA rank:
```
--lora_r 4  # Instead of 8
```

Use FSDP (multi-GPU):

litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --devices 4  # Use 4 GPUs with FSDP

Gradient checkpointing:
```
--train.gradient_accumulation_iters 16
```

Data Format

LitGPT expects JSON data in instruction format:

[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish:",
    "input": "Hello world",
    "output": "Hola mundo"
  }
]

Load custom data:

litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --data.val_split_fraction 0.1  # 10% validation

Merge and Deploy

After fine-tuning, merge LoRA weights:

litgpt merge_lora checkpoints/meta-llama/Llama-2-7b-hf/final_lora.pth

Generate with merged model:

litgpt generate checkpoints/meta-llama/Llama-2-7b-hf-merged/ \
  --prompt "What is machine learning?"

Or serve via API:

litgpt serve checkpoints/meta-llama/Llama-2-7b-hf-merged/

References

Configuration hub: config_hub/finetune/
Fine-tuning tutorial: tutorials/finetune_*.md
Memory guide: tutorials/oom.md

11 KiB Raw Blame History Unescape Escape

Training Recipes

Overview

LoRA Fine-tuning Recipes

TinyLlama 1.1B LoRA

Llama 2 7B LoRA

Llama 3 8B LoRA

Mistral 7B LoRA

Phi-2 LoRA

Falcon 7B LoRA

Gemma 7B LoRA

QLoRA Fine-tuning Recipes

TinyLlama 1.1B QLoRA

Llama 2 7B QLoRA

Llama 3 8B QLoRA

Mistral 7B QLoRA

Phi-2 QLoRA

Falcon 7B QLoRA

Gemma 2B QLoRA

Gemma 7B QLoRA

Full Fine-tuning Recipes

TinyLlama 1.1B Full

Phi-2 Full

Common Hyperparameter Patterns

Learning Rates

LoRA Rank (r)

Batch Sizes

Warmup Steps

Epochs

Advanced Configurations

Custom Learning Rate Schedule

Gradient Accumulation

Mixed Precision

Longer Context

Memory Optimization

Out of Memory? Try These

Data Format

Merge and Deploy

References

11 KiB

Raw Blame History