miles Troubleshooting Guide

FP8 Training Issues

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values, reward collapses

Solutions:

  1. Use block scaling:
--fp8-recipe blockwise
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
  2. Enable R3 (Rollout Routing Replay) for MoE models:
--use-r3
  3. Reduce learning rate:
--lr 5e-7  # Reduce from 1e-6
  4. Warm up from BF16:
--warmup-steps 100
--warmup-precision bf16
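
To tell when a collapse begins, it can help to scan the logged loss values for non-finite entries and sudden spikes. The following is a minimal sketch, not part of miles: the function name, `spike_factor`, and `window` are hypothetical, and it assumes positive loss values collected in a plain Python list.

```python
import math

def find_unstable_steps(losses, spike_factor=3.0, window=5):
    """Return step indices where the loss is non-finite or spikes.

    A spike is a loss larger than `spike_factor` times the mean of the
    last `window` healthy losses. Both thresholds are illustrative.
    """
    unstable = []
    recent = []  # the most recent healthy (finite, non-spike) losses
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            unstable.append(step)
            continue
        if len(recent) == window and loss > spike_factor * (sum(recent) / window):
            unstable.append(step)
            continue  # keep spikes out of the running window
        recent.append(loss)
        if len(recent) > window:
            recent.pop(0)
    return unstable
```

The first flagged step is a reasonable place to start comparing against a BF16 run or to roll back to a checkpoint.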

Issue: FP8 vs BF16 Accuracy Gap

Symptoms: FP8 model underperforms BF16 baseline

Solutions:

  1. Use E4M3 format for activations:
--fp8-format e4m3
  2. Enable dynamic scaling:
--fp8-dynamic-scaling
  3. Skip sensitive layers:
--fp8-skip-layers "lm_head,embed"

Train-Inference Mismatch Issues

Issue: Policy Divergence

Symptoms: Model behavior differs between training and inference

Solutions:

  1. Enable Rollout Routing Replay (R3):
--use-r3
  2. Use importance sampling correction:
--use-tis --tis-threshold 0.9
  3. Verify that training and rollout log probs match:
--verify-logprobs
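
The idea behind the last two steps can be sketched in plain Python: measure the per-token gap between the log probs computed by the trainer and those reported by the rollout engine, and down-weight tokens with a truncated importance ratio. This is an illustrative sketch, not the miles implementation; the function names and the `cap` parameter are hypothetical, and `cap` is not claimed to correspond to `--tis-threshold`.

```python
import math

def logprob_mismatch(train_logprobs, rollout_logprobs):
    """Mean absolute per-token gap between training and rollout log probs."""
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not train_logprobs:
        raise ValueError("log-prob sequences must not be empty")
    gaps = [abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]
    return sum(gaps) / len(gaps)

def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratio pi_train / pi_rollout, truncated from above at `cap`."""
    return [min(math.exp(t - r), cap)
            for t, r in zip(train_logprobs, rollout_logprobs)]
```

A mismatch close to zero suggests the two engines agree; a persistently large value points to a numerical or routing difference worth investigating before relying on the correction.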

Issue: Expert Routing Mismatch (MoE)

Symptoms: Different experts activated during train vs inference

Solutions:

  1. Enable R3:
--use-r3
--r3-buffer-size 1000
  2. Use deterministic routing:
--deterministic-expert-routing
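
To quantify a routing mismatch, one option is to compare the expert sets chosen per token during rollout with those chosen in the training forward pass. The sketch below assumes both are available as per-token lists of expert ids for a single layer; that data layout and the function name are assumptions, not the miles API.

```python
def routing_agreement(rollout_experts, train_experts):
    """Fraction of tokens whose selected expert set matches in both passes.

    Each argument is a list with one entry per token; each entry is an
    iterable of expert ids (order within a token is ignored).
    """
    if len(rollout_experts) != len(train_experts):
        raise ValueError("both passes must cover the same number of tokens")
    if not rollout_experts:
        return 1.0  # nothing to disagree on
    matches = sum(
        1 for r, t in zip(rollout_experts, train_experts) if set(r) == set(t)
    )
    return matches / len(rollout_experts)
```

With R3 enabled, the agreement would be expected to sit at or near 1.0, since training replays the rollout routing.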

INT4 Training Issues

Issue: INT4 Accuracy Degradation

Symptoms: Worse performance than BF16 or FP8

Solutions:

  1. Reduce group size (smaller groups give finer-grained quantization scales):
--int4-group-size 64  # Reduce from 128
  2. Use mixed precision for sensitive layers:
--int4-skip-layers "lm_head,embed,layer_norm"
  3. Warm start from BF16:
--warmup-steps 100
--warmup-precision bf16
  4. Increase learning rate (INT4 often needs higher LR):
--lr 2e-6  # Increase from 1e-6

Issue: INT4 OOM Despite Expected Savings

Symptoms: Still running out of memory with INT4

Solutions:

  1. Verify that the environment variable is set:
export OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1
  2. Check group size alignment:
# Group size must divide the hidden dimension evenly
--int4-group-size 128  # Must divide hidden_size
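
The alignment rule is simple modular arithmetic and can be checked before launching a run. A minimal sketch follows; the function names and the candidate list are illustrative, and `hidden_size` would come from the model config.

```python
def check_group_size(hidden_size, group_size):
    """Return True if `group_size` divides `hidden_size` evenly."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return hidden_size % group_size == 0

def valid_group_sizes(hidden_size, candidates=(32, 64, 128, 256)):
    """List the candidate group sizes that divide `hidden_size` evenly."""
    return [g for g in candidates if hidden_size % g == 0]
```

For example, a hidden size of 4096 works with a group size of 128, while a hidden size of 4000 does not.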

Speculative RL Issues

Issue: Low Acceptance Rate

Symptoms: Draft model tokens frequently rejected

Solutions:

  1. Reduce lookahead:
--spec-lookahead 3  # Reduce from 5
  2. Update draft more frequently:
--online-sft-interval 5  # Reduce from 10
  3. Increase draft learning rate:
--draft-lr 1e-5  # Increase

Issue: Draft Model Drift

Symptoms: Acceptance rate drops over time

Solutions:

  1. Enable online SFT:
--online-sft-interval 5
  2. Use EMA for draft updates:
--draft-ema-decay 0.99
  3. Reinitialize draft periodically:
--reinit-draft-interval 1000
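
For intuition, an exponential moving average blends the previous draft weights with the newly trained ones, so a decay of 0.99 moves the draft only 1% of the way toward each new update. The sketch below uses plain floats in a dict as a stand-in for tensors; the function name is hypothetical, and how miles applies `--draft-ema-decay` internally is not shown here.

```python
def ema_update(ema_weights, new_weights, decay=0.99):
    """Return decay * ema + (1 - decay) * new for every named weight."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * new_weights[name]
        for name in ema_weights
    }
```

A higher decay gives a smoother, slower-moving draft; a lower decay tracks the target more closely but passes along more noise.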

Issue: Speculative Training Slower Than Expected

Symptoms: Not achieving expected 25%+ speedup

Solutions:

  1. Verify that the draft model is small enough:
# Draft should be 1/10 to 1/4 the size of the target
  2. Check that lookahead is optimal:
--spec-lookahead 5  # Sweet spot for most models
  3. Profile to find the bottleneck:
--profile-speculative
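
A back-of-the-envelope estimate can show whether a given acceptance rate and lookahead can reach the expected speedup at all. The sketch below uses the standard speculative decoding analysis, which assumes each draft token is accepted independently with the same probability; `draft_cost_ratio` (draft forward cost divided by target forward cost) and the function names are assumptions for illustration, not miles options.

```python
def expected_tokens_per_step(acceptance_rate, lookahead):
    """Expected tokens produced per target forward pass.

    Uses (1 - a^(k+1)) / (1 - a) for acceptance rate a and lookahead k.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    if acceptance_rate == 1.0:
        return float(lookahead + 1)
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)

def estimated_speedup(acceptance_rate, lookahead, draft_cost_ratio):
    """Estimated speedup over plain decoding, ignoring all other overheads."""
    tokens = expected_tokens_per_step(acceptance_rate, lookahead)
    return tokens / (lookahead * draft_cost_ratio + 1.0)
```

If the estimate falls below 1.0 for the measured acceptance rate, a shorter lookahead or a cheaper draft model is likely needed before profiling further.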

Weight Synchronization Issues

Issue: Zero-Copy Sync Failures

Symptoms: Errors with CUDA IPC, weight corruption

Solutions:

  1. Verify CUDA IPC support:
nvidia-smi topo -m  # Check GPU topology
  2. Fall back to standard sync:
# Remove --use-zero-copy-sync
  3. Increase bucket size:
--sync-bucket-size 2147483648  # 2 GiB
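
The bucket size is given in bytes, so it is worth checking the arithmetic and estimating how many buckets a full sync would need. This sketch assumes weights are packed into fixed-size buckets and transferred in BF16 (2 bytes per parameter); both assumptions and the function name are illustrative.

```python
GIB = 1024 ** 3  # bytes in one GiB

def num_sync_buckets(num_params, bytes_per_param=2, bucket_bytes=2 * GIB):
    """Number of fixed-size buckets needed to transfer all parameters."""
    if bucket_bytes <= 0:
        raise ValueError("bucket_bytes must be positive")
    total_bytes = num_params * bytes_per_param
    return -(-total_bytes // bucket_bytes)  # integer ceiling division
```

For a model with 7 billion parameters in BF16, a 2 GiB bucket gives 7 buckets per sync.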

Issue: Slow Weight Sync Despite Zero-Copy

Symptoms: Weight sync still slow

Solutions:

  1. Use colocated mode:
--colocate
  2. Enable async weight transfer:
--async-weight-sync

MoE-Specific Issues

Issue: Expert Load Imbalance

Symptoms: Some experts heavily loaded, others unused

Solutions:

  1. Enable load balancing loss:
--aux-loss-coef 0.01
  2. Use capacity factor:
--moe-capacity-factor 1.25
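
Before tuning these options, it helps to measure how uneven the load actually is. The sketch below computes per-expert load fractions from a flat list of routed expert ids, a simple imbalance ratio, and the per-expert capacity as commonly defined for capacity-factor routing (capacity factor times tokens divided by experts). The function names and input layout are assumptions, not miles internals.

```python
import math
from collections import Counter

def expert_load_fractions(expert_ids, num_experts):
    """Fraction of routed tokens that went to each expert."""
    if not expert_ids:
        raise ValueError("expert_ids must not be empty")
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def load_imbalance(fractions):
    """Max load divided by the uniform load; 1.0 means perfectly balanced."""
    return max(fractions) * len(fractions)

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Max tokens an expert accepts before overflow tokens are dropped."""
    return math.ceil(capacity_factor * num_tokens / num_experts)
```

An imbalance ratio well above the capacity factor suggests many tokens would overflow, which is a sign that the auxiliary loss coefficient needs attention first.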

Issue: Expert Parallelism OOM

Symptoms: OOM with large MoE models

Solutions:

  1. Increase expert parallelism:
--expert-model-parallel-size 8  # Increase from 4
  2. Reduce batch size per GPU:
--micro-batch-size 1
  3. Enable expert offloading:
--offload-experts
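
The effect of raising expert parallelism can be estimated with simple division: each expert-parallel rank holds its share of the experts, so doubling the parallel size halves the experts (and their weights) per GPU. The sketch below assumes experts are split evenly across ranks; the function name and the example expert count are illustrative.

```python
def experts_per_rank(num_experts, expert_parallel_size):
    """Number of experts each expert-parallel rank holds."""
    if expert_parallel_size <= 0:
        raise ValueError("expert_parallel_size must be positive")
    if num_experts % expert_parallel_size != 0:
        raise ValueError("num_experts must be divisible by expert_parallel_size")
    return num_experts // expert_parallel_size
```

For a hypothetical model with 64 experts per layer, moving from a parallel size of 4 to 8 reduces the experts held per GPU from 16 to 8.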

Multi-Agent Issues

Issue: Co-Evolution Instability

Symptoms: Agents oscillate or one dominates

Solutions:

  1. Use alternating updates:
co_evolution:
  strategy: alternating
  2. Reduce co-evolution frequency:
--co-evolution-interval 20  # Increase interval from 10
  3. Add population diversity:
co_evolution:
  population_size: 4
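
One possible reading of an alternating strategy is that only one agent is trained at a time, with the active agent switching every fixed number of steps while the others stay frozen. The sketch below illustrates that schedule only; the function name is hypothetical, and it is an assumption that the interval plays this role in miles.

```python
def agent_to_update(step, num_agents=2, interval=10):
    """Index of the agent trained at `step` under a round-robin schedule.

    Each agent is trained for `interval` consecutive steps before the
    next agent takes over.
    """
    if num_agents <= 0 or interval <= 0:
        raise ValueError("num_agents and interval must be positive")
    return (step // interval) % num_agents
```

Holding the opponent fixed for a full interval gives each agent a stationary target, which tends to damp oscillation.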

Debugging Tips

Enable Verbose Logging

--log-level DEBUG
export MILES_DEBUG=1

Check FP8 Tensors

# Print each parameter's dtype to verify FP8 is active
# (assumes `model` is the already-loaded training model)
for name, param in model.named_parameters():
    print(f"{name}: {param.dtype}")

Profile Training

--profile
--profile-dir /path/to/profile

Verify R3 Is Working

# Check that expert routing is being recorded
# (assumes `samples` is a batch of rollout samples)
sample = samples[0]
assert sample.rollout_routed_experts is not None
assert len(sample.rollout_routed_experts) > 0

Monitor GPU Memory

watch -n 1 nvidia-smi
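
For logging memory from a script rather than watching it interactively, `nvidia-smi` can emit machine-readable CSV. The sketch below queries used and total memory per GPU and parses the result; the helper names are illustrative, and `query_gpu_memory` requires `nvidia-smi` on the PATH.

```python
import subprocess

def parse_memory_csv(text):
    """Parse 'used, total' rows (MiB) into a list of (used, total) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory():
    """Return (used_mib, total_mib) for every visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` once per training step and logging the peak makes it easier to see which phase (rollout, training, or weight sync) drives an OOM.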

Resources