miles Troubleshooting Guide
FP8 Training Issues
Issue: FP8 Training Collapse
Symptoms: Loss explodes, NaN values, reward collapses
Solutions:
- Use block scaling:
--fp8-recipe blockwise
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
- Enable R3 (Rollout Routing Replay) for MoE models:
--use-r3
- Reduce learning rate:
--lr 5e-7 # Reduce from 1e-6
- Warm up from BF16:
--warmup-steps 100
--warmup-precision bf16
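To illustrate why block scaling can help with collapse, here is a minimal sketch of per-block scale computation for an E4M3 cast. It is not the Transformer Engine or miles implementation: `block_scales` and the sample values are hypothetical, and the only external fact assumed is that the largest finite E4M3 magnitude is 448.

```python
# Illustrative only: per-block scaling factors for an FP8 E4M3 cast.
E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def block_scales(values, block_size):
    """Return one scale per block so that the block's max |x| maps to E4M3_MAX."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Guard against all-zero blocks so the scale is never zero.
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


# Hypothetical tensor: a block of small values next to a block with an outlier.
values = [0.5, -1.0, 0.25, 0.75, 100.0, -448.0, 2.0, 8.0]
per_block = block_scales(values, block_size=4)
per_tensor = block_scales(values, block_size=len(values))
print(per_block, per_tensor)
```

With a single per-tensor scale, the outlier dictates the scale for every element, so the small values in the first block use only a sliver of the FP8 range. With per-block scales, each block is scaled to its own maximum.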
Issue: FP8 vs BF16 Accuracy Gap
Symptoms: FP8 model underperforms BF16 baseline
Solutions:
- Use E4M3 format for activations:
--fp8-format e4m3
- Enable dynamic scaling:
--fp8-dynamic-scaling
- Skip sensitive layers:
--fp8-skip-layers "lm_head,embed"
Train-Inference Mismatch Issues
Issue: Policy Divergence
Symptoms: Model behavior differs between training and inference
Solutions:
- Enable Rollout Routing Replay:
--use-r3
- Use importance sampling correction:
--use-tis --tis-threshold 0.9
- Verify log probs match:
--verify-logprobs
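The checks above can be sketched offline if per-token log probabilities from both the training engine and the rollout engine are available. The helper names and the `cap` parameter below are hypothetical; in particular, `cap` is not claimed to have the same meaning as `--tis-threshold`.

```python
import math


def logprob_gap(train_logprobs, rollout_logprobs):
    """Return (max, mean) absolute difference between per-token log probs."""
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("sequences must have the same length")
    diffs = [abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]
    return max(diffs), sum(diffs) / len(diffs)


def truncated_is_weights(train_logprobs, rollout_logprobs, cap):
    """Per-token ratio pi_train / pi_rollout, clipped from above at `cap`."""
    return [
        min(math.exp(t - r), cap)
        for t, r in zip(train_logprobs, rollout_logprobs)
    ]


# Hypothetical log probs for a two-token response.
train = [-1.0, -2.0]
rollout = [-1.0, -3.0]
max_gap, mean_gap = logprob_gap(train, rollout)
weights = truncated_is_weights(train, rollout, cap=2.0)
print(max_gap, mean_gap, weights)
```

A gap near zero suggests the two engines agree. A large gap on a few tokens is the case that truncated importance weights are meant to keep from dominating the gradient.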
Issue: Expert Routing Mismatch (MoE)
Symptoms: Different experts activated during train vs inference
Solutions:
- Enable R3:
--use-r3
--r3-buffer-size 1000
- Use deterministic routing:
--deterministic-expert-routing
INT4 Training Issues
Issue: INT4 Accuracy Degradation
Symptoms: Worse performance than BF16 or FP8
Solutions:
- Increase group size:
--int4-group-size 256 # Increase from 128
- Use mixed precision for sensitive layers:
--int4-skip-layers "lm_head,embed,layer_norm"
- Warm start from BF16:
--warmup-steps 100
--warmup-precision bf16
- Increase learning rate (INT4 often needs higher LR):
--lr 2e-6 # Increase from 1e-6
Issue: INT4 OOM Despite Expected Savings
Symptoms: Still running out of memory with INT4
Solutions:
- Verify environment variable:
export OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1
- Check group size alignment:
# Group size must divide hidden dimension evenly
--int4-group-size 128 # Must divide hidden_size
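The alignment rule can be checked before launching a run. This is a minimal sketch; `check_group_size`, `valid_group_sizes`, and the candidate list are hypothetical helpers, not part of miles.

```python
def check_group_size(hidden_size, group_size):
    """Return True if group_size divides hidden_size evenly."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return hidden_size % group_size == 0


def valid_group_sizes(hidden_size, candidates=(32, 64, 128, 256)):
    """Filter candidate group sizes down to those that divide hidden_size."""
    return [g for g in candidates if check_group_size(hidden_size, g)]


print(check_group_size(4096, 128))  # hidden size that is a multiple of 128
print(valid_group_sizes(4000))      # hidden size that is not
```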
Speculative RL Issues
Issue: Low Acceptance Rate
Symptoms: Draft model tokens frequently rejected
Solutions:
- Reduce lookahead:
--spec-lookahead 3 # Reduce from 5
- Update draft more frequently:
--online-sft-interval 5 # Reduce from 10
- Increase draft learning rate:
--draft-lr 1e-5 # Increase
Issue: Draft Model Drift
Symptoms: Acceptance rate drops over time
Solutions:
- Enable online SFT:
--online-sft-interval 5
- Use EMA for draft updates:
--draft-ema-decay 0.99
- Reinitialize draft periodically:
--reinit-draft-interval 1000
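For intuition on the EMA option, the standard exponential moving average update is sketched below on plain lists of floats. How `--draft-ema-decay` is applied inside miles is not shown in this guide, so treat `ema_update` as a hypothetical illustration of the general technique.

```python
def ema_update(ema_weights, new_weights, decay=0.99):
    """Blend new weights into the running average: ema = decay*ema + (1-decay)*new."""
    if len(ema_weights) != len(new_weights):
        raise ValueError("weight lists must have the same length")
    return [decay * e + (1.0 - decay) * n for e, n in zip(ema_weights, new_weights)]


# With decay close to 1, a single update moves the average only slightly.
ema = [0.0, 1.0]
ema = ema_update(ema, [1.0, 1.0], decay=0.99)
half = ema_update([1.0], [0.0], decay=0.5)
print(ema, half)
```

A higher decay makes the draft change more slowly, which smooths out noisy updates at the cost of tracking the target policy with more lag.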
Issue: Speculative Training Slower Than Expected
Symptoms: Not achieving expected 25%+ speedup
Solutions:
- Verify draft model is small enough:
# Draft should be roughly 1/10 to 1/4 the size of the target model
- Check lookahead is optimal:
--spec-lookahead 5 # Sweet spot for most models
- Profile to find bottleneck:
--profile-speculative
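When tuning lookahead, a back-of-the-envelope estimate can help. Under the simplifying assumption that each draft token is accepted independently with the same probability, the expected number of tokens produced per verification step is (1 - a^(k+1)) / (1 - a) for acceptance rate a and lookahead k. The function name is hypothetical, and real acceptance rates are usually not independent across positions.

```python
def expected_tokens_per_step(acceptance_rate, lookahead):
    """Expected tokens per target-model step, assuming i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(lookahead + 1)
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)


for k in (3, 5):
    print(k, expected_tokens_per_step(0.5, k))
```

At low acceptance rates the gain from a larger lookahead flattens quickly, while the draft cost keeps growing with k, which is consistent with reducing lookahead when acceptance is poor.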
Weight Synchronization Issues
Issue: Zero-Copy Sync Failures
Symptoms: Errors with CUDA IPC, weight corruption
Solutions:
- Verify CUDA IPC support:
nvidia-smi topo -m # Check GPU topology
- Fall back to standard sync:
# Remove --use-zero-copy-sync
- Increase bucket size:
--sync-bucket-size 2147483648 # 2 GiB (2 * 1024^3 bytes)
Issue: Slow Weight Sync Despite Zero-Copy
Symptoms: Weight sync remains slow even with zero-copy enabled
Solutions:
- Use colocated mode:
--colocate
- Enable async weight transfer:
--async-weight-sync
MoE-Specific Issues
Issue: Expert Load Imbalance
Symptoms: Some experts heavily loaded, others unused
Solutions:
- Enable load balancing loss:
--aux-loss-coef 0.01
- Use capacity factor:
--moe-capacity-factor 1.25
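To quantify imbalance before and after applying these settings, a simple metric is the ratio of the busiest expert's load to the mean load. The sketch below also shows the common capacity formula (capacity factor times the average tokens per expert); both helpers are hypothetical, and the exact formula used by miles may differ.

```python
import math


def load_imbalance(tokens_per_expert):
    """Ratio of max to mean expert load; 1.0 means perfectly balanced."""
    total = sum(tokens_per_expert)
    if total == 0:
        raise ValueError("no tokens were routed")
    mean = total / len(tokens_per_expert)
    return max(tokens_per_expert) / mean


def expert_capacity(num_tokens, num_experts, capacity_factor, top_k=1):
    """Max tokens an expert accepts under the common capacity-factor rule."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


balanced = load_imbalance([10, 10, 10, 10])
collapsed = load_imbalance([40, 0, 0, 0])
capacity = expert_capacity(num_tokens=1024, num_experts=8, capacity_factor=1.25)
print(balanced, collapsed, capacity)
```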
Issue: Expert Parallelism OOM
Symptoms: OOM with large MoE models
Solutions:
- Increase expert parallelism:
--expert-model-parallel-size 8 # Increase from 4
- Reduce batch size per GPU:
--micro-batch-size 1
- Enable expert offloading:
--offload-experts
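The effect of a larger expert parallel size on memory can be estimated with simple arithmetic. This sketch assumes experts are split evenly across expert-parallel ranks and counts only expert weights (no optimizer state, activations, or shared layers); all names and numbers are hypothetical.

```python
def experts_per_rank(num_experts, expert_parallel_size):
    """Number of experts hosted on each expert-parallel rank."""
    if num_experts % expert_parallel_size != 0:
        raise ValueError("num_experts must be divisible by expert_parallel_size")
    return num_experts // expert_parallel_size


def expert_weight_bytes_per_rank(num_experts, expert_parallel_size,
                                 params_per_expert, bytes_per_param=2):
    """Expert weight memory per rank for one MoE layer (BF16 by default)."""
    local = experts_per_rank(num_experts, expert_parallel_size)
    return local * params_per_expert * bytes_per_param


# Hypothetical layer: 64 experts with 1M parameters each.
ep4 = expert_weight_bytes_per_rank(64, 4, 1_000_000)
ep8 = expert_weight_bytes_per_rank(64, 8, 1_000_000)
print(ep4, ep8)
```

Doubling the expert parallel size halves the expert weights held per GPU in this estimate, which is why it is the first knob to try for OOM on large MoE models.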
Multi-Agent Issues
Issue: Co-Evolution Instability
Symptoms: Agents oscillate or one dominates
Solutions:
- Use alternating updates:
co_evolution:
  strategy: alternating
- Reduce co-evolution frequency:
--co-evolution-interval 20 # Increase from 10
- Add population diversity:
co_evolution:
  population_size: 4
Debugging Tips
Enable Verbose Logging
--log-level DEBUG
export MILES_DEBUG=1
Check FP8 Tensors
# Verify FP8 is active
for name, param in model.named_parameters():
    print(f"{name}: {param.dtype}")
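For large models, printing every parameter is noisy, so a per-dtype count may be easier to read. The sketch below works on any iterable of `(name, param)` pairs where `param` has a `dtype` attribute, such as the output of `model.named_parameters()`; the stand-in objects are hypothetical so the example runs without a GPU. Depending on the setup, parameters may still report a higher-precision dtype even when FP8 is used for the matrix multiplications, so treat this check as a first look rather than proof.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_dtypes(named_params):
    """Count parameters by dtype from an iterable of (name, param) pairs."""
    return Counter(str(param.dtype) for _, param in named_params)


# Stand-in parameters; with a real model, pass model.named_parameters().
fake_params = [
    ("layers.0.weight", SimpleNamespace(dtype="torch.bfloat16")),
    ("layers.0.bias", SimpleNamespace(dtype="torch.bfloat16")),
    ("lm_head.weight", SimpleNamespace(dtype="torch.float32")),
]
counts = summarize_dtypes(fake_params)
for dtype, n in counts.most_common():
    print(f"{dtype}: {n}")
```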
Profile Training
--profile
--profile-dir /path/to/profile
Verify R3 Is Working
# Check routing is being recorded
sample = samples[0]
assert sample.rollout_routed_experts is not None
assert len(sample.rollout_routed_experts) > 0
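The check above inspects only the first sample. A slightly broader sketch scans the whole batch and reports which samples lack routing data. It assumes only the `rollout_routed_experts` attribute shown above; `missing_routing` and the stand-in samples are hypothetical.

```python
from types import SimpleNamespace


def missing_routing(samples):
    """Return indices of samples with no recorded expert routing."""
    return [
        i for i, s in enumerate(samples)
        if not getattr(s, "rollout_routed_experts", None)
    ]


# Stand-in samples: one with routing, one with None, one with an empty list.
fake_samples = [
    SimpleNamespace(rollout_routed_experts=[[0, 3]]),
    SimpleNamespace(rollout_routed_experts=None),
    SimpleNamespace(rollout_routed_experts=[]),
]
bad = missing_routing(fake_samples)
print(f"{len(bad)} of {len(fake_samples)} samples lack routing data: {bad}")
```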
Monitor GPU Memory
watch -n 1 nvidia-smi
Resources
- GitHub Issues: https://github.com/radixark/miles/issues
- Unified FP8 Blog: https://lmsys.org/blog/2025-11-25-fp8-rl/
- Train-Inference Mismatch Tutorial: https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/mismatch/blog-en.md
- SGLang Discord: Community support