nanobot b275af2b4d fix: dereference orchestra-skills submodule, add as plain files

2026-05-05 23:28:24 +02:00


miles Troubleshooting Guide

FP8 Training Issues

Issue: FP8 Training Collapse

Symptoms: Loss explodes or becomes NaN; reward collapses

Solutions:

Use block scaling:

--fp8-recipe blockwise
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1

Enable R3 (Rollout Routing Replay) for MoE models:

--use-r3

Reduce learning rate:

--lr 5e-7  # Reduce from 1e-6

Warm up from BF16:

--warmup-steps 100
--warmup-precision bf16
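To see why block scaling can help, the sketch below compares per-tensor and per-block scaling on a tensor with a single outlier. It is a toy illustration in pure Python: `fake_e4m3` is a crude stand-in for E4M3 rounding (3 mantissa bits, maximum 448, subnormal step 2^-9), and all helper names are hypothetical rather than miles or Transformer Engine APIs.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value
MIN_NORMAL_EXP = -6   # smallest normal exponent in E4M3


def fake_e4m3(x: float) -> float:
    """Crudely round x onto an E4M3-like grid (illustration only)."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    if x == 0.0:
        return 0.0
    exp = max(math.floor(math.log2(abs(x))), MIN_NORMAL_EXP)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per binade
    return round(x / step) * step


def quantize(values, block_size):
    """Scale each block so its max maps to E4M3_MAX, round, then rescale."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block) or 1.0
        scale = E4M3_MAX / amax
        out.extend(fake_e4m3(v * scale) / scale for v in block)
    return out


def mean_abs_error(values, block_size):
    """Mean absolute round-trip error for a given block size."""
    deq = quantize(values, block_size)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)


# 256 small activations, with one large outlier in the first block.
x = [1e-3] * 256
x[0] = 1e4

per_tensor_err = mean_abs_error(x, block_size=len(x))  # one scale for everything
per_block_err = mean_abs_error(x, block_size=128)      # one scale per 128 values
print(f"per-tensor: {per_tensor_err:.2e}, per-block: {per_block_err:.2e}")
```

With a single shared scale, the outlier pushes every small value below the smallest representable step, so they all round to zero. With per-block scales, only the block that contains the outlier is affected.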

Issue: FP8 vs BF16 Accuracy Gap

Symptoms: FP8 model underperforms BF16 baseline

Solutions:

Use E4M3 format for activations:

--fp8-format e4m3

Enable dynamic scaling:

--fp8-dynamic-scaling

Skip sensitive layers:

--fp8-skip-layers "lm_head,embed"
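The trade-off behind the format choice can be worked out from the bit layouts. The sketch below derives the largest finite value and the relative spacing of the two common FP8 formats, assuming the widely used definitions (E4M3 in its "FN" variant without infinities, E5M2 as an IEEE-like format). The function name is illustrative only.

```python
def fp8_properties(exp_bits: int, man_bits: int, ieee_like: bool):
    """Return (largest finite value, relative spacing between neighbours)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # E5M2: the top exponent code is reserved for inf/NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2.0 ** -man_bits
    else:
        # E4M3 (FN): only the all-ones mantissa at the top exponent is NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2.0 ** -(man_bits - 1)
    return max_mantissa * 2.0 ** max_exp, 2.0 ** -man_bits


e4m3_max, e4m3_spacing = fp8_properties(4, 3, ieee_like=False)
e5m2_max, e5m2_spacing = fp8_properties(5, 2, ieee_like=True)
print(f"E4M3: max={e4m3_max}, relative spacing={e4m3_spacing}")
print(f"E5M2: max={e5m2_max}, relative spacing={e5m2_spacing}")
```

E4M3 resolves values twice as finely as E5M2 but covers a much smaller range, which is why it is commonly preferred for activations and weights, while E5M2 is often reserved for gradients.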

Train-Inference Mismatch Issues

Issue: Policy Divergence

Symptoms: Model behavior differs between training and inference

Solutions:

Enable Rollout Routing Replay (R3):

--use-r3

Use truncated importance sampling (TIS) correction:

--use-tis --tis-threshold 0.9

Verify log probs match:

--verify-logprobs
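As a rough illustration of what these checks measure, the sketch below computes the per-token log-probability gap between the training and rollout engines and the corresponding truncated importance weights. The function names and the `cap` parameter are hypothetical; in particular, `cap` is not claimed to match how `--tis-threshold` is interpreted.

```python
import math


def logprob_gap(train_logprobs, rollout_logprobs):
    """Mean and max absolute per-token gap between the two engines."""
    gaps = [abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]
    return sum(gaps) / len(gaps), max(gaps)


def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratio pi_train / pi_rollout, truncated from above at `cap`."""
    return [min(math.exp(t - r), cap)
            for t, r in zip(train_logprobs, rollout_logprobs)]


# Made-up log probs for four sampled tokens; the third disagrees strongly.
train = [-1.20, -0.35, -2.10, -0.05]
rollout = [-1.25, -0.35, -3.40, -0.06]

mean_gap, max_gap = logprob_gap(train, rollout)
weights = truncated_is_weights(train, rollout, cap=2.0)
print(f"mean gap={mean_gap:.3f}, max gap={max_gap:.3f}")
print("weights:", [round(w, 3) for w in weights])
```

A large maximum gap on a few tokens, as in the third position here, is the kind of mismatch that truncation keeps from dominating the policy gradient.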

Issue: Expert Routing Mismatch (MoE)

Symptoms: Different experts activated during train vs inference

Solutions:

Enable R3:

--use-r3
--r3-buffer-size 1000

Use deterministic routing:

--deterministic-expert-routing
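One way to quantify this symptom is to compare the expert indices chosen for each token in the two passes. The sketch below is a minimal, framework-free version; `routing_agreement` and the list-of-lists layout are assumptions for illustration, not a miles data structure.

```python
def routing_agreement(rollout_experts, train_experts):
    """Fraction of tokens whose top-k expert sets match exactly."""
    if len(rollout_experts) != len(train_experts):
        raise ValueError("both passes must cover the same tokens")
    matches = sum(set(r) == set(t)
                  for r, t in zip(rollout_experts, train_experts))
    return matches / len(rollout_experts)


# Top-2 expert indices per token, as recorded in each pass (made-up data).
rollout = [[0, 3], [1, 2], [5, 7], [4, 6]]
train = [[3, 0], [1, 2], [5, 6], [4, 6]]

agreement = routing_agreement(rollout, train)
print(f"routing agreement: {agreement:.2f}")  # 3 of 4 tokens agree -> 0.75
```

Order within a token's expert list is ignored, since only the set of activated experts determines which weights are used.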

INT4 Training Issues

Issue: INT4 Accuracy Degradation

Symptoms: Worse performance than BF16 or FP8

Solutions:

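Tune the group size (smaller groups give finer-grained scales and usually lower quantization error; larger groups reduce scale overhead):

--int4-group-size 256  # Changed from 128; if accuracy does not improve, also try 64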

Use mixed precision for sensitive layers:

--int4-skip-layers "lm_head,embed,layer_norm"

Warm start from BF16:

--warmup-steps 100
--warmup-precision bf16

Increase learning rate (INT4 often needs higher LR):

--lr 2e-6  # Increase from 1e-6

Issue: INT4 OOM Despite Expected Savings

Symptoms: Still running out of memory with INT4

Solutions:

Verify environment variable:

export OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1

Check group size alignment:

# hidden_size must be a multiple of the group size
--int4-group-size 128
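The alignment rule is simple arithmetic and can be checked before launching a job. The helper names and the candidate list below are illustrative assumptions.

```python
def check_group_size(hidden_size: int, group_size: int) -> int:
    """Return the number of groups per row, or raise if misaligned."""
    if hidden_size % group_size != 0:
        raise ValueError(
            f"group size {group_size} does not divide hidden_size {hidden_size}"
        )
    return hidden_size // group_size


def valid_group_sizes(hidden_size: int, candidates=(32, 64, 128, 256, 512)):
    """Filter candidate group sizes down to those that divide hidden_size."""
    return [g for g in candidates if hidden_size % g == 0]


print(check_group_size(4096, 128))  # 32 groups per row
print(valid_group_sizes(2880))      # [32, 64]
```

A hidden size such as 2880 is not a multiple of 128, so the common default of 128 would be misaligned there.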

Speculative RL Issues

Issue: Low Acceptance Rate

Symptoms: Draft model tokens frequently rejected

Solutions:

Reduce lookahead:

--spec-lookahead 3  # Reduce from 5

Update draft more frequently:

--online-sft-interval 5  # Reduce from 10

Increase draft learning rate:

--draft-lr 1e-5  # Increase from your current value

Issue: Draft Model Drift

Symptoms: Acceptance rate drops over time

Solutions:

Enable online SFT:

--online-sft-interval 5

Use EMA for draft updates:

--draft-ema-decay 0.99

Reinitialize draft periodically:

--reinit-draft-interval 1000
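For intuition about the EMA option, the sketch below shows a generic exponential moving average over weights stored as plain lists. It assumes the usual definition `ema = decay * ema + (1 - decay) * new` and is not taken from the miles implementation.

```python
def ema_update(ema_weights, new_weights, decay=0.99):
    """Blend freshly trained draft weights into a slowly moving average."""
    return {
        name: [decay * e + (1.0 - decay) * n
               for e, n in zip(ema_weights[name], new_weights[name])]
        for name in ema_weights
    }


ema = {"proj": [1.0, 0.0]}
fresh = {"proj": [0.0, 1.0]}

ema = ema_update(ema, fresh, decay=0.99)
print(ema)  # each entry moves 1% of the way toward the fresh weights
```

A decay close to 1 makes the draft change slowly, which damps noisy updates but also slows how fast the draft tracks the target policy.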

Issue: Speculative Training Slower Than Expected

Symptoms: Not achieving expected 25%+ speedup

Solutions:

Verify draft model is small enough:

# Draft should be roughly 1/10 to 1/4 the size of the target

Check lookahead is optimal:

--spec-lookahead 5  # Sweet spot for most models

Profile to find bottleneck:

--profile-speculative
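The interaction between acceptance rate, lookahead, and draft size can be estimated with the standard speculative decoding analysis. The sketch below assumes each draft token is accepted independently with probability `accept_rate` and that one draft forward pass costs `draft_cost_ratio` of a target pass. It covers generation only, not the full training step, and the function names are illustrative.

```python
def expected_tokens_per_step(accept_rate: float, lookahead: int) -> float:
    """Expected tokens produced per target-model verification pass."""
    if accept_rate >= 1.0:
        return float(lookahead + 1)
    return (1.0 - accept_rate ** (lookahead + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, lookahead: int,
                      draft_cost_ratio: float) -> float:
    """Generation speedup over plain decoding under the assumptions above."""
    cost_per_step = lookahead * draft_cost_ratio + 1.0
    return expected_tokens_per_step(accept_rate, lookahead) / cost_per_step


# Sweep lookahead for a draft that costs 10% of the target per token.
for k in range(1, 9):
    print(k, round(estimated_speedup(0.7, k, 0.1), 3))
```

Under these assumptions the speedup flattens and then drops as lookahead grows, and a low acceptance rate or an expensive draft can push it below 1.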

Weight Synchronization Issues

Issue: Zero-Copy Sync Failures

Symptoms: Errors with CUDA IPC, weight corruption

Solutions:

Verify CUDA IPC support:

nvidia-smi topo -m  # Check GPU topology

Fall back to standard sync:

# Remove --use-zero-copy-sync

Increase bucket size:

--sync-bucket-size 2147483648  # 2 GiB
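To reason about how the bucket size affects the number of transfers, here is a minimal greedy bucketing sketch. The function and the example tensor sizes are hypothetical and do not reflect how miles actually packs weights.

```python
GIB = 1 << 30


def bucket_params(param_sizes, bucket_bytes):
    """Greedily group (name, nbytes) pairs into buckets of at most bucket_bytes.

    A tensor larger than bucket_bytes ends up alone in its own bucket.
    """
    buckets, current, used = [], [], 0
    for name, nbytes in param_sizes:
        if current and used + nbytes > bucket_bytes:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        buckets.append(current)
    return buckets


params = [
    ("embed", 3 * GIB // 2),
    ("layer0", 3 * GIB // 4),
    ("layer1", 3 * GIB // 4),
    ("lm_head", 3 * GIB // 2),
]
buckets = bucket_params(params, bucket_bytes=2 * GIB)
print(buckets)
```

Larger buckets mean fewer, bigger transfers, at the cost of more staging memory per transfer.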

Issue: Slow Weight Sync Despite Zero-Copy

Symptoms: Weight sync still slow

Solutions:

Use colocated mode:

--colocate

Enable async weight transfer:

--async-weight-sync

MoE-Specific Issues

Issue: Expert Load Imbalance

Symptoms: Some experts heavily loaded, others unused

Solutions:

Enable load balancing loss:

--aux-loss-coef 0.01

Use capacity factor:

--moe-capacity-factor 1.25
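For reference, the sketch below shows a Switch-Transformer-style auxiliary loss and the usual capacity formula. It is a simplified, top-1 illustration with assumed names; the loss that `--aux-loss-coef` actually scales in miles may be defined differently.

```python
import math


def load_balance_loss(expert_assignments, router_probs, num_experts, coef=0.01):
    """coef * N * sum_i(f_i * P_i), where f_i is the fraction of tokens sent
    to expert i and P_i is the mean router probability for expert i."""
    num_tokens = len(expert_assignments)
    frac_tokens = [expert_assignments.count(e) / num_tokens
                   for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / num_tokens
                 for e in range(num_experts)]
    return coef * num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


def expert_capacity(num_tokens, num_experts, capacity_factor=1.25, top_k=1):
    """Maximum tokens an expert accepts before overflow handling applies."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


balanced = load_balance_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)
collapsed = load_balance_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, num_experts=4)
print(balanced, collapsed)
print(expert_capacity(num_tokens=1024, num_experts=8))
```

The loss is smallest when routing is uniform and grows as tokens concentrate on a few experts, which is what pushes the router toward balance.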

Issue: Expert Parallelism OOM

Symptoms: OOM with large MoE models

Solutions:

Increase expert parallelism:

--expert-model-parallel-size 8  # Increase from 4

Reduce batch size per GPU:

--micro-batch-size 1

Enable expert offloading:

--offload-experts
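A back-of-the-envelope estimate shows why raising expert parallelism helps. The sketch below counts expert weight bytes only (no optimizer state, activations, or shared layers) and uses made-up model dimensions; the function name is hypothetical.

```python
def expert_weight_bytes_per_gpu(num_experts, params_per_expert, ep_size,
                                bytes_per_param=2):
    """Expert weight memory held by one GPU for a single MoE layer."""
    if num_experts % ep_size != 0:
        raise ValueError("ep_size must divide num_experts")
    return (num_experts // ep_size) * params_per_expert * bytes_per_param


# Hypothetical layer: 64 experts of 50M parameters each, stored in BF16.
for ep in (4, 8):
    gib = expert_weight_bytes_per_gpu(64, 50_000_000, ep) / 2 ** 30
    print(f"EP={ep}: {gib:.2f} GiB of expert weights per layer per GPU")
```

Doubling the expert-parallel size halves the expert weights each GPU holds, though total memory also depends on terms this estimate ignores.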

Multi-Agent Issues

Issue: Co-Evolution Instability

Symptoms: Agents oscillate or one dominates

Solutions:

Use alternating updates:

co_evolution:
  strategy: alternating

Reduce co-evolution frequency:

--co-evolution-interval 20  # Increase the interval from 10 so co-evolution runs less often

Add population diversity:

co_evolution:
  population_size: 4
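As a conceptual picture of alternating updates, the sketch below picks which agent trains at each step while the others stay frozen. It is a hypothetical schedule for illustration and is not claimed to be how `strategy: alternating` is implemented.

```python
def agent_to_update(step: int, num_agents: int, interval: int) -> int:
    """Index of the agent that trains at `step`; all others are frozen."""
    return (step // interval) % num_agents


# Two agents, switching every 20 steps.
schedule = [agent_to_update(s, num_agents=2, interval=20) for s in (0, 19, 20, 39, 40)]
print(schedule)
```

Training one agent at a time gives each a stationary opponent for a whole interval, which tends to reduce oscillation.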

Debugging Tips

Enable Verbose Logging

--log-level DEBUG
export MILES_DEBUG=1

Check FP8 Tensors

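# List parameter dtypes (assumes `model` is the loaded torch.nn.Module).
# Note: weights may still report bfloat16 if FP8 casting happens inside the layers.
for name, param in model.named_parameters():
    print(f"{name}: {param.dtype}")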

Profile Training

--profile
--profile-dir /path/to/profile

Verify R3 Is Working

# Check routing is being recorded (assumes `samples` holds the rollout samples)
sample = samples[0]
assert sample.rollout_routed_experts is not None, "no routing recorded"
assert len(sample.rollout_routed_experts) > 0, "routing record is empty"

Monitor GPU Memory

watch -n 1 nvidia-smi

Resources

GitHub Issues: https://github.com/radixark/miles/issues
Unified FP8 Blog: https://lmsys.org/blog/2025-11-25-fp8-rl/
Train-Inference Mismatch Tutorial: https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/mismatch/blog-en.md
SGLang Discord: Community support
