🧮 MoE Memory Estimator for Megatron-LM
Estimate the GPU memory required to train Mixture-of-Experts (MoE) models with Megatron-LM. The estimates are based on formulas validated against actual training runs.
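At a high level, the per-GPU footprint during training is the sum of model weights, gradients, optimizer states, and activations. The snippet below is a minimal sketch of that breakdown, assuming bf16 weights and gradients with fp32 Adam states and the distributed optimizer sharding those states across data-parallel ranks; it is an approximation, not the exact formulas this tool implements.

```python
def rough_training_memory_gib(params_per_gpu: int,
                              activation_bytes_per_gpu: int,
                              data_parallel_size: int,
                              use_distributed_optimizer: bool = True) -> float:
    """Back-of-the-envelope per-GPU training memory in GiB (a sketch, not the
    estimator's exact formulas)."""
    weights = 2 * params_per_gpu        # bf16 weights
    grads = 2 * params_per_gpu          # bf16 gradients
    optimizer = 12 * params_per_gpu     # fp32 master weights + Adam m and v
    if use_distributed_optimizer:       # --use-distributed-optimizer shards the
        optimizer /= data_parallel_size # optimizer states across DP ranks
    return (weights + grads + optimizer + activation_bytes_per_gpu) / 1024**3
```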
⚙️ Configuration
Example:
```bash
torchrun --nproc_per_node 4 --nnodes 1 \
    pretrain_qwen.py \
    --num-layers 16 \
    --hidden-size 2048 \
    --num-attention-heads 32 \
    --num-query-groups 4 \
    --num-experts 128 \
    --moe-ffn-hidden-size 768 \
    --micro-batch-size 1 \
    --seq-length 4096 \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --expert-model-parallel-size 4 \
    --use-distributed-optimizer
```
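For the example above, the per-GPU weight footprint can be sketched directly from the flags. The numbers below rest on assumptions that are not encoded in the command line (SwiGLU-style gated expert MLPs, no shared experts, untied embeddings, biases and layer norms ignored, and a hypothetical vocabulary size of 151,936), so treat it as an approximation rather than the estimator's own arithmetic:

```python
# Back-of-the-envelope parameter count for the example command above.
num_layers, hidden = 16, 2048
num_heads, num_query_groups = 32, 4
num_experts, moe_ffn_hidden = 128, 768
expert_parallel_size = 4            # --expert-model-parallel-size
vocab_size = 151_936                # assumed; not part of the flags above

head_dim = hidden // num_heads
kv_hidden = num_query_groups * head_dim

# Grouped-query attention: Q and output projections are square, K/V are shrunk.
attn = 2 * hidden * hidden + 2 * hidden * kv_hidden

# Gated expert MLP: three weight matrices per expert; the router is replicated.
experts = num_experts * 3 * hidden * moe_ffn_hidden
router = hidden * num_experts

per_layer = attn + experts + router
embeddings = 2 * vocab_size * hidden            # input embedding + LM head

total = num_layers * per_layer + embeddings
# With TP=1 and PP=1, only the expert weights are sharded across GPUs (EP=4).
per_gpu = num_layers * (attn + router + experts / expert_parallel_size) + embeddings

print(f"total parameters : {total / 1e9:.2f} B")
print(f"params per GPU   : {per_gpu / 1e9:.2f} B "
      f"(bf16 weights ~ {per_gpu * 2 / 1024**3:.1f} GiB)")
```

With tensor and pipeline parallelism disabled, only the expert weights are partitioned across the four expert-parallel ranks; the attention, router, and embedding weights are replicated on every GPU.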
Or configure parameters manually: