LLM Training Memory Visualizer

🔧 Built by Ruben Aghayan


This calculator estimates a coarse upper bound on the memory used per GPU during training (excluding intermediates).

How to Use

  1. Use a preset, or adjust the parallelism, model, and training panels to match your run.
  2. Press Calculate to refresh the memory breakdown chart.
  3. Review the details and references below for context on the estimates.

Parallelism

FSDP Strategy

Model Architecture

Presets

Training Config

Precision
Parameter Dtype
Reduce Dtype
Memory Usage Breakdown

Details

Key Assumptions:

  • Standard transformer architecture with homogeneous layers
  • Adam optimizer
  • Mixed precision keeps an FP32 master copy of the weights
  • Tensor parallelism includes sequence parallelism
  • Pipeline parallelism maintains consistent activation memory due to its schedule
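Under these assumptions, the persistent per-GPU training state can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation; the byte counts per dtype and the fully sharded FSDP model are assumptions, and activations and intermediates are excluded:

```python
def estimate_per_gpu_bytes(
    n_params: int,
    fsdp_degree: int = 1,
    param_bytes: int = 2,   # assumed BF16 working parameters
    grad_bytes: int = 2,    # assumed BF16 gradients
) -> int:
    """Coarse upper bound on persistent per-GPU training state:
    parameters, gradients, Adam states, and FP32 master weights."""
    master_bytes = 4        # FP32 master copy kept under mixed precision
    adam_bytes = 4 + 4      # FP32 momentum + variance (Adam)
    per_param = param_bytes + grad_bytes + master_bytes + adam_bytes
    # Assumes FSDP shards all of this state evenly across the group.
    return n_params * per_param // fsdp_degree

# Example: a 7B-parameter model fully sharded over 8 GPUs
gib = estimate_per_gpu_bytes(7_000_000_000, fsdp_degree=8) / 2**30
```

The 16 bytes per parameter before sharding matches the common rule of thumb for Adam with mixed precision; the real calculator accounts for more configuration dimensions than this sketch.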

Not Currently Supported:

  • Non-standard architectures (alternating dense/sparse layers, custom attention)
  • Multi-modal models with vision layers
  • Non-homogeneous parameter dtypes (e.g., BF16 & MXFP4 in GPT-OSS); standard mixed precision is supported
  • Kernel/framework overhead and intermediate memory

For advanced configurations, results should be validated against profiling.

Motivation

Existing tools like the Hugging Face Model Memory Estimator, DeepSpeed Calculator, and DeepSpeed Native Utility are valuable but don't support the full range of modern training configurations.

This tool adds:

  • Arbitrary model configurations beyond preset architectures
  • FSDP and 5D parallelism support
  • Interactive memory breakdowns by category to inform configuration decisions


Validation

I validated this calculator against the projected memory usage in The Ultra-Scale Playbook, matching within 10%. Some overage is expected, since the calculator makes pessimistic assumptions and targets peak memory. Note that you could still OOM from intermediates! I welcome detailed memory usage reports, along with configurations and framework details, to tune this further.