Platform For AI: Estimate GPU memory for Large Language Models

Last Updated: May 27, 2026

Learn what factors determine GPU memory requirements for LLM deployment and fine-tuning, and how to estimate the memory needed.

Simple GPU memory estimator

Note
  • These estimates use general calculation methods. Actual GPU memory usage may differ due to model-specific network structures and algorithms.

  • For MoE models like DeepSeek-R1-671B, all 671B parameters must be loaded, but only 37B are activated during inference. Use the activated parameter count when calculating activation memory.

  • During fine-tuning, parameters, activations, and gradients typically use 16-bit storage. The Adam/AdamW optimizer stores states in 32-bit format.

Inference

| Scenario | Required GPU Memory (GB) |
| --- | --- |
| Inference (16-bit) | - |
| Inference (8-bit) | - |
| Inference (4-bit) | - |

Fine-tuning

| Scenario | Required GPU Memory (GB) |
| --- | --- |
| Full-parameter fine-tuning | - |
| LoRA fine-tuning | - |
| QLoRA (8-bit) fine-tuning | - |
| QLoRA (4-bit) fine-tuning | - |
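
The MoE note above can be sketched in a few lines of Python. This is only an illustration: the function name `moe_inference_memory_gb` is hypothetical, 16-bit storage is an assumption (check the precision your checkpoint really uses), and the 10% activation ratio is the rule of thumb this topic applies to 7B models, not a measured value.

```python
GB = 1e9  # decimal gigabytes, matching 7B parameters x 2 bytes = 14 GB


def moe_inference_memory_gb(total_params, active_params,
                            param_bytes=2.0, activation_ratio=0.10):
    """Return (weights_gb, activations_gb) for an MoE model.

    Assumption: activations are about 10% of the memory that the
    *activated* parameters would occupy.
    """
    # Every expert must be resident, so weights use the total parameter count.
    weights_gb = total_params * param_bytes / GB
    # Only the activated parameters take part in a forward pass.
    activations_gb = active_params * param_bytes * activation_ratio / GB
    return weights_gb, activations_gb


# DeepSeek-R1-671B: 671B total parameters, 37B activated during inference.
weights, activations = moe_inference_memory_gb(671e9, 37e9)
print(f"weights: {weights:.0f} GB, activations: {activations:.1f} GB")
```

The weight term dominates, which is why MoE models still need GPU memory for the full parameter count even though only a fraction of it is activated per token.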

GPU memory components for inference

GPU memory for inference consists of:

Model parameters

Model parameters must reside in GPU memory during inference. The formula is: parameter count × parameter precision. Common precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). LLMs typically use FP16 or BF16. For example, a 7B FP16 model requires 7 × 10^9 × 2 bytes = 14 GB.
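
As a quick check of the parameter formula, the sketch below evaluates it for several precisions. The names `BYTES_PER_PARAM` and `param_memory_gb` are illustrative, 1 GB is taken as 10^9 bytes, and the 8-bit and 4-bit rows ignore any extra metadata (such as scaling factors) that a real quantization scheme may store.

```python
GB = 1e9  # decimal gigabytes

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def param_memory_gb(param_count, precision):
    """Parameter memory = parameter count x bytes per parameter."""
    return param_count * BYTES_PER_PARAM[precision] / GB


for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: {param_memory_gb(7e9, precision):.1f} GB")
```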

Activation values

During inference, activation values for each layer must be calculated. Memory usage scales with batch size, sequence length, and model architecture (layer count and hidden size):

$$\text{Activation memory} = b \times s \times h \times L \times \text{param\_bytes}$$

Where:

  • b (batch size): Batch size per request. Typically 1 for online services, higher for batch processing.

  • s (sequence length): Total token count (input + output).

  • h (hidden size): Hidden layer dimension.

  • L (Layers): Number of Transformer layers.

  • param_bytes: Storage precision for activations, typically 2 bytes.

In practice, for a 7B model with b=1, s=2048, and param_bytes=2, activation memory is approximately 10% of the model parameter memory (14 GB for FP16): 14 GB × 10% = 1.4 GB.
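
The product of the factors listed above can be evaluated directly. In this sketch, `h=4096` and `num_layers=32` are assumed values typical of 7B-class models; they are not given in this topic, so substitute the numbers from your model's configuration. With these assumptions the product comes out below the 10% rule of thumb, which suggests the rule is the more conservative figure for capacity planning.

```python
GB = 1e9  # decimal gigabytes


def activation_memory_gb(b, s, h, num_layers, param_bytes=2):
    """Activation memory as the product b x s x h x L x param_bytes."""
    return b * s * h * num_layers * param_bytes / GB


# Assumed architecture for a 7B-class model (hypothetical values).
formula_gb = activation_memory_gb(b=1, s=2048, h=4096, num_layers=32)
rule_of_thumb_gb = 0.10 * 14.0  # 10% of FP16 parameter memory

print(f"product of factors: {formula_gb:.2f} GB")
print(f"10% rule of thumb:  {rule_of_thumb_gb:.2f} GB")
```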

KV cache

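LLMs cache the Key (K) and Value (V) matrices for each Transformer layer to avoid recomputing attention for all historical tokens. For a sequence of n tokens, this reduces the attention computation for each newly generated token from $O(n^2)$ to $O(n)$, which significantly improves inference speed. KV cache memory scales with batch size, sequence length, concurrency, and model architecture:

$$\text{KV cache memory} = 2 \times b \times s \times h \times L \times C \times \text{param\_bytes}$$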

Where:

  • 2: Two matrices stored (Key and Value).

  • b (batch size): Batch size per request. Typically 1 for online services, higher for batch processing.

  • s (sequence length): Total token count (input + output).

  • h (hidden size): Hidden layer dimension.

  • L (Layers): Number of Transformer layers.

  • C (Concurrency): Number of concurrent requests.

  • param_bytes: Storage precision, typically 2 bytes.

In practice, for a 7B model with C=1, b=1, s=2048, and param_bytes=2, KV cache memory is approximately 10% of the model parameter memory (14 GB for FP16): 14 GB × 10% = 1.4 GB.
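
Because concurrency multiplies the KV cache directly, it helps to see how the estimate grows with the number of concurrent requests. The sketch below multiplies the factors defined above; `h=4096` and `num_layers=32` are again assumed values for a 7B-class model, and the function name is illustrative.

```python
GB = 1e9  # decimal gigabytes


def kv_cache_memory_gb(b, s, h, num_layers, concurrency, param_bytes=2):
    """KV cache memory: 2 matrices (K and V) x b x s x h x L x C x param_bytes."""
    return 2 * b * s * h * num_layers * concurrency * param_bytes / GB


# Assumed 7B-class architecture; only the concurrency changes between rows.
for c in (1, 4, 16):
    size = kv_cache_memory_gb(b=1, s=2048, h=4096, num_layers=32, concurrency=c)
    print(f"C={c:>2}: {size:.2f} GB")
```

At higher concurrency the cache can exceed the parameter memory itself, so it is worth estimating separately for each serving configuration.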

Other factors

Input data, the CUDA context and kernels, and the deep learning framework (PyTorch or TensorFlow) typically consume an additional 1 to 2 GB of GPU memory.

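Minimum GPU memory for 7B LLM inference: 14 GB (model parameters) + 1.4 GB (activation values) + 1.4 GB (KV cache) + 2 GB (other) = 18.8 GB.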
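
The inference components can be added up with a small helper. This is a minimal sketch that applies the 10% rules of thumb from this section and assumes 2 GB for other overhead; `inference_memory_gb` and its parameter names are illustrative.

```python
GB = 1e9  # decimal gigabytes


def inference_memory_gb(param_count, param_bytes=2.0, activation_ratio=0.10,
                        kv_cache_ratio=0.10, other_gb=2.0):
    """Return a per-component breakdown plus the total, in GB."""
    weights = param_count * param_bytes / GB
    breakdown = {
        "parameters": weights,
        "activations": weights * activation_ratio,  # ~10% rule of thumb
        "kv_cache": weights * kv_cache_ratio,       # ~10% rule of thumb
        "other": other_gb,                          # framework and CUDA overhead
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


for name, value in inference_memory_gb(7e9).items():
    print(f"{name:<12}{value:6.1f} GB")
```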

GPU memory components for fine-tuning

GPU memory for fine-tuning consists of:

Model parameters

Model parameters must reside in GPU memory during fine-tuning. The formula is: parameter count × parameter precision. Common precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). LLMs typically use FP16 or BF16. For example, a 7B FP16 model requires 7 × 10^9 × 2 bytes = 14 GB.

Gradient parameters

During backpropagation, gradients are computed for each trainable parameter. LLMs typically store gradients in 2-byte precision. GPU memory for gradients varies by fine-tuning method:

| Fine-tuning method | Training mechanism | Scenarios | GPU memory for gradients when fine-tuning a 7B model (2-byte storage; LoRA and QLoRA calculated with 1% of parameters) |
| --- | --- | --- | --- |
| Full-parameter fine-tuning | The parameters to be trained are the same as the model's own parameters. | High-precision requirements with sufficient computing power | 14 GB |
| LoRA (low-rank adaptation) | Freezes the original model parameters and trains only low-rank matrices. The number of trainable parameters depends on the model structure and the size of the low-rank matrices, and typically accounts for about 0.1% to 1% of the total model parameters. | Low-resource adaptation for specific tasks | 0.14 GB |
| QLoRA (quantization + LoRA) | Compresses the pre-trained model to 4-bit or 8-bit, fine-tunes it with LoRA, and introduces double quantization and paged optimizers to further reduce GPU memory usage. The trainable parameters typically account for about 0.1% to 1% of the total model parameters. | Fine-tuning for ultra-large models | 0.14 GB |
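
The gradient figures in the table follow from one multiplication: trainable parameter count × gradient precision. The sketch below reproduces them; `gradient_memory_gb` is an illustrative name, and the 1% trainable fraction is the value the table uses for LoRA and QLoRA.

```python
GB = 1e9  # decimal gigabytes


def gradient_memory_gb(param_count, trainable_fraction, grad_bytes=2):
    """Gradients are stored only for trainable parameters."""
    return param_count * trainable_fraction * grad_bytes / GB


# Fraction of the 7B parameters that is trained under each method.
methods = {"Full-parameter": 1.0, "LoRA (1%)": 0.01, "QLoRA (1%)": 0.01}

for name, fraction in methods.items():
    print(f"{name:<16}{gradient_memory_gb(7e9, fraction):6.2f} GB")
```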

Optimizer states

Optimizer states must be stored during training. Mixed-precision training stores parameters and gradients in 2 bytes but optimizer states in 4 bytes, because the limited precision of FP16/BF16 (and the limited dynamic range of FP16) can cause numerical instability. Common optimizers:

The last three columns show the GPU memory required for optimizer states when fine-tuning a 7B model (4-byte storage).

| Optimizer type | Parameter update mechanism | Additional storage per trainable parameter | Scenarios | Full-parameter fine-tuning | LoRA fine-tuning (calculated with 1% parameters) | QLoRA fine-tuning (calculated with 1% parameters) |
| --- | --- | --- | --- | --- | --- | --- |
| SGD | Uses only current gradients | 0 (no additional states) | Small models or experiments | 0 | 0 | 0 |
| SGD + Momentum | With momentum term | 1 floating-point number (momentum) | Better stability | 28 GB | 0.28 GB | 0.28 GB |
| RMSProp | Adaptive learning rate | 1 floating-point number (second moment) | Non-convex optimization | 28 GB | 0.28 GB | 0.28 GB |
| Adam/AdamW | Momentum + adaptive learning rate | 2 floating-point numbers (first + second moment) | Common for Large Language Models (LLMs) | 56 GB | 0.56 GB | 0.56 GB |
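
The optimizer table can be reproduced by multiplying the trainable parameter count by the number of extra states per parameter and by 4 bytes. The following is a sketch with illustrative names (`STATES_PER_PARAM`, `optimizer_state_memory_gb`); the state counts come from the table above.

```python
GB = 1e9  # decimal gigabytes

# Extra floating-point values kept per trainable parameter.
STATES_PER_PARAM = {"sgd": 0, "sgd_momentum": 1, "rmsprop": 1, "adam": 2}


def optimizer_state_memory_gb(param_count, optimizer,
                              trainable_fraction=1.0, state_bytes=4):
    """Optimizer memory = trainable params x states per param x bytes per state."""
    trainable = param_count * trainable_fraction
    return trainable * STATES_PER_PARAM[optimizer] * state_bytes / GB


for optimizer in STATES_PER_PARAM:
    full = optimizer_state_memory_gb(7e9, optimizer)
    lora = optimizer_state_memory_gb(7e9, optimizer, trainable_fraction=0.01)
    print(f"{optimizer:<14}full: {full:5.1f} GB   LoRA/QLoRA (1%): {lora:.2f} GB")
```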

Activation values

Forward propagation generates intermediate activations needed for gradient computation during backpropagation. Memory scales with batch size, sequence length, and model architecture:

$$\text{Activation memory} = b \times s \times h \times L \times \text{param\_bytes}$$

Where:

  • b (batch size): Training batch size.

  • s (sequence length): Total token count (input + output).

  • h (hidden size): Hidden layer dimension.

  • L (Layers): Number of Transformer layers.

  • param_bytes: Storage precision for activations, typically 2 bytes.

In practice, for a 7B model with b=1, s=2048, and param_bytes=2, activation memory is approximately 10% of the model parameter memory (14 GB for FP16): 14 GB × 10% = 1.4 GB.

Other factors

Input data, the CUDA context and kernels, and the deep learning framework (PyTorch or TensorFlow) typically consume an additional 1 to 2 GB of GPU memory.

Approximate GPU memory for fine-tuning a 7B LLM:

| Fine-tuning method | GPU memory required for the model itself | GPU memory required for gradients | Adam optimizer states | Activation values | Other | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Full-parameter fine-tuning | 14 GB | 14 GB | 56 GB | 1.4 GB | 2 GB | 87.4 GB |
| LoRA (low-rank adaptation) | 14 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 18.1 GB |
| QLoRA (8-bit quantization + LoRA) | 7 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 11.1 GB |
| QLoRA (4-bit quantization + LoRA) | 3.5 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 7.6 GB |

Note
  1. LLMs typically use the Adam/AdamW optimizer.

  2. In the table, all parameters use 16-bit (2-byte) storage, except for QLoRA models that use 4-bit or 8-bit storage and optimizer states that use 32-bit (4-byte) storage.
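
The totals in the table can be recomputed from their components. The sketch below assumes the same inputs as the table: Adam with two 4-byte states per trainable parameter, 2-byte gradients, 1% trainable parameters for LoRA and QLoRA, activations at 10% of the 16-bit parameter memory, and 2 GB for other overhead. The `METHODS` mapping and `finetune_memory_gb` are illustrative names.

```python
GB = 1e9  # decimal gigabytes

# method -> (bytes per stored weight, fraction of parameters that is trained)
METHODS = {
    "full": (2.0, 1.0),
    "lora": (2.0, 0.01),
    "qlora_8bit": (1.0, 0.01),
    "qlora_4bit": (0.5, 0.01),
}


def finetune_memory_gb(param_count, method, other_gb=2.0):
    """Sum weights, gradients, Adam states, activations, and other overhead."""
    weight_bytes, fraction = METHODS[method]
    trainable = param_count * fraction
    weights = param_count * weight_bytes / GB
    gradients = trainable * 2 / GB            # 2-byte gradients
    adam_states = trainable * 2 * 4 / GB      # two 4-byte states per parameter
    activations = 0.10 * param_count * 2 / GB  # 10% of 16-bit parameter memory
    return weights + gradients + adam_states + activations + other_gb


for method in METHODS:
    print(f"{method:<12}{finetune_memory_gb(7e9, method):6.1f} GB")
```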

FAQ

Q: How do I find an LLM's parameter count?

For open-source LLMs, the parameter count is usually in the model name. For example, Qwen-7B has 7 billion (7 × 10^9) parameters, and Qwen3-235B-A22B has a total of 235 billion parameters, of which 22 billion are activated during inference. If the count is not in the name, check the model documentation.
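
If you script your estimates, the naming convention can be read programmatically. The helper below is a heuristic sketch, not an official parser: `parse_param_counts` is a hypothetical name, and it assumes that sizes appear as hyphen-separated tokens such as `7B`, with an `A` prefix (as in `A22B`) marking the activated count. Names that do not follow this pattern return `None` values.

```python
import re

# A size token looks like "7B" or "235B"; an "A" prefix marks activated parameters.
_SIZE_TOKEN = re.compile(r"^(A?)(\d+(?:\.\d+)?)B$", re.IGNORECASE)


def parse_param_counts(model_name):
    """Return (total_params, activated_params); either may be None."""
    total = activated = None
    for token in re.split(r"[-_]", model_name):
        match = _SIZE_TOKEN.match(token)
        if not match:
            continue
        count = float(match.group(2)) * 1e9
        if match.group(1):        # "A22B" -> activated parameter count
            activated = count
        elif total is None:       # first plain size token -> total count
            total = count
    return total, activated


print(parse_param_counts("Qwen-7B"))
print(parse_param_counts("Qwen3-235B-A22B"))
```

For dense models the activated count is `None`, meaning that all parameters take part in inference.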

Q: How do I find an LLM's parameter precision?

LLMs typically use 16-bit (2-byte) storage unless otherwise specified. Quantized models may use 8-bit or 4-bit. Check the model documentation for details. For PAI Model Gallery models, the product page shows the parameter precision:

[Image: Qwen2.5-7B-Instruct training instructions]

Q: How do I find the optimizer and state precision for fine-tuning?

LLM training typically uses the Adam/AdamW optimizer, with optimizer states stored in 32-bit (4-byte) precision. Check the start command or training code for details.

Q: How do I view GPU memory usage?

View GPU memory on the monitoring page of PAI-DSW, PAI-EAS, or PAI-DLC:

[Image: GPU memory monitoring page]

Or run nvidia-smi in the container terminal:

[Image: nvidia-smi output]
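
To read the same numbers from a script, `nvidia-smi` can be queried in CSV form. The sketch below uses the `--query-gpu` and `--format` options of `nvidia-smi`; the helper names are illustrative, the sample values in the last line are made up for demonstration, and the parser assumes that every field is a plain integer in MiB.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (values in MiB) into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"gpu": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def query_gpu_memory():
    """Run nvidia-smi if it is installed; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Parsing a made-up sample, so the example also works without a GPU.
print(parse_gpu_memory("0, 1024, 81920\n1, 0, 81920"))
```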

Q: What are common out-of-GPU-memory errors?

Insufficient GPU memory triggers an error such as "CUDA out of memory. Tried to allocate X GB". To resolve it, increase GPU memory or reduce the batch size, sequence length, or other memory-related parameters.
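
Before retrying, it can help to estimate how long a sequence fits in the available memory. The sketch below inverts the activation and KV cache formulas from this topic, whose sum is b × s × h × L × param_bytes × (1 + 2C). The 24 GB budget, `h=4096`, and `num_layers=32` are assumed example values, and `max_sequence_length` is a hypothetical helper; treat the result as a rough planning figure, not a guarantee.

```python
GB = 1e9  # decimal gigabytes


def max_sequence_length(budget_gb, weights_gb, h, num_layers,
                        b=1, concurrency=1, param_bytes=2, other_gb=2.0):
    """Largest s for which activations plus KV cache fit in the leftover memory."""
    free_bytes = (budget_gb - weights_gb - other_gb) * GB
    if free_bytes <= 0:
        return 0  # the weights and overhead alone do not fit
    # Bytes needed per token: activations (1 part) + KV cache (2 x C parts).
    per_token = b * h * num_layers * param_bytes * (1 + 2 * concurrency)
    return int(free_bytes // per_token)


# Assumed setup: 24 GB GPU, 7B FP16 model (14 GB), 7B-class architecture.
for batch in (1, 2, 4):
    s = max_sequence_length(24, 14, h=4096, num_layers=32, b=batch)
    print(f"batch size {batch}: up to about {s} tokens")
```

Doubling the batch size roughly halves the sequence length that fits, which is why reducing either one is the usual first response to an out-of-memory error.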