Platform For AI: Estimate GPU memory for LLMs

Last Updated: Jun 11, 2026

The GPU memory required to deploy and fine-tune a Large Language Model (LLM) depends on factors such as the number of parameters, precision, and sequence length. You can use the calculator in this topic to quickly estimate your GPU memory needs and select the right GPU specification.

1. Simple GPU memory calculator

Note
  • This topic estimates the GPU memory required for LLM deployment and fine-tuning based on common calculation methods. The actual GPU memory usage may differ due to variations in model network structures and algorithms.

  • For Mixture-of-Experts (MoE) models, such as DeepSeek-R1-671B, all 671B model parameters must be loaded. However, only 37B parameters are activated during inference. Therefore, you must calculate the GPU memory for activation values based on the 37B parameter count.

  • During model fine-tuning, model parameters, activation values, and gradients are typically stored in 16-bit precision. The Adam/AdamW optimizer is used, and its state is stored in 32-bit precision.

Inference

| Scenario | Required GPU memory (GB) |
| --- | --- |
| Inference (16-bit) | - |
| Inference (8-bit) | - |
| Inference (4-bit) | - |

Fine-tuning

| Scenario | Required GPU memory (GB) |
| --- | --- |
| Full fine-tuning | - |
| LoRA fine-tuning | - |
| QLoRA (8-bit) fine-tuning | - |
| QLoRA (4-bit) fine-tuning | - |

2. Factors that affect GPU memory for model inference

The GPU memory required for model inference consists of the following main components:

2.1 Model parameters

During model inference, the model parameters must be stored. The GPU memory they occupy is: Number of parameters × Parameter precision (bytes per parameter). Common precisions are FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For LLMs, parameters are typically stored in FP16 or BF16. For example, a 7B model stored in FP16 requires:

$$7 \times 10^{9} \times 2\ \text{bytes} = 14\ \text{GB}$$
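The parameter-memory rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of any PAI tool: the function name `weight_memory_gb` and the precision keys `int8` and `int4` (standing in for 8-bit and 4-bit storage) are hypothetical, and 1 GB is taken as 10^9 bytes to match the 7B example.

```python
# Bytes used to store one parameter at each precision.
# The "int8" and "int4" keys are illustrative labels for 8-bit and 4-bit storage.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the memory needed for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# A 7B model at 16-bit, 8-bit, and 4-bit precision.
for precision in ("fp16", "int8", "int4"):
    print(precision, weight_memory_gb(7e9, precision), "GB")
```

For a 7B model this gives 14 GB, 7 GB, and 3.5 GB respectively, which matches the model memory values used for full, 8-bit, and 4-bit storage elsewhere in this topic.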

2.2 Activation values

During LLM inference, the activation values of each layer must be calculated. The GPU memory they occupy is positively correlated with the batch size, sequence length, and model architecture, such as the number of layers and hidden layer size. The relationship can be expressed as:

$$\text{Activation memory} = b \times s \times h \times L \times \text{param\_bytes}$$

Where:

  • b (batch size): The batch size for a single request. This is typically 1 for online services but can be greater than 1 for batch processing.

  • s (sequence length): The total sequence length, including input and output (number of tokens).

  • h (hidden size): The hidden layer dimension of the model.

  • L (Layers): The number of Transformer layers in the model.

  • param_bytes: The precision for storing activation values, typically 2 bytes.

Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for activation values can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The calculation is: $14\ \text{GB} \times 10\% = 1.4\ \text{GB}$.
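To see why 10% is a comfortable margin, the following sketch compares the rule of thumb with a direct product of the factors listed above. It assumes the relationship is simply b × s × h × L × param_bytes, and it assumes a hidden size of 4096 and 32 layers, which are illustrative values for a 7B-class model and are not stated in this topic.

```python
def activation_memory_gb(b: int, s: int, h: int, num_layers: int, param_bytes: int = 2) -> float:
    """Estimate activation memory in GB as the product of the listed factors.

    Assumes memory = b * s * h * L * param_bytes, with 1 GB = 1e9 bytes.
    """
    return b * s * h * num_layers * param_bytes / 1e9


# Assumed architecture for illustration only: hidden size 4096, 32 layers.
direct_estimate = activation_memory_gb(b=1, s=2048, h=4096, num_layers=32)

# Rule of thumb from the text: 10% of the 14 GB model footprint.
rule_of_thumb = 0.10 * 14.0

print(f"direct estimate: {direct_estimate:.2f} GB")
print(f"rule of thumb:   {rule_of_thumb:.2f} GB")
```

Under these assumptions the direct estimate is roughly 0.54 GB, so the 1.4 GB rule of thumb leaves headroom for longer sequences or larger batches.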

2.3 KV cache

To accelerate LLM inference, the key (K) and value (V) matrices calculated in each Transformer layer are typically cached. This avoids recalculating the keys and values of all previous tokens at each time step. Using a KV cache reduces the computational complexity of each generation step from $O(n^2)$ to $O(n)$, where $n$ is the number of tokens processed so far, which significantly improves the inference speed. Similar to activation values, the GPU memory occupied by the KV cache is positively correlated with the batch size, sequence length, concurrency, and model architecture, such as the number of layers and hidden layer size. The relationship can be expressed as:

$$\text{KV cache memory} = 2 \times b \times s \times h \times L \times C \times \text{param\_bytes}$$

Where:

  • 2: Represents the two matrices that need to be stored: K (Key) and V (Value).

  • b (batch size): The batch size for a single request. This is typically 1 for online services but can be greater than 1 for batch processing.

  • s (sequence length): The total sequence length, including input and output (number of tokens).

  • h (hidden size): The hidden layer dimension of the model.

  • L (Layers): The number of Transformer layers in the model.

  • C (Concurrency): The concurrency of service interface requests.

  • param_bytes: The precision for storing KV cache values, typically 2 bytes.

Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a concurrency (C) of 1, a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for the KV cache can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The calculation is: $14\ \text{GB} \times 10\% = 1.4\ \text{GB}$.
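The KV cache factors listed above can be combined in the same way. The sketch below assumes the relationship is the product 2 × b × s × h × L × C × param_bytes, and again uses a hidden size of 4096 and 32 layers as assumed, illustrative values for a 7B-class model. It ignores architecture-specific savings such as shared key/value heads.

```python
def kv_cache_memory_gb(
    b: int, s: int, h: int, num_layers: int, concurrency: int = 1, param_bytes: int = 2
) -> float:
    """Estimate KV cache memory in GB (1 GB = 1e9 bytes).

    The leading factor 2 accounts for storing both the K and the V matrix.
    """
    return 2 * b * s * h * num_layers * concurrency * param_bytes / 1e9


# Assumed architecture for illustration only: hidden size 4096, 32 layers.
single_request = kv_cache_memory_gb(b=1, s=2048, h=4096, num_layers=32, concurrency=1)
eight_requests = kv_cache_memory_gb(b=1, s=2048, h=4096, num_layers=32, concurrency=8)

print(f"concurrency 1: {single_request:.2f} GB")
print(f"concurrency 8: {eight_requests:.2f} GB")
```

With these assumptions a single request stays below the 1.4 GB rule of thumb, but the cache grows linearly with concurrency and sequence length, so the 10% shortcut is only appropriate for the low-concurrency case described in the text.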

2.4 Other factors

In addition to the factors above, the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory, typically 1 GB to 2 GB.

Based on this analysis, the minimum GPU memory required to deploy a 7B LLM for inference is approximately:

$$14\ \text{GB} + 1.4\ \text{GB} + 1.4\ \text{GB} + 2\ \text{GB} = 18.8\ \text{GB}$$
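Putting the inference components together, a rough estimator might look like the following sketch. The function name and its parameters are hypothetical. It applies the 10% shortcuts for activation values and the KV cache to the weight memory and adds the upper end of the 1 GB to 2 GB framework overhead; how the calculator in this topic treats 8-bit and 4-bit models is not specified here, so only the 16-bit case is shown.

```python
def inference_memory_gb(
    num_params: float,
    bytes_per_param: float = 2.0,
    activation_ratio: float = 0.10,
    kv_cache_ratio: float = 0.10,
    other_gb: float = 2.0,
) -> float:
    """Rough inference memory estimate in GB (1 GB = 1e9 bytes)."""
    weights_gb = num_params * bytes_per_param / 1e9
    activations_gb = activation_ratio * weights_gb  # 10% rule of thumb
    kv_cache_gb = kv_cache_ratio * weights_gb       # 10% rule of thumb
    return weights_gb + activations_gb + kv_cache_gb + other_gb


total_7b = inference_memory_gb(7e9)
print(f"7B model, 16-bit inference: {total_7b:.1f} GB")
```

For a 7B model in 16-bit precision this sums 14 GB, 1.4 GB, 1.4 GB, and 2 GB. Longer sequences or higher concurrency would call for larger ratios than the defaults.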

3. Factors that affect GPU memory for model fine-tuning

The GPU memory required for model fine-tuning consists of the following main components:

3.1 Model parameters

During fine-tuning, the model parameters must be stored. The GPU memory they occupy is: Number of parameters × Parameter precision (bytes per parameter). Common precisions are FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For LLM fine-tuning, parameters are typically stored in FP16 or BF16. For example, a 7B model stored in FP16 requires:

$$7 \times 10^{9} \times 2\ \text{bytes} = 14\ \text{GB}$$

3.2 Gradient parameters

During the backward propagation phase of model training, gradients must be calculated for the model parameters. The number of gradients is the same as the number of parameters being trained. In LLMs, gradients are typically stored with 2-byte precision. Therefore, for a 7B model, the GPU memory required for gradients varies based on the fine-tuning method:

| Fine-tuning method | Training mechanism | Scenarios | GPU memory for gradients (7B model, 2-byte storage; 1% trainable parameters for LoRA and QLoRA) |
| --- | --- | --- | --- |
| Full-parameter fine-tuning | The number of trainable parameters is the same as the number of model parameters. | High-precision requirements with sufficient computing power | 14 GB |
| LoRA (low-rank adaptation) | LoRA fine-tuning freezes the original model parameters and trains only the low-rank matrices. The number of trainable parameters depends on the model structure and the size of the low-rank matrices, typically accounting for 0.1% to 1% of the total model parameters. | Adapting to specific tasks with limited resources | 0.14 GB |
| QLoRA (Quantization + LoRA) | The pre-trained model is compressed to 4-bit or 8-bit. The model is then fine-tuned using LoRA. Double quantization and paged optimizers are introduced to further reduce GPU memory usage. The number of trainable parameters is typically 0.1% to 1% of the total model parameters. | Fine-tuning very large-scale models | 0.14 GB |
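The gradient figures above follow directly from the number of trainable parameters. The sketch below uses a hypothetical helper, `gradient_memory_gb`, and takes the 1% trainable fraction for LoRA and QLoRA as the upper end of the 0.1% to 1% range mentioned above.

```python
def gradient_memory_gb(num_params: float, trainable_fraction: float, grad_bytes: int = 2) -> float:
    """Memory for gradients in GB: one gradient per trainable parameter (1 GB = 1e9 bytes)."""
    return num_params * trainable_fraction * grad_bytes / 1e9


full_ft = gradient_memory_gb(7e9, trainable_fraction=1.0)   # all parameters are trained
lora_ft = gradient_memory_gb(7e9, trainable_fraction=0.01)  # about 1% of parameters are trained

print(f"full fine-tuning: {full_ft:.2f} GB")
print(f"LoRA / QLoRA:     {lora_ft:.2f} GB")
```

QLoRA uses the same value as LoRA because quantization changes how the frozen base weights are stored, not how many parameters are trained.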

3.3 Optimizer state

During training, the optimizer state must also be stored. The number of state values scales with the number of trainable parameters. In addition, LLM training commonly uses mixed precision: model parameters and gradients are stored in 2-byte precision, while the optimizer state is stored in 4-byte precision. This keeps parameter updates precise and prevents numerical instability or overflow caused by the limited range and precision of FP16/BF16. Storing the state in 4 bytes instead of 2 doubles the GPU memory it requires. The following table describes common optimizers:

The GPU memory for optimizer state in the last three columns is calculated for a 7B model with 4-byte storage.

| Optimizer type | Parameter update mechanism | Additional storage requirement (each trainable parameter) | Scenarios | Full-parameter fine-tuning | LoRA fine-tuning (1% parameters) | QLoRA fine-tuning (1% parameters) |
| --- | --- | --- | --- | --- | --- | --- |
| SGD | Uses only the current gradient | 0 (no additional state) | Small models or experiments | 0 | 0 | 0 |
| SGD + Momentum | Includes momentum term | 1 floating-point number (momentum) | Better stability | 28 GB | 0.28 GB | 0.28 GB |
| RMSProp | Adaptive learning rate | 1 floating-point number (second moment) | Non-convex optimization | 28 GB | 0.28 GB | 0.28 GB |
| Adam/AdamW | Momentum + Adaptive learning rate | 2 floating-point numbers (first and second moments) | Common for LLMs | 56 GB | 0.56 GB | 0.56 GB |
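The optimizer-state values above are the product of the trainable parameter count, the number of state values per parameter, and 4 bytes per value. A minimal sketch, with hypothetical names for the lookup table and function:

```python
# Extra state values kept per trainable parameter for each optimizer.
STATES_PER_PARAM = {"sgd": 0, "sgd_momentum": 1, "rmsprop": 1, "adam": 2, "adamw": 2}


def optimizer_state_gb(
    num_params: float, optimizer: str, trainable_fraction: float = 1.0, state_bytes: int = 4
) -> float:
    """Memory for optimizer state in GB (1 GB = 1e9 bytes)."""
    trainable = num_params * trainable_fraction
    return trainable * STATES_PER_PARAM[optimizer] * state_bytes / 1e9


for name in ("sgd", "sgd_momentum", "rmsprop", "adamw"):
    full = optimizer_state_gb(7e9, name)          # full-parameter fine-tuning
    lora = optimizer_state_gb(7e9, name, 0.01)    # LoRA or QLoRA with 1% trainable parameters
    print(f"{name:13s} full: {full:5.1f} GB   LoRA/QLoRA: {lora:.2f} GB")
```

The comparison makes the cost of Adam/AdamW explicit: with full-parameter fine-tuning, its state alone needs four times the memory of the 16-bit weights.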

3.4 Activation values

During training, intermediate activation values from the forward propagation pass must be stored to calculate gradients during backward propagation. This memory consumption is positively correlated with the batch size, sequence length, and model architecture, such as the number of layers and hidden layer size. The relationship can be expressed as:

$$\text{Activation memory} = b \times s \times h \times L \times \text{param\_bytes}$$

Where:

  • b (batch size): The batch size.

  • s (sequence length): The total sequence length, including input and output (number of tokens).

  • h (hidden size): The hidden layer dimension of the model.

  • L (Layers): The number of Transformer layers in the model.

  • param_bytes: The precision for storing activation values, typically 2 bytes.

Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for activation values can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The calculation is: $14\ \text{GB} \times 10\% = 1.4\ \text{GB}$.

3.5 Other factors

In addition to the factors above, the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory, typically 1 GB to 2 GB.

Based on this analysis, the approximate GPU memory required for fine-tuning a 7B LLM is:

| Fine-tuning method | Model GPU memory requirements | Gradient memory | Adam optimizer state | Activation values | Others | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Full-parameter fine-tuning | 14 GB | 14 GB | 56 GB | 1.4 GB | 2 GB | 87.4 GB |
| LoRA (low-rank adaptation) | 14 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 18.1 GB |
| QLoRA (8-bit quantization + LoRA) | 7 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 11.1 GB |
| QLoRA (4-bit quantization + LoRA) | 3.5 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 7.6 GB |

Note
  1. LLMs typically use the Adam/AdamW optimizer.

  2. In the table, QLoRA model parameters are stored in 4-bit or 8-bit precision, and the optimizer state is stored in 32-bit (4-byte) precision. All other parameters are stored in 16-bit (2-byte) precision.
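The totals in the table can be reproduced with a short script. This is a sketch under the assumptions stated in this section: 2-byte gradients, Adam/AdamW with two 4-byte state values per trainable parameter, activation values taken as 10% of the 16-bit model footprint for every method, 2 GB for other overhead, and 1% trainable parameters for LoRA and QLoRA. The function and dictionary names are hypothetical.

```python
def finetune_memory_gb(
    num_params: float,
    weight_bytes: float,
    trainable_fraction: float,
    other_gb: float = 2.0,
) -> float:
    """Rough fine-tuning memory estimate in GB (1 GB = 1e9 bytes)."""
    trainable = num_params * trainable_fraction
    weights_gb = num_params * weight_bytes / 1e9
    gradients_gb = trainable * 2 / 1e9            # 2-byte gradients
    optimizer_gb = trainable * 2 * 4 / 1e9        # Adam/AdamW: 2 states x 4 bytes
    activations_gb = 0.10 * num_params * 2 / 1e9  # 10% of the 16-bit footprint
    return weights_gb + gradients_gb + optimizer_gb + activations_gb + other_gb


# (bytes per stored weight, fraction of parameters that are trainable)
METHODS = {
    "Full-parameter fine-tuning": (2.0, 1.0),
    "LoRA": (2.0, 0.01),
    "QLoRA (8-bit)": (1.0, 0.01),
    "QLoRA (4-bit)": (0.5, 0.01),
}

totals = {
    name: finetune_memory_gb(7e9, weight_bytes, fraction)
    for name, (weight_bytes, fraction) in METHODS.items()
}
for name, total in totals.items():
    print(f"{name:28s} {total:5.1f} GB")
```

Changing `num_params` or the trainable fraction gives a first estimate for other model sizes, subject to the same caveat as the rest of this topic: actual usage varies with model structure and implementation.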

4. FAQ

Q: How do I check the number of parameters in an LLM?

A: For open source LLMs, the number of parameters is usually indicated in the model name. For example, Qwen-7B has 7 billion parameters. Qwen3-235B-A22B has a total of 235 billion parameters, with 22 billion parameters activated during inference. For models that do not specify the parameter count in their name, review the model's documentation to find this information.

Q: How do I check the parameter precision of an LLM?

A: Unless otherwise specified, LLMs typically use 16-bit (2-byte) storage. For quantized models, 8-bit or 4-bit storage may be used. For more information, see the model's documentation. For example, if you use a model from the PAI Model Gallery, the product page usually describes the parameter precision.

[Image: Qwen2.5-7B-Instruct training instructions]

Q: How do I check the optimizer and its state precision for LLM fine-tuning?

A: LLM training typically uses the Adam/AdamW optimizer, with its state stored in 32-bit (4-byte) precision. For more detailed configurations, check the start command or the training code.

Q: How do I check GPU memory usage?

A: You can view GPU memory usage on the graphical monitoring pages of PAI-DSW, PAI-EAS, or PAI-DLC.

Alternatively, you can run the nvidia-smi command in the container's terminal to check GPU usage.
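If you want to read the same numbers from a script, one option is to call nvidia-smi with its CSV query flags and parse the result. The sketch below is an illustration only: the helper names are hypothetical, and it assumes nvidia-smi is on the PATH and reports memory in MiB when `nounits` is set.

```python
import subprocess
from typing import List, Tuple

# Ask nvidia-smi for used and total memory per GPU, one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> List[Tuple[int, int]]:
    """Parse rows such as '1234, 16384' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return (used_mib, total_mib) for each visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

On a machine with an NVIDIA driver installed, calling `gpu_memory_mib()` returns one tuple per GPU. The parser is kept separate so that it can be checked without a GPU.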

Q: What are common "out of memory" errors?

A: When an NVIDIA GPU runs out of memory, an error such as "CUDA out of memory. Tried to allocate X GB" is reported. To resolve it, use a GPU with more memory or reduce memory-related settings, such as the batch size or sequence length.