
Platform For AI: Estimate GPU memory for Large Language Models

Last Updated: Mar 05, 2026

This topic describes the factors that affect the GPU memory required for Large Language Model (LLM) deployment and fine-tuning. It also explains how to estimate the required GPU memory.

Simple GPU memory estimator

Note
  • This topic estimates the GPU memory required for Large Language Model (LLM) deployment and fine-tuning based on general calculation methods. The estimated values may differ from actual GPU memory usage because different models have varying network structures and algorithms.

  • For a mixture-of-experts (MoE) model, such as DeepSeek-R1-671B, all 671B model parameters must be loaded. However, only 37B parameters are activated during inference. Therefore, when you calculate the GPU memory occupied by activation values, you must use the 37B model parameter count.

  • During model fine-tuning, models typically store parameters, activation values, and gradients in a 16-bit format. They use the Adam/AdamW optimizer and store optimizer states in a 32-bit format.

Inference



Scenario | Required GPU memory (GB)
Inference (16-bit) | -
Inference (8-bit) | -
Inference (4-bit) | -

Fine-tuning


Scenario | Required GPU memory (GB)
Full-parameter fine-tuning | -
LoRA fine-tuning | -
QLoRA (8-bit) fine-tuning | -
QLoRA (4-bit) fine-tuning | -

Factors affecting GPU memory for model inference

The GPU memory required for model inference consists of the following main components:

Model parameters

First, the model's parameters must be stored during model inference. The formula for the occupied GPU memory is: parameter count × parameter precision. Common parameter precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For Large Language Models (LLMs), model parameters are typically stored in FP16 or BF16 format. For example, a 7B model with FP16 parameter precision requires the following amount of GPU memory: 7 × 10⁹ parameters × 2 bytes = 14 GB.
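This calculation can be sketched in Python as a quick check. The function name and the decimal-GB convention (1 GB = 10⁹ bytes, matching the figures in this topic) are illustrative assumptions:

```python
def parameter_memory_gb(param_count: float, bytes_per_param: int) -> float:
    """GPU memory for model parameters: parameter count x parameter precision."""
    return param_count * bytes_per_param / 1e9  # decimal GB

# A 7B model stored in FP16 (2 bytes per parameter):
print(parameter_memory_gb(7e9, 2))  # 14.0 GB
```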

Activation values

During Large Language Model (LLM) inference, the activation values of each layer must be calculated. The occupied GPU memory is directly proportional to the batch size, sequence length, and model architecture, such as the number of layers and the hidden size. The relationship is expressed as follows:

Activation memory ≈ b × s × h × L × param_bytes

Where:

  • b (batch size): The batch size per request. This is typically 1 for online services and a value other than 1 for batch processing interfaces.

  • s (sequence length): The total sequence length, which includes the input and output token count.

  • h (hidden size): The dimension of the model's hidden layer.

  • L (Layers): The number of Transformer layers in the model.

  • param_bytes: The precision for storing activation values, which is typically 2 bytes.

Based on these factors and practical experience, you can simplify the GPU memory estimation. To allow for a buffer, for a 7B model where b is 1, s is 2048, and param_bytes is 2 bytes, you can estimate the GPU memory occupied by activation values as approximately 10% of the model's total occupied GPU memory. This is calculated as: 14 GB × 10% = 1.4 GB.
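The proportional relationship above can be sketched in Python. The hidden size h = 4096 and layer count L = 32 are assumed values typical of a 7B model, not figures from this topic; the 10% rule of thumb deliberately over-estimates the proportional formula to leave a buffer:

```python
def activation_memory_gb(b: int, s: int, h: int, L: int, param_bytes: int) -> float:
    """Proportional estimate of activation memory: b x s x h x L x param_bytes."""
    return b * s * h * L * param_bytes / 1e9  # decimal GB

# Assumed 7B-model shape: h=4096, L=32 (illustrative, not from this topic).
formula_gb = activation_memory_gb(b=1, s=2048, h=4096, L=32, param_bytes=2)
rule_of_thumb_gb = 0.10 * 14  # 10% of the 14 GB parameter memory, with buffer

print(round(formula_gb, 2))  # 0.54 GB from the formula; the 10% rule gives 1.4 GB
```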

KV cache

To accelerate Large Language Model (LLM) inference, the computed Key (K) and Value (V) matrices for each Transformer layer are typically cached. This method avoids recomputing the attention values for all historical tokens at each timestep. With the KV cache, the attention computation at each step is reduced from O(s²) to O(s), which significantly improves the inference speed. Similar to activation values, the GPU memory usage of the KV cache is directly proportional to the batch size, sequence length, concurrency, and model architecture, such as the number of layers and the hidden size. The relationship is expressed as follows:

KV cache memory ≈ 2 × b × s × h × L × C × param_bytes

Where:

  • 2: Indicates the need to store two matrices, K (Key) and V (Value).

  • b (batch size): The batch size per request. This is typically 1 for online services and a value other than 1 for batch processing interfaces.

  • s (sequence length): The total sequence length, which includes the input and output token count.

  • h (hidden size): The dimension of the model's hidden layer.

  • L (Layers): The number of Transformer layers in the model.

  • C (Concurrency): The concurrency of service interface requests.

  • param_bytes: The precision for storing the K and V values, which is typically 2 bytes.

Based on these factors and practical experience, you can simplify the GPU memory estimation. To allow for a buffer, for a 7B model where C is 1, b is 1, s is 2048, and param_bytes is 2 bytes, you can estimate the GPU memory occupied by the KV cache as approximately 10% of the model's total occupied GPU memory. This is calculated as: 14 GB × 10% = 1.4 GB.
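The KV cache estimate can be sketched the same way. As before, h = 4096 and L = 32 are assumed 7B-model values, not figures from this topic:

```python
def kv_cache_memory_gb(b: int, s: int, h: int, L: int, C: int, param_bytes: int) -> float:
    """KV cache estimate: 2 (K and V matrices) x b x s x h x L x C x param_bytes."""
    return 2 * b * s * h * L * C * param_bytes / 1e9  # decimal GB

# Assumed 7B-model shape: h=4096, L=32 (illustrative, not from this topic).
formula_gb = kv_cache_memory_gb(b=1, s=2048, h=4096, L=32, C=1, param_bytes=2)
print(round(formula_gb, 2))  # 1.07 GB from the formula; the 10% rule gives 1.4 GB
```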

Other factors

In addition to the factors mentioned, the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory. This is typically 1 GB to 2 GB.

Based on this analysis, the minimum GPU memory required for a 7B Large Language Model (LLM) inference deployment is approximately: 14 GB (parameters) + 1.4 GB (activation values) + 1.4 GB (KV cache) + 2 GB (other) ≈ 18.8 GB.
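Putting the components together, a minimal sketch of the inference total, assuming the 10% rules of thumb and the 2 GB overhead described above:

```python
def inference_memory_gb(model_gb: float, overhead_gb: float = 2.0) -> float:
    """Minimum inference memory: parameters + ~10% activations + ~10% KV cache + overhead."""
    activations_gb = 0.10 * model_gb  # 10% rule of thumb
    kv_cache_gb = 0.10 * model_gb     # 10% rule of thumb
    return model_gb + activations_gb + kv_cache_gb + overhead_gb

# 7B model in FP16: 14 GB of parameters.
print(round(inference_memory_gb(14.0), 1))  # 18.8 GB
```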

Factors affecting GPU memory for model fine-tuning

The GPU memory required for model fine-tuning consists of the following main components:

Model parameters

First, the model's parameters must be stored during fine-tuning. The formula for the occupied GPU memory is: parameter count × parameter precision. Common parameter precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For Large Language Models (LLMs), model parameters are typically stored in FP16 or BF16 format during fine-tuning. For example, a 7B model with FP16 parameter precision requires the following amount of GPU memory: 7 × 10⁹ parameters × 2 bytes = 14 GB.

Gradient parameters

During the backward propagation phase of model training, gradients are calculated for the model parameters. The number of gradients is equal to the number of parameters to be trained. Large Language Models (LLMs) typically store gradients with 2-byte precision. Therefore, a 7B model requires the following amount of GPU memory, which varies depending on the fine-tuning method:

Fine-tuning method | Training mechanism | Scenarios | GPU memory required for 7B model fine-tuning gradients (calculated with 1% parameters, 2-byte storage)
Full-parameter fine-tuning | The parameters to train are the same as the model's own parameters. | High-precision requirements with sufficient computing power | 14 GB
LoRA (low-rank adaptation) | Freezes the original model parameters and trains only low-rank matrices. The parameters to be trained depend on the model structure and the size of the low-rank matrices, typically about 0.1% to 1% of the total model parameters. | Low-resource adaptation for specific tasks | 0.14 GB
QLoRA (quantization + LoRA) | Compresses the pre-trained model to 4-bit or 8-bit, uses LoRA to fine-tune the model, and introduces double quantization and paged optimizers to further reduce GPU memory usage. The parameters to be trained typically account for about 0.1% to 1% of the total model parameters. | Fine-tuning for ultra-large models | 0.14 GB

Optimizer states

During training, you must save the optimizer states. The number of state values is related to the number of parameters to be trained. Additionally, models typically use mixed-precision training. In this type of training, model parameters and gradients use 2-byte storage, and optimizer states use 4-byte storage. This method ensures high precision during parameter updates and prevents numerical instability or overflow that can be caused by the limited dynamic range of FP16/BF16. If you use 4-byte storage for states, the required GPU memory doubles. The common optimizers are as follows:

Optimizer type | Parameter update mechanism | Additional storage requirement (per parameter to be trained) | Scenarios | 7B optimizer states (4-byte storage): Full-parameter fine-tuning | LoRA fine-tuning (calculated with 1% parameters) | QLoRA fine-tuning (calculated with 1% parameters)
SGD | Uses only current gradients | 0 (no additional states) | Small models or experiments | 0 | 0 | 0
SGD + Momentum | With momentum term | 1 floating-point number (momentum) | Better stability | 28 GB | 0.28 GB | 0.28 GB
RMSProp | Adaptive learning rate | 1 floating-point number (second moment) | Non-convex optimization | 28 GB | 0.28 GB | 0.28 GB
Adam/AdamW | Momentum + adaptive learning rate | 2 floating-point numbers (first + second moment) | Common for Large Language Models (LLMs) | 56 GB | 0.56 GB | 0.56 GB
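The per-optimizer figures in the table can be reproduced with a short sketch. The mapping name and function name are illustrative; the state counts and 4-byte storage follow the table:

```python
# Number of additional state values each optimizer stores per trainable parameter.
STATES_PER_PARAM = {"sgd": 0, "sgd_momentum": 1, "rmsprop": 1, "adam": 2}

def optimizer_state_memory_gb(trainable_params: float, optimizer: str = "adam") -> float:
    """Optimizer-state memory: trainable parameters x states per parameter x 4 bytes."""
    return trainable_params * STATES_PER_PARAM[optimizer] * 4 / 1e9  # decimal GB

print(optimizer_state_memory_gb(7e9, "adam"))   # 56.0 GB (full-parameter fine-tuning)
print(optimizer_state_memory_gb(7e7, "adam"))   # 0.56 GB (LoRA/QLoRA with 1% of parameters)
```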

Activation values

During training, you must store the intermediate activation values that are generated during forward propagation to calculate gradients during backward propagation. This GPU memory consumption is directly proportional to the batch size, sequence length, and model architecture, such as the number of layers and the hidden size. The relationship is expressed as follows:

Activation memory ≈ b × s × h × L × param_bytes

Where:

  • b (batch size): The batch size.

  • s (sequence length): The total sequence length, which includes the input and output token count.

  • h (hidden size): The dimension of the model's hidden layer.

  • L (Layers): The number of Transformer layers in the model.

  • param_bytes: The precision for storing activation values, which is typically 2 bytes.

Based on these factors and practical experience, you can simplify the GPU memory estimation. To allow for a buffer, for a 7B model where b is 1, s is 2048, and param_bytes is 2 bytes, you can estimate the GPU memory occupied by activation values as approximately 10% of the model's total occupied GPU memory. This is calculated as: 14 GB × 10% = 1.4 GB.

Other factors

In addition to the factors mentioned, the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory. This is typically 1 GB to 2 GB.

Based on this analysis, the approximate GPU memory required for fine-tuning a 7B Large Language Model (LLM) is as follows:

Fine-tuning method | GPU memory required for the model itself | GPU memory required for gradients | Adam optimizer states | Activation values | Other | Total
Full-parameter fine-tuning | 14 GB | 14 GB | 56 GB | 1.4 GB | 2 GB | 87.4 GB
LoRA (low-rank adaptation) | 14 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 18.1 GB
QLoRA (8-bit quantization + LoRA) | 7 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 11.1 GB
QLoRA (4-bit quantization + LoRA) | 3.5 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 7.6 GB
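The totals in the table can be reproduced with a hedged sketch. The fixed 1.4 GB activation figure and 2 GB overhead follow this topic's simplified 7B example; the function name is illustrative:

```python
def finetune_memory_gb(model_gb: float, trainable_params: float,
                       activations_gb: float = 1.4, overhead_gb: float = 2.0) -> float:
    """Fine-tuning memory: model + 16-bit gradients + 32-bit Adam states + activations + other."""
    gradients_gb = trainable_params * 2 / 1e9      # 2-byte gradients per trainable parameter
    optimizer_gb = trainable_params * 2 * 4 / 1e9  # Adam/AdamW: two 4-byte states per parameter
    return model_gb + gradients_gb + optimizer_gb + activations_gb + overhead_gb

print(round(finetune_memory_gb(14.0, 7e9), 1))  # 87.4 GB, full-parameter fine-tuning
print(round(finetune_memory_gb(14.0, 7e7), 1))  # 18.1 GB, LoRA with 1% of parameters
print(round(finetune_memory_gb(7.0, 7e7), 1))   # 11.1 GB, QLoRA 8-bit
print(round(finetune_memory_gb(3.5, 7e7), 1))   # 7.6 GB, QLoRA 4-bit
```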

Note
  1. Large Language Models (LLMs) typically use the Adam/AdamW optimizer.

  2. In the table, all parameters use 16-bit (2-byte) storage, except for QLoRA models that use 4-bit or 8-bit storage and optimizer states that use 32-bit (4-byte) storage.

FAQ

Q: How do I view Large Language Model (LLM) parameter counts?

For open source Large Language Models (LLMs), the parameter count is usually indicated in the model name. For example, Qwen-7B has a parameter count of 7 billion (7B), and Qwen3-235B-A22B has a total parameter count of 235B with 22B parameters activated during inference. For models where the parameter count is not explicitly included in the name, you can search for and review the model documentation to find this information.

Q: How do I view Large Language Model (LLM) parameter precision?

Unless otherwise specified, Large Language Models (LLMs) typically use 16-bit (2-byte) storage. Quantized models might use 8-bit or 4-bit storage. For more information, see the model's documentation. For example, if you use a model from the PAI Model Gallery, its product page usually describes the parameter precision, such as the Qwen2.5-7B-Instruct training instructions.

Q: How do I view the optimizer and state precision used for Large Language Model (LLM) fine-tuning?

Large Language Model (LLM) training typically uses the Adam/AdamW optimizer with 32-bit (4-byte) parameter precision. For more detailed configurations, you can check the start command or code.

Q: How do I view GPU memory usage?

You can view GPU memory usage on the graphical monitoring page of PAI-DSW, PAI-EAS, or PAI-DLC.

Alternatively, you can run the nvidia-smi command in the container's terminal to view GPU usage.

Q: What are common out-of-GPU-memory errors?

When the NVIDIA GPU memory is insufficient, you receive an error such as: CUDA out of memory. Tried to allocate X GB. In this case, you can use GPUs with more memory, or reduce parameters such as the batch size or sequence length.