This topic describes the factors that affect the GPU memory required for Large Language Model (LLM) deployment and fine-tuning. It also explains how to estimate the required GPU memory.
Simple GPU memory estimator
- This topic estimates the GPU memory required for Large Language Model (LLM) deployment and fine-tuning based on general calculation methods. The estimated values may differ from actual GPU memory usage because different models have varying network structures and algorithms.
- For a mixture-of-experts (MoE) model, such as DeepSeek-R1-671B, all 671B model parameters must be loaded. However, only 37B parameters are activated during inference. Therefore, when you calculate the GPU memory occupied by activation values, you must use the 37B model parameter count.
- During model fine-tuning, models typically store parameters, activation values, and gradients in a 16-bit format. They use the Adam/AdamW optimizer and store optimizer states in a 32-bit format.
Inference
| Scenario | Required GPU Memory (GB) |
|---|---|
| Inference (16-bit) | Parameter count (in billions) × 2.4 + 2 |
| Inference (8-bit) | Parameter count (in billions) × 1.2 + 2 |
| Inference (4-bit) | Parameter count (in billions) × 0.6 + 2 |
Fine-tuning
| Scenario | Required GPU Memory (GB) |
|---|---|
| Full-parameter fine-tuning | Parameter count (in billions) × 12.2 + 2 |
| LoRA fine-tuning | Parameter count (in billions) × 2.3 + 2 |
| QLoRA (8-bit) fine-tuning | Parameter count (in billions) × 1.3 + 2 |
| QLoRA (4-bit) fine-tuning | Parameter count (in billions) × 0.8 + 2 |
Factors affecting GPU memory for model inference
The GPU memory required for model inference consists of the following main components:
Model parameters
First, the model's parameters must be stored during model inference. The formula for the occupied GPU memory is: parameter count × parameter precision. Common parameter precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For Large Language Models (LLMs), model parameters are typically stored in FP16 or BF16 format. For example, a 7B model with FP16 parameter precision requires the following amount of GPU memory:

7 × 10⁹ parameters × 2 bytes = 14 GB
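As a quick sanity check, the parameter-memory rule above can be written as a few lines of Python. The function name is ours, and GB here means 10⁹ bytes, the convention used throughout this topic:

```python
def param_memory_gb(num_params: float, bytes_per_param: float = 2) -> float:
    """GPU memory (in GB) needed just to hold the model weights."""
    return num_params * bytes_per_param / 1e9

# A 7B model stored in FP16 (2 bytes per parameter)
print(param_memory_gb(7e9))  # 14.0
```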
Activation values
During Large Language Model (LLM) inference, the activation values for each neuron layer must be calculated. The occupied GPU memory is directly proportional to the batch size, sequence length, and model architecture, such as the number of layers and hidden layer size. The relationship is expressed as follows:

GPU memory occupied by activation values ≈ b × s × h × L × param_bytes
Where:
- b (batch size): The batch size per request. This is typically 1 for online services and a value other than 1 for batch processing interfaces.
- s (sequence length): The total sequence length, which includes the input and output token count.
- h (hidden size): The dimension of the model's hidden layer.
- L (layers): The number of Transformer layers in the model.
- param_bytes: The precision for storing activation values, which is typically 2 bytes.
Based on these factors and practical experience, you can simplify the GPU memory estimation. To allow for a buffer, for a 7B model where b is 1, s is 2048, and param_bytes is 2 bytes, you can estimate the GPU memory occupied by activation values as approximately 10% of the model's total occupied GPU memory. This is calculated as:

14 GB × 10% = 1.4 GB
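To make the activation estimate concrete, the sketch below plugs typical 7B-class dimensions into the formula. The hidden size of 4096 and 32 layers are assumptions for a typical 7B model, not values from this topic, and the helper name is ours:

```python
def activation_memory_gb(b: int, s: int, h: int, L: int, param_bytes: int = 2) -> float:
    """Rough inference activation footprint: b x s x h x L x param_bytes."""
    return b * s * h * L * param_bytes / 1e9

# b=1, s=2048 with assumed 7B-class dimensions h=4096, L=32
print(activation_memory_gb(1, 2048, 4096, 32))  # ~0.54 GB; the 1.4 GB estimate above adds a buffer
```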
KV cache
To accelerate the inference efficiency of a Large Language Model (LLM), the computed Key (K) and Value (V) for each Transformer layer are typically cached. This method avoids recomputing attention mechanism parameters for all historical tokens at each timestep. With the KV cache, the per-token attention computation is reduced from quadratic to linear in the sequence length. The GPU memory occupied by the KV cache is calculated as follows:

GPU memory occupied by the KV cache = 2 × b × s × h × L × C × param_bytes
Where:
- 2: Indicates the need to store two matrices, K (Key) and V (Value).
- b (batch size): The batch size per request. This is typically 1 for online services and a value other than 1 for batch processing interfaces.
- s (sequence length): The total sequence length, which includes the input and output token count.
- h (hidden size): The dimension of the model's hidden layer.
- L (layers): The number of Transformer layers in the model.
- C (concurrency): The concurrency of service interface requests.
- param_bytes: The precision used to store the cached K and V values, which is typically 2 bytes.
Based on these factors and practical experience, you can simplify the GPU memory estimation. To allow for a buffer, for a 7B model where C is 1, b is 1, s is 2048, and param_bytes is 2 bytes, you can estimate the GPU memory occupied by the KV cache as approximately 10% of the model's total occupied GPU memory. This is calculated as:

14 GB × 10% = 1.4 GB
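The KV cache formula can be sketched the same way. As before, the hidden size of 4096 and 32 layers are assumed 7B-class dimensions, and the function name is ours:

```python
def kv_cache_gb(b: int, s: int, h: int, L: int, C: int = 1, param_bytes: int = 2) -> float:
    """KV cache size: 2 (K and V) x batch x sequence x hidden x layers x concurrency x bytes."""
    return 2 * b * s * h * L * C * param_bytes / 1e9

print(kv_cache_gb(1, 2048, 4096, 32))       # ~1.07 GB at concurrency 1
print(kv_cache_gb(1, 2048, 4096, 32, C=8))  # grows linearly with concurrency
```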
Other factors
In addition to the factors mentioned, the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory. This is typically 1 GB to 2 GB.
Based on this analysis, the minimum GPU memory required for a 7B Large Language Model (LLM) inference deployment is approximately:

14 GB (model parameters) + 1.4 GB (activation values) + 1.4 GB (KV cache) + 2 GB (other) ≈ 18.8 GB
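Putting the pieces together, a minimal end-to-end inference estimator that uses the ~10% rules of thumb from this topic can look like the following. The function name is ours:

```python
def inference_memory_gb(params_billions: float, bytes_per_param: float = 2,
                        overhead_gb: float = 2.0) -> float:
    """Weights + ~10% for activations + ~10% for KV cache + framework overhead."""
    weights = params_billions * bytes_per_param
    return weights * 1.2 + overhead_gb

print(round(inference_memory_gb(7), 1))       # 18.8 -> minimum for a 7B FP16 model
print(round(inference_memory_gb(7, 0.5), 1))  # 6.2  -> the same model quantized to 4-bit
```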
Factors affecting GPU memory for model fine-tuning
The GPU memory required for model fine-tuning consists of the following main components:
Model parameters
First, the model's parameters must be stored during fine-tuning. The formula for the occupied GPU memory is: parameter count × parameter precision. Common parameter precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For Large Language Models (LLMs), model parameters are typically stored in FP16 or BF16 format during fine-tuning. For example, a 7B model with FP16 parameter precision requires the following amount of GPU memory:

7 × 10⁹ parameters × 2 bytes = 14 GB
Gradient parameters
During the backward propagation phase of model training, gradients are calculated for the model parameters. The number of gradients is equal to the number of parameters to be trained. Large Language Models (LLMs) typically store gradients with 2-byte precision. Therefore, a 7B model requires the following amount of GPU memory, which varies depending on the fine-tuning method:
| Fine-tuning method | Training mechanism | Scenarios | GPU memory required for 7B model fine-tuning gradients (2-byte storage; LoRA and QLoRA calculated with 1% of parameters) |
|---|---|---|---|
| Full-parameter fine-tuning | The parameters to train are the same as the model's own parameters. | High-precision requirements with sufficient computing power | 14 GB |
| LoRA (low-rank adaptation) | LoRA fine-tuning freezes the original model parameters and trains only low-rank matrices. The parameters to be trained depend on the model structure and the size of the low-rank matrices, typically about 0.1% to 1% of the total model parameters. | Low-resource adaptation for specific tasks | 0.14 GB |
| QLoRA (quantization + LoRA) | Compresses the pre-trained model to 4-bit or 8-bit, uses LoRA to fine-tune the model, and introduces double quantization and paged optimizers to further reduce GPU memory usage. The parameters to be trained typically account for about 0.1% to 1% of the total model parameters. | Fine-tuning for ultra-large models | 0.14 GB |
Optimizer states
During training, you must save the optimizer states. The number of state values is related to the number of parameters to be trained. Additionally, models typically use mixed-precision training. In this type of training, model parameters and gradients use 2-byte storage, and optimizer states use 4-byte storage. This method ensures high precision during parameter updates and prevents numerical instability or overflow that can be caused by the limited dynamic range of FP16/BF16. If you use 4-byte storage for states, the required GPU memory doubles. The common optimizers are as follows:
The last three columns show the GPU memory required for the optimizer states when fine-tuning a 7B model with 4-byte state storage.

| Optimizer type | Parameter update mechanism | Additional storage (per trainable parameter) | Scenarios | Full-parameter fine-tuning | LoRA fine-tuning (1% of parameters) | QLoRA fine-tuning (1% of parameters) |
|---|---|---|---|---|---|---|
| SGD | Uses only current gradients | 0 (no additional states) | Small models or experiments | 0 | 0 | 0 |
| SGD + Momentum | Adds a momentum term | 1 floating-point number (momentum) | Better stability | 28 GB | 0.28 GB | 0.28 GB |
| RMSProp | Adaptive learning rate | 1 floating-point number (second moment) | Non-convex optimization | 28 GB | 0.28 GB | 0.28 GB |
| Adam/AdamW | Momentum + adaptive learning rate | 2 floating-point numbers (first + second moment) | Common for Large Language Models (LLMs) | 56 GB | 0.56 GB | 0.56 GB |
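The optimizer-state numbers in the table follow from a single multiplication, sketched below. The function name is ours:

```python
def optimizer_state_gb(trainable_params: float, states_per_param: int,
                       state_bytes: int = 4) -> float:
    """Optimizer-state memory: one 4-byte value per state per trainable parameter."""
    return trainable_params * states_per_param * state_bytes / 1e9

# Adam/AdamW keeps two states per parameter (first and second moments)
print(optimizer_state_gb(7e9, 2))         # 56.0 -> full-parameter fine-tuning of a 7B model
print(optimizer_state_gb(7e9 * 0.01, 2))  # 0.56 -> LoRA with ~1% trainable parameters
```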
Activation values
During training, you must store the intermediate activation values that are generated during forward propagation to calculate gradients during backward propagation. This GPU memory consumption is directly proportional to the batch size, sequence length, and model architecture, such as the number of layers and hidden layer size. The relationship is expressed as follows:

GPU memory occupied by activation values ≈ b × s × h × L × param_bytes
Where:
- b (batch size): The batch size.
- s (sequence length): The total sequence length, which includes the input and output token count.
- h (hidden size): The dimension of the model's hidden layer.
- L (layers): The number of Transformer layers in the model.
- param_bytes: The precision for storing activation values, which is typically 2 bytes.
Based on these factors and practical experience, you can simplify the GPU memory estimation. To allow for a buffer, for a 7B model where b is 1, s is 2048, and param_bytes is 2 bytes, you can estimate the GPU memory occupied by activation values as approximately 10% of the model's total occupied GPU memory. This is calculated as:

14 GB × 10% = 1.4 GB
Other factors
In addition to the factors mentioned, the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory. This is typically 1 GB to 2 GB.
Based on this analysis, the approximate GPU memory required for fine-tuning a 7B Large Language Model (LLM) is as follows:
| Fine-tuning method | Model parameters | Gradients | Adam optimizer states | Activation values | Other | Total |
|---|---|---|---|---|---|---|
| Full-parameter fine-tuning | 14 GB | 14 GB | 56 GB | 1.4 GB | 2 GB | 87.4 GB |
| LoRA (low-rank adaptation) | 14 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 18.1 GB |
| QLoRA (8-bit quantization + LoRA) | 7 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 11.1 GB |
| QLoRA (4-bit quantization + LoRA) | 3.5 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 7.6 GB |
- Large Language Models (LLMs) typically use the Adam/AdamW optimizer.
- In the table, all values assume 16-bit (2-byte) storage, except for QLoRA model weights, which use 4-bit or 8-bit storage, and optimizer states, which use 32-bit (4-byte) storage.
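The totals in the table above can be reproduced with a short estimator. The function and parameter names are ours, and activation memory is kept at ~10% of the 16-bit model size for all methods, matching the table:

```python
def finetune_memory_gb(params_billions: float, weight_bytes: float = 2,
                       trainable_frac: float = 1.0, adam_states: int = 2,
                       overhead_gb: float = 2.0) -> float:
    """Weights + 2-byte gradients + 4-byte Adam states + ~10% activations + overhead."""
    weights = params_billions * weight_bytes
    gradients = params_billions * trainable_frac * 2
    optimizer = params_billions * trainable_frac * adam_states * 4
    activations = 0.1 * params_billions * 2  # ~10% of the 16-bit model size
    return weights + gradients + optimizer + activations + overhead_gb

print(round(finetune_memory_gb(7), 1))                                        # 87.4 full-parameter
print(round(finetune_memory_gb(7, trainable_frac=0.01), 1))                   # 18.1 LoRA
print(round(finetune_memory_gb(7, weight_bytes=1, trainable_frac=0.01), 1))   # 11.1 QLoRA 8-bit
print(round(finetune_memory_gb(7, weight_bytes=0.5, trainable_frac=0.01), 1)) # 7.6 QLoRA 4-bit
```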
FAQ
Q: How do I view Large Language Model (LLM) parameter counts?
For open source Large Language Models (LLMs), the parameter count is usually indicated in the model name. For example, Qwen-7B has a parameter count of 7 billion.
Q: How do I view Large Language Model (LLM) parameter precision?
Unless otherwise specified, Large Language Models (LLMs) typically use 16-bit (2-byte) storage. For quantized models, they might use 8-bit or 4-bit storage. For more information, see the model's documentation. For example, if you use a model from the PAI Model Gallery, its product page usually describes the parameter precision:
Qwen2.5-7B-Instruct training instructions:

Q: How do I view the optimizer and state precision used for Large Language Model (LLM) fine-tuning?
Large Language Model (LLM) training typically uses the Adam/AdamW optimizer with 32-bit (4-byte) parameter precision. For more detailed configurations, you can check the start command or code.
Q: How do I view GPU memory usage?
You can view GPU memory usage on the graphical monitoring page of PAI-DSW, PAI-EAS, or PAI-DLC:

Alternatively, you can run the `nvidia-smi` command in the container's terminal to view GPU usage:

Q: What are common out-of-GPU-memory errors?
When the NVIDIA GPU memory is insufficient, you will receive an error such as `CUDA out of memory. Tried to allocate X GB`. In this case, you can use a GPU with more memory, or reduce the batch size, sequence length, or other parameters.