This topic describes the factors that affect the GPU memory required for deploying and fine-tuning LLMs and explains how to estimate the required GPU memory.
Simple GPU memory estimator
The following tables estimate the GPU memory required for deploying and fine-tuning LLMs based on common calculation methods. Actual GPU memory usage may vary due to differences in model network structures and algorithms.
For a Mixture-of-Experts (MoE) model, such as DeepSeek-R1-671B, all 671B parameters of the model must be loaded. However, only 37B parameters are activated during inference. Therefore, when you calculate the GPU memory occupied by activation values, use the 37B parameter count.
When fine-tuning a model, 16-bit precision is typically used to store model parameters, activation values, and gradients. The Adam or AdamW optimizer is used, and the optimizer state is stored with 32-bit precision.
Inference
| Scenario | Required GPU Memory (GB) |
|---|---|
| Inference (16-bit) | - |
| Inference (8-bit) | - |
| Inference (4-bit) | - |
Fine-tuning
| Scenario | Required GPU Memory (GB) |
|---|---|
| Full Fine-tuning | - |
| LoRA Fine-tuning | - |
| QLoRA (8-bit) Fine-tuning | - |
| QLoRA (4-bit) Fine-tuning | - |
Factors that affect GPU memory requirements for model inference
The GPU memory required for model inference consists of the following main components:
Model parameters
During model inference, the model's parameters must be stored in GPU memory. The formula to calculate the required GPU memory is: Number of parameters × Parameter precision. Common parameter precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For large language models (LLMs), model parameters typically use FP16 or BF16. For example, for a 7B model with FP16 precision, the required GPU memory is: 7 billion parameters × 2 bytes/parameter ≈ 14 GB.
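To make the arithmetic concrete, you can script the calculation. The following is a minimal sketch (the helper name and the convention of 10^9 bytes per GB are illustrative assumptions):

```python
# Estimate the GPU memory needed to hold the model parameters.
# "GB" here means 10^9 bytes, matching the rounded figures in this topic.
def parameter_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

# A 7B model stored in FP16 or BF16 (2 bytes per parameter).
print(parameter_memory_gb(7e9, 2))  # 14.0 GB
```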
Activation values
During LLM inference, activation values are calculated for each layer of the network. The required GPU memory is directly proportional to the batch size, sequence length, and model architecture (such as the number of layers and hidden size). This relationship can be approximated as: Activation memory ≈ b × s × h × L × param_bytes
Where:
b (batch size): The batch size for a single request. This value is typically 1 for online services but can be larger for batch processing interfaces.
s (sequence length): The total sequence length, including the number of input and output tokens.
h (hidden size): The dimension of the model's hidden layer.
L: The number of Transformer layers in the model.
param_bytes: The number of bytes used to store each activation value, typically 2 bytes.
Based on practical experience and the preceding factors, to simplify the estimation and include a buffer, you can roughly estimate that for a 7B model where b is 1, s is 2048, and param_bytes is 2 bytes, the GPU memory for activation values is approximately 10% of the GPU memory used by the model: 14 GB × 10% = 1.4 GB.
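The simplified formula can also be evaluated directly. The following sketch assumes a hidden size of 4096 and 32 Transformer layers, which are illustrative values for a typical 7B model rather than figures from this topic:

```python
# Activation memory from the simplified formula: b * s * h * L * param_bytes.
def activation_memory_gb(b: int, s: int, h: int, num_layers: int, param_bytes: int = 2) -> float:
    return b * s * h * num_layers * param_bytes / 1e9

# b = 1, s = 2048; hidden size and layer count are assumed values for a 7B model.
print(activation_memory_gb(b=1, s=2048, h=4096, num_layers=32))  # ~0.54 GB
# The 10% rule above (14 GB x 10% = 1.4 GB) deliberately includes a buffer on top of this.
```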
KV cache
To speed up large language model inference, the computed Key (K) and Value (V) for each Transformer layer are cached. This avoids recalculating these values for all historical tokens at each time step and reduces the attention computation for each newly generated token from quadratic to linear in the sequence length. The GPU memory required for the KV cache is: 2 × b × s × h × L × C × param_bytes
Where:
2: Represents the two matrices that need to be stored, Key (K) and Value (V).
b (batch size): The batch size for a single request. This value is typically 1 for online services but can be larger for batch processing interfaces.
s (sequence length): The total sequence length, including the number of input and output tokens.
h (hidden size): The dimension of the model's hidden layer.
L: The number of Transformer layers in the model.
C: The degree of concurrency for service interface requests.
param_bytes: The number of bytes used to store each value, which is typically 2 bytes.
Based on practical experience and the previously mentioned factors, you can simplify the GPU memory estimation and include a margin. For example, for a 7B model where C is 1, b is 1, s is 2048, and param_bytes is 2 bytes, the GPU memory for the KV cache can also be roughly estimated as 10% of the model's GPU memory usage, which is: 14 GB × 10% = 1.4 GB.
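As with the activation values, the KV cache formula can be evaluated directly. This sketch again assumes a hidden size of 4096 and 32 layers for a 7B model:

```python
# KV cache memory: 2 (K and V) * b * s * h * L * C * param_bytes.
def kv_cache_memory_gb(b: int, s: int, h: int, num_layers: int,
                       concurrency: int, param_bytes: int = 2) -> float:
    return 2 * b * s * h * num_layers * concurrency * param_bytes / 1e9

# C = 1, b = 1, s = 2048; hidden size and layer count are assumed values for a 7B model.
print(kv_cache_memory_gb(b=1, s=2048, h=4096, num_layers=32, concurrency=1))  # ~1.07 GB
# The 10% rule above (14 GB x 10% = 1.4 GB) rounds this up to include a margin.
```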
Other factors
In addition to the factors above, the input data for the current batch, the CUDA context, and the deep learning framework itself (such as PyTorch or TensorFlow) also consume GPU memory. This overhead is typically 1 GB to 2 GB.
Based on this analysis, the minimum GPU memory required to deploy a 7B LLM for inference is approximately: 14 GB (model parameters) + 1.4 GB (activation values) + 1.4 GB (KV cache) + 2 GB (other) ≈ 18.8 GB.
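Putting the components together, the end-to-end estimate can be reproduced with a short script that follows this topic's simplified rules (the 2 GB overhead is the upper end of the 1 GB to 2 GB range mentioned above):

```python
# End-to-end inference estimate for a 7B model in FP16, using this topic's simplified rules.
params_gb = 7e9 * 2 / 1e9          # model parameters: 14 GB
activations_gb = 0.10 * params_gb  # ~10% of the model memory: 1.4 GB
kv_cache_gb = 0.10 * params_gb     # ~10% of the model memory: 1.4 GB
overhead_gb = 2.0                  # framework and CUDA context: 1-2 GB

total_gb = params_gb + activations_gb + kv_cache_gb + overhead_gb
print(f"{total_gb:.1f} GB")        # 18.8 GB
```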
Factors that affect GPU memory requirements for model fine-tuning
The GPU memory required for model fine-tuning depends on the following components:
Model parameters
During fine-tuning, the model parameters must be stored in GPU memory. The memory usage is calculated with the following formula: Number of parameters × Parameter precision. Common parameter precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For LLMs, model parameters typically use FP16 or BF16 precision during fine-tuning. For example, a 7B parameter model that uses FP16 precision requires the following amount of GPU memory: 7 billion parameters × 2 bytes/parameter ≈ 14 GB.
Gradient parameters
During the backward propagation phase of model training, gradients are calculated for the model parameters. The number of gradients is equal to the number of parameters being trained. LLMs typically store these gradients with 2-byte precision. Therefore, the GPU memory required for a 7B model varies based on the fine-tuning method:
| Fine-tuning method | Training mechanism | Scenarios | GPU memory for gradients when fine-tuning a 7B model (2-byte storage; LoRA and QLoRA calculated for 1% of parameters) |
|---|---|---|---|
| Full-parameter fine-tuning | The number of trainable parameters equals the total number of model parameters. | High-precision requirements with sufficient computing power. | 14 GB |
| LoRA (Low-Rank Adaptation) | Freezes the original model parameters and trains only the low-rank matrices. The number of trainable parameters depends on the model structure and the size of the low-rank matrices, typically about 0.1% to 1% of the total model parameters. | Adapting to specific tasks with limited resources. | 0.14 GB |
| QLoRA (Quantization + LoRA) | Compresses the pre-trained model to 4-bit or 8-bit, fine-tunes it with LoRA, and introduces double quantization and paged optimizers to further reduce GPU memory usage. The number of trainable parameters is typically about 0.1% to 1% of the total model parameters. | Fine-tuning very large-scale models. | 0.14 GB |
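The gradient figures in the table follow directly from the number of trainable parameters. A minimal sketch:

```python
# Gradient memory: trainable parameters * 2 bytes (FP16/BF16 gradients).
def gradient_memory_gb(total_params: float, trainable_fraction: float,
                       bytes_per_grad: int = 2) -> float:
    return total_params * trainable_fraction * bytes_per_grad / 1e9

print(gradient_memory_gb(7e9, 1.0))   # full-parameter fine-tuning: 14 GB
print(gradient_memory_gb(7e9, 0.01))  # LoRA or QLoRA at 1% trainable parameters: 0.14 GB
```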
Optimizer state
The optimizer state must also be saved during training. The amount of memory required depends on the number of trainable parameters. Models often use mixed-precision training. In this method, model parameters and gradients use 2-byte storage, while the optimizer state uses 4-byte storage. This approach maintains high precision during parameter updates and prevents numerical instability or overflow that is caused by the limited dynamic range of FP16 or BF16. Because the optimizer state uses 4 bytes per value, it can require significant memory. For example, an optimizer that stores one 4-byte value for each 2-byte parameter requires twice the memory of the parameters. The following table describes common optimizers:
| Optimizer type | Parameter update mechanism | Additional storage required (per trainable parameter) | Scenarios | Optimizer state for full-parameter fine-tuning of a 7B model (4-byte storage) | Optimizer state for LoRA fine-tuning (4-byte storage, calculated for 1% of parameters) | Optimizer state for QLoRA fine-tuning (4-byte storage, calculated for 1% of parameters) |
|---|---|---|---|---|---|---|
| SGD | Uses only the current gradient | 0 (no extra state) | Small models or experiments | 0 | 0 | 0 |
| SGD + Momentum | Adds a momentum term | 1 floating-point number (momentum) | Better stability | 28 GB | 0.28 GB | 0.28 GB |
| RMSProp | Adaptive learning rate | 1 floating-point number (second-order moment) | Non-convex optimization | 28 GB | 0.28 GB | 0.28 GB |
| Adam/AdamW | Momentum + adaptive learning rate | 2 floating-point numbers (first- and second-order moments) | Commonly used for LLMs | 56 GB | 0.56 GB | 0.56 GB |
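The optimizer-state column can be computed the same way: trainable parameters × state values per parameter × 4 bytes. A sketch, with Adam/AdamW (two state values per parameter) as the default:

```python
# Optimizer state memory: trainable parameters * state values per parameter * 4 bytes.
def optimizer_state_memory_gb(total_params: float, trainable_fraction: float,
                              state_values_per_param: int = 2,  # Adam/AdamW
                              bytes_per_value: int = 4) -> float:
    return total_params * trainable_fraction * state_values_per_param * bytes_per_value / 1e9

print(optimizer_state_memory_gb(7e9, 1.0))    # Adam/AdamW, full-parameter: 56 GB
print(optimizer_state_memory_gb(7e9, 0.01))   # Adam/AdamW, LoRA or QLoRA at 1%: 0.56 GB
print(optimizer_state_memory_gb(7e9, 1.0, state_values_per_param=1))  # SGD + Momentum: 28 GB
```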
Activation values
During training, intermediate activation values from the forward propagation pass must be stored to calculate gradients during backward propagation. The GPU memory consumption for this component is positively correlated with the batch size, sequence length, and model architecture, such as the number of layers and the hidden size. The relationship can be approximated with the following formula: Activation memory ≈ b × s × h × L × param_bytes
Where:
b: The batch size.
s: The total sequence length, which includes the input and output (number of tokens).
h: The dimension of the model's hidden layers.
L: The number of Transformer layers in the model.
param_bytes: The number of bytes used to store each activation value, typically 2 bytes.
Based on practical experience and the factors mentioned above, you can simplify the GPU memory estimation and include a margin. For example, for a 7B model with b = 1, s = 2048, and param_bytes = 2 bytes, you can roughly estimate the GPU memory for activation values as 10% of the GPU memory used by the model: 14 GB × 10% = 1.4 GB.
Other factors
In addition to the preceding factors, other components also consume GPU memory, including the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow. This overhead typically consumes 1 GB to 2 GB of GPU memory.
Based on this analysis, the approximate amount of GPU memory required to fine-tune a 7B LLM is as follows:
| Fine-tuning method | GPU memory for the model | GPU memory for gradients | Adam optimizer state | Activation values | Other | Total |
|---|---|---|---|---|---|---|
| Full-parameter fine-tuning | 14 GB | 14 GB | 56 GB | 1.4 GB | 2 GB | 87.4 GB |
| LoRA (Low-Rank Adaptation) | 14 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 18.1 GB |
| QLoRA (8-bit quantization + LoRA) | 7 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 11.1 GB |
| QLoRA (4-bit quantization + LoRA) | 3.5 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 7.6 GB |
LLMs typically use the Adam or AdamW optimizer.
In the table, the QLoRA model is stored in 4-bit or 8-bit precision, and the optimizer state is stored in 32-bit (4-byte) precision. All other parameters are stored in 16-bit (2-byte) precision.
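The totals in the table can be reproduced with a short script that follows the same simplified rules (the function name and structure are illustrative):

```python
# Reproduce the fine-tuning totals for a 7B model using this topic's simplified rules.
def finetune_total_gb(model_bytes_per_param: float, trainable_fraction: float,
                      total_params: float = 7e9) -> float:
    model_gb = total_params * model_bytes_per_param / 1e9
    gradients_gb = total_params * trainable_fraction * 2 / 1e9      # FP16/BF16 gradients
    optimizer_gb = total_params * trainable_fraction * 2 * 4 / 1e9  # Adam/AdamW, FP32 states
    activations_gb = 1.4                                            # ~10% of the 16-bit model size
    other_gb = 2.0                                                  # framework and CUDA context
    return model_gb + gradients_gb + optimizer_gb + activations_gb + other_gb

print(finetune_total_gb(2, 1.0))     # full-parameter fine-tuning: 87.4 GB
print(finetune_total_gb(2, 0.01))    # LoRA: 18.1 GB
print(finetune_total_gb(1, 0.01))    # QLoRA (8-bit): 11.1 GB
print(finetune_total_gb(0.5, 0.01))  # QLoRA (4-bit): 7.6 GB
```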
FAQ
Q: How do I check the number of parameters in a large model?
For open-source large models, the number of parameters is usually indicated in the model name. For example, Qwen-7B has approximately 7 billion (7B) parameters.
Q: How do I check the parameter precision of an LLM?
Unless specified otherwise, large models typically use 16-bit (2-byte) storage. Quantized models might use 8-bit or 4-bit storage. For details, refer to the model documentation. For example, if you use a model from the PAI Model Gallery, its product page usually describes the parameter precision:
Qwen2.5-7B-Instruct training instructions:

Q: How do I check the optimizer and state precision used for LLM fine-tuning?
LLM training typically uses the Adam or AdamW optimizer, and the optimizer state is stored in 32-bit (4-byte) precision. For more detailed configurations, check the training startup command or the code.
Q: How do I check GPU memory usage?
You can view the GPU memory usage on the graphical monitoring pages of PAI-DSW, PAI-EAS, or PAI-DLC.

Alternatively, you can run the nvidia-smi command in the container's terminal to view GPU usage:

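If your workload runs in PyTorch, you can also query GPU memory from code. A minimal sketch that uses standard PyTorch APIs (requires a CUDA-capable GPU):

```python
import torch

# Query GPU memory from within a PyTorch process.
if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory
    allocated = torch.cuda.memory_allocated(0)  # memory occupied by live tensors
    reserved = torch.cuda.memory_reserved(0)    # memory held by PyTorch's caching allocator
    print(f"total: {total / 1e9:.1f} GB, "
          f"allocated: {allocated / 1e9:.1f} GB, "
          f"reserved: {reserved / 1e9:.1f} GB")
```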
Q: What are the common errors caused by insufficient GPU memory?
Insufficient NVIDIA GPU memory causes an error similar to `CUDA out of memory. Tried to allocate X GB`. If this error occurs, you can increase the GPU memory or reduce parameters such as the batch size and sequence length.