The GPU memory required to deploy and fine-tune a Large Language Model (LLM) depends on factors such as the number of parameters, precision, and sequence length. You can use the calculator in this topic to quickly estimate your GPU memory needs and select the right GPU specification.
1. Simple GPU memory calculator
This topic estimates the GPU memory required for LLM deployment and fine-tuning based on common calculation methods. The actual GPU memory usage may differ due to variations in model network structures and algorithms.
For Mixture-of-Experts (MoE) models, such as DeepSeek-R1-671B, all 671B model parameters must be loaded into GPU memory, but only 37B parameters are activated during inference. Therefore, calculate the GPU memory for model parameters based on the full 671B parameter count, and the GPU memory for activation values based on the 37B parameter count.
During model fine-tuning, model parameters, activation values, and gradients are typically stored in 16-bit precision. The Adam/AdamW optimizer is used, and its state is stored in 32-bit precision.
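The MoE note above can be sketched as follows. This is a minimal illustration, not the calculator's implementation: the function name is hypothetical, 16-bit weights are assumed, and the activation estimate reuses the 10% rule of thumb from section 2.2, applied to the activated parameters only.

```python
GB = 1e9  # decimal gigabytes, matching the "7B x 2 bytes = 14 GB" convention

def moe_inference_estimate_gb(total_params, active_params,
                              bytes_per_param=2.0, activation_ratio=0.10):
    """Return (weights_gb, activations_gb) for an MoE model.

    Weights are sized from ALL parameters because every expert must be loaded.
    Activations are sized from the ACTIVATED parameters only, using the
    10% rule of thumb (an assumption borrowed from section 2.2).
    """
    weights_gb = total_params * bytes_per_param / GB
    activations_gb = activation_ratio * active_params * bytes_per_param / GB
    return weights_gb, activations_gb

# DeepSeek-R1-671B: 671B parameters loaded, 37B activated (figures from the text).
weights_gb, activations_gb = moe_inference_estimate_gb(671e9, 37e9)
print(f"weights: {weights_gb:.0f} GB, activations: {activations_gb:.1f} GB")
```

Pass a different bytes_per_param if the weights are stored at another precision.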
Inference
| Scenario | Required GPU memory (GB) |
|---|---|
| Inference (16-bit) | - |
| Inference (8-bit) | - |
| Inference (4-bit) | - |
Fine-tuning
| Scenario | Required GPU memory (GB) |
|---|---|
| Full fine-tuning | - |
| LoRA fine-tuning | - |
| QLoRA (8-bit) fine-tuning | - |
| QLoRA (4-bit) fine-tuning | - |
2. Factors that affect GPU memory for model inference
The GPU memory required for model inference consists of the following main components:
2.1 Model parameters
During model inference, the model parameters must be stored in GPU memory. The formula to calculate the GPU memory they occupy is: Number of parameters × Bytes per parameter. Common precisions are FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For LLMs, parameters are typically stored in FP16 or BF16. For example, for a 7B model with FP16 precision, the required GPU memory is: 7 × 10^9 × 2 bytes = 14 GB.
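The parameter formula can be written as a small helper. This is a sketch: the precision labels "int8" and "int4" are illustrative names for the 8-bit and 4-bit cases that appear elsewhere in this topic, and 1 GB is taken as 10^9 bytes.

```python
GB = 1e9  # decimal gigabytes

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def param_memory_gb(num_params, precision="fp16"):
    """GPU memory for model weights: number of parameters x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / GB

for precision in ("fp16", "int8", "int4"):
    print(f"7B model, {precision}: {param_memory_gb(7e9, precision):.1f} GB")
```

For a 7B model this gives 14.0 GB, 7.0 GB, and 3.5 GB.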
2.2 Activation values
During LLM inference, the activation values of each layer must be calculated. The GPU memory they occupy is positively correlated with the batch size, the sequence length, and the model architecture, such as the number of layers and the hidden size. The relationship can be expressed as:
Activation memory ∝ b × s × h × L × param_bytes
Where:
b (batch size): The batch size for a single request. This is typically 1 for online services but can be greater than 1 for batch processing.
s (sequence length): The total sequence length, including input and output (number of tokens).
h (hidden size): The hidden layer dimension of the model.
L (Layers): The number of Transformer layers in the model.
param_bytes: The precision for storing activation values, typically 2 bytes.
Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for activation values can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The calculation is: 14 GB × 10% = 1.4 GB.
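To see how the 10% rule of thumb compares with the layer-level factors listed above, the sketch below multiplies b, s, h, L, and param_bytes. Treating the relationship as a plain product is an assumption, because the text only states a positive correlation. The hidden size of 4096 and the 32 layers are assumed values typical of 7B models, not figures from this topic.

```python
GB = 1e9  # decimal gigabytes

def activation_memory_gb(b, s, h, num_layers, param_bytes=2):
    """Proportional estimate: b x s x h x L x param_bytes (assumed plain product)."""
    return b * s * h * num_layers * param_bytes / GB

# h=4096 and num_layers=32 are ASSUMED values for a typical 7B model.
detailed = activation_memory_gb(b=1, s=2048, h=4096, num_layers=32)

# Rule of thumb from the text: 10% of the 16-bit model footprint (14 GB).
rule_of_thumb = 0.10 * 7e9 * 2 / GB

print(f"product of factors: {detailed:.2f} GB")
print(f"10% rule of thumb:  {rule_of_thumb:.2f} GB")
```

Under these assumptions the product comes to about 0.54 GB, below the 1.4 GB rule of thumb, which matches the text's description of the 10% figure as a safe margin.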
2.3 KV cache
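To accelerate LLM inference, the key (K) and value (V) matrices calculated for each Transformer layer are typically cached. This avoids recalculating the keys and values of all previous tokens at each time step. Using a KV cache reduces the computational complexity of generating each new token from O(n²) to O(n), where n is the sequence length, at the cost of additional GPU memory. The GPU memory that the KV cache occupies can be expressed as:
KV cache memory = 2 × b × s × h × L × C × param_bytes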
Where:
2: Represents the two matrices that need to be stored: K (Key) and V (Value).
b (batch size): The batch size for a single request. This is typically 1 for online services but can be greater than 1 for batch processing.
s (sequence length): The total sequence length, including input and output (number of tokens).
h (hidden size): The hidden layer dimension of the model.
L (Layers): The number of Transformer layers in the model.
C (Concurrency): The concurrency of service interface requests.
param_bytes: The precision for storing the cached K and V values, typically 2 bytes.
Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a concurrency (C) of 1, a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for the KV cache can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The value is: 14 GB × 10% = 1.4 GB.
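The factors listed above can be combined into a direct KV cache estimate. This sketch assumes the cache size is the plain product 2 × b × s × h × L × C × param_bytes. The hidden size of 4096 and the 32 layers are assumed values typical of 7B models, not figures from this topic.

```python
GB = 1e9  # decimal gigabytes

def kv_cache_memory_gb(b, s, h, num_layers, concurrency=1, param_bytes=2):
    """KV cache estimate: 2 (K and V) x b x s x h x L x C x param_bytes."""
    return 2 * b * s * h * num_layers * concurrency * param_bytes / GB

# h=4096 and num_layers=32 are ASSUMED values for a typical 7B model.
single = kv_cache_memory_gb(b=1, s=2048, h=4096, num_layers=32)
eight_concurrent = kv_cache_memory_gb(b=1, s=2048, h=4096, num_layers=32, concurrency=8)

print(f"C=1: {single:.2f} GB")
print(f"C=8: {eight_concurrent:.2f} GB")
```

With a concurrency of 1, the estimate is about 1.07 GB, within the 1.4 GB rule of thumb. The cache grows linearly with concurrency and sequence length, so the 10% figure no longer holds for heavily concurrent or long-context services.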
2.4 Other factors
In addition to the factors above, the input data for the current batch, the CUDA context and kernels, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory, typically 1 GB to 2 GB.
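Based on this analysis, the minimum GPU memory required to deploy a 7B LLM for inference is approximately: 14 GB (model parameters) + 1.4 GB (activation values) + 1.4 GB (KV cache) + 2 GB (other factors) = 18.8 GB.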
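The inference components from section 2 can be added up in a few lines. This is a sketch under stated assumptions: the function name and parameters are illustrative, the 10% ratios are the rules of thumb from sections 2.2 and 2.3, and the overhead is taken as the upper bound of 2 GB.

```python
GB = 1e9  # decimal gigabytes

def inference_memory_gb(num_params, bytes_per_param=2.0,
                        activation_ratio=0.10, kv_cache_ratio=0.10,
                        overhead_gb=2.0):
    """Weights + activations + KV cache + framework overhead, in GB."""
    weights = num_params * bytes_per_param / GB
    activations = activation_ratio * weights   # rule of thumb, section 2.2
    kv_cache = kv_cache_ratio * weights        # rule of thumb, section 2.3
    return weights + activations + kv_cache + overhead_gb

print(f"7B model, 16-bit: {inference_memory_gb(7e9):.1f} GB")
```

For a 7B model in 16-bit precision the result is 18.8 GB. Whether the 10% ratios should scale with quantized weights is not stated in this topic, so treat results for 8-bit or 4-bit weights as rough.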
3. Factors that affect GPU memory for model fine-tuning
The GPU memory required for model fine-tuning consists of the following main components:
3.1 Model parameters
During fine-tuning, the model parameters must be stored in GPU memory. The formula to calculate the GPU memory they occupy is: Number of parameters × Bytes per parameter. Common precisions are FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For LLM fine-tuning, parameters are typically stored in FP16 or BF16. For example, for a 7B model with FP16 precision, the required GPU memory is: 7 × 10^9 × 2 bytes = 14 GB.
3.2 Gradient parameters
During the backward propagation phase of model training, gradients must be calculated for the model parameters. The number of gradients is the same as the number of parameters being trained. In LLMs, gradients are typically stored with 2-byte precision. Therefore, for a 7B model, the GPU memory required for gradients varies based on the fine-tuning method:
| Fine-tuning method | Training mechanism | Scenarios | GPU memory for gradients (7B model, 2-byte storage; LoRA and QLoRA assume 1% trainable parameters) |
|---|---|---|---|
| Full-parameter fine-tuning | The number of trainable parameters is the same as the number of model parameters. | High-precision requirements with sufficient computing power | 14 GB |
| LoRA (low-rank adaptation) | LoRA fine-tuning freezes the original model parameters and trains only the low-rank matrices. The number of trainable parameters depends on the model structure and the size of the low-rank matrices, typically accounting for 0.1% to 1% of the total model parameters. | Adapting to specific tasks with limited resources | 0.14 GB |
| QLoRA (Quantization + LoRA) | The pre-trained model is compressed to 4-bit or 8-bit. The model is then fine-tuned using LoRA. Double quantization and paged optimizers are introduced to further reduce GPU memory usage. The number of trainable parameters is typically 0.1% to 1% of the total model parameters. | Fine-tuning very large-scale models | 0.14 GB |
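The gradient figures in the table follow from the number of trainable parameters. A minimal sketch, with an illustrative function name and the 1% trainable fraction that the table assumes for LoRA and QLoRA:

```python
GB = 1e9  # decimal gigabytes

def gradient_memory_gb(num_params, trainable_fraction=1.0, grad_bytes=2.0):
    """One gradient value per trainable parameter, stored in grad_bytes bytes."""
    return num_params * trainable_fraction * grad_bytes / GB

full = gradient_memory_gb(7e9)                           # all parameters trainable
lora = gradient_memory_gb(7e9, trainable_fraction=0.01)  # 1% trainable (LoRA, QLoRA)

print(f"full fine-tuning: {full:.2f} GB")
print(f"LoRA / QLoRA:     {lora:.2f} GB")
```

Because the text gives 0.1% to 1% as the typical range of trainable parameters, 0.14 GB is the upper end for LoRA. At 0.1% the gradients take about 0.014 GB.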
3.3 Optimizer state
During training, the optimizer state must also be stored. The number of state values depends on the number of trainable parameters. In addition, LLMs are often trained with mixed precision: model parameters and gradients are stored in 2-byte precision, while the optimizer state is stored in 4-byte precision. This keeps parameter updates precise and avoids the numerical instability or overflow that the limited dynamic range of FP16 or the limited mantissa precision of BF16 can cause. Storing the state in 4-byte precision doubles its GPU memory compared with 2-byte storage. The following table describes common optimizers:
| Optimizer type | Parameter update mechanism | Additional storage requirement (each trainable parameter) | Scenarios | GPU memory for optimizer state (7B model, 4-byte storage): full-parameter fine-tuning | GPU memory for optimizer state: LoRA fine-tuning (1% parameters) | GPU memory for optimizer state: QLoRA fine-tuning (1% parameters) |
|---|---|---|---|---|---|---|
| SGD | Uses only the current gradient | 0 (no additional state) | Small models or experiments | 0 | 0 | 0 |
| SGD + Momentum | Includes momentum term | 1 floating-point number (momentum) | Better stability | 28 GB | 0.28 GB | 0.28 GB |
| RMSProp | Adaptive learning rate | 1 floating-point number (second moment) | Non-convex optimization | 28 GB | 0.28 GB | 0.28 GB |
| Adam/AdamW | Momentum + Adaptive learning rate | 2 floating-point numbers (first and second moments) | Common for LLMs | 56 GB | 0.56 GB | 0.56 GB |
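The optimizer table can be reproduced from the number of state values that each optimizer keeps per trainable parameter. A minimal sketch with illustrative names for the dictionary keys and the function:

```python
GB = 1e9  # decimal gigabytes

# Extra state values stored per trainable parameter (from the table above).
STATES_PER_PARAM = {"sgd": 0, "sgd_momentum": 1, "rmsprop": 1, "adam": 2}

def optimizer_state_gb(num_params, optimizer="adam",
                       trainable_fraction=1.0, state_bytes=4.0):
    """Optimizer state: trainable parameters x states per parameter x bytes."""
    trainable = num_params * trainable_fraction
    return trainable * STATES_PER_PARAM[optimizer] * state_bytes / GB

for name in STATES_PER_PARAM:
    full = optimizer_state_gb(7e9, name)
    lora = optimizer_state_gb(7e9, name, trainable_fraction=0.01)
    print(f"{name:>12}: full {full:.2f} GB, LoRA/QLoRA {lora:.2f} GB")
```

The Adam/AdamW row dominates full fine-tuning: at 56 GB, the optimizer state is four times the size of the 16-bit weights.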
3.4 Activation values
During training, intermediate activation values from the forward propagation pass must be stored to calculate gradients during backward propagation. This memory consumption is positively correlated with the batch size, the sequence length, and the model architecture, such as the number of layers and the hidden size. The relationship can be expressed as:
Activation memory ∝ b × s × h × L × param_bytes
Where:
b (batch size): The batch size.
s (sequence length): The total sequence length, including input and output (number of tokens).
h (hidden size): The hidden layer dimension of the model.
L (Layers): The number of Transformer layers in the model.
param_bytes: The precision for storing activation values, typically 2 bytes.
Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for activation values can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The calculation is: 14 GB × 10% = 1.4 GB.
3.5 Other factors
In addition to the factors above, the input data for the current batch, the CUDA context and kernels, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory, typically 1 GB to 2 GB.
Based on this analysis, the approximate GPU memory required for fine-tuning a 7B LLM is:
| Fine-tuning method | Model parameters | Gradients | Adam optimizer state | Activation values | Others | Total |
|---|---|---|---|---|---|---|
| Full-parameter fine-tuning | 14 GB | 14 GB | 56 GB | 1.4 GB | 2 GB | 87.4 GB |
| LoRA (low-rank adaptation) | 14 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 18.1 GB |
| QLoRA (8-bit quantization + LoRA) | 7 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 11.1 GB |
| QLoRA (4-bit quantization + LoRA) | 3.5 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 7.6 GB |
LLMs typically use the Adam/AdamW optimizer.
In the table, QLoRA model parameters are stored in 4-bit or 8-bit precision, and the optimizer state is stored in 32-bit (4-byte) precision. All other parameters are stored in 16-bit (2-byte) precision.
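The totals in the table can be reproduced by summing the five components. This is a sketch under the table's own assumptions: Adam/AdamW with two 4-byte states per trainable parameter, 2-byte gradients, activation values at 10% of the 16-bit model footprint, and 2 GB for other factors. The function and scenario names are illustrative.

```python
GB = 1e9  # decimal gigabytes

def finetune_memory_gb(num_params, weight_bytes, trainable_fraction):
    """Sum of weights, gradients, Adam state, activations, and overhead, in GB."""
    trainable = num_params * trainable_fraction
    weights = num_params * weight_bytes / GB
    gradients = trainable * 2 / GB            # 2-byte gradients
    adam_state = trainable * 2 * 4 / GB       # 2 states x 4 bytes each
    activations = 0.10 * num_params * 2 / GB  # 10% of the 16-bit footprint
    overhead = 2.0                            # framework, CUDA context, inputs
    return weights + gradients + adam_state + activations + overhead

# (weight bytes per parameter, fraction of parameters that are trainable)
SCENARIOS = {
    "full": (2.0, 1.0),
    "lora": (2.0, 0.01),
    "qlora_8bit": (1.0, 0.01),
    "qlora_4bit": (0.5, 0.01),
}

for name, (weight_bytes, fraction) in SCENARIOS.items():
    total = finetune_memory_gb(7e9, weight_bytes, fraction)
    print(f"{name:>10}: {total:.1f} GB")
```

For a 7B model this yields 87.4 GB, 18.1 GB, 11.1 GB, and 7.6 GB, matching the Total column.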
4. FAQ
Q: How do I check the number of parameters in an LLM?
A: For open source LLMs, the number of parameters is usually indicated in the model name. For example, Qwen-7B has 7B (7 billion) parameters.
Q: How do I check the parameter precision of an LLM?
A: Unless otherwise specified, LLMs typically use 16-bit (2-byte) storage. For quantized models, 8-bit or 4-bit storage may be used. For more information, see the model's documentation. For example, if you use a model from the PAI Model Gallery, the product page usually describes the parameter precision:
Qwen2.5-7B-Instruct training instructions:

Q: How do I check the optimizer and its state precision for LLM fine-tuning?
A: LLM training typically uses the Adam/AdamW optimizer, with the optimizer state stored in 32-bit (4-byte) precision. For the exact configuration, check the start command or the training code.
Q: How do I check GPU memory usage?
A: You can view GPU memory usage on the graphical monitoring pages of PAI-DSW, PAI-EAS, or PAI-DLC:

Alternatively, you can run the nvidia-smi command in the container's terminal to check GPU usage:
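If you want to read the same figures from a script, nvidia-smi can also print them as CSV. The sketch below is one possible approach, not part of the PAI tooling: it runs nvidia-smi with the --query-gpu and --format options and parses the used and total memory in MiB. The function names are hypothetical, and the parser assumes that every field is an integer.

```python
import subprocess

# nvidia-smi options: report used and total memory as bare numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_memory_csv(text):
    """Parse lines such as '1234, 16384' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)

# Example with captured output, so the parser can be tried without a GPU:
sample = "1234, 16384\n0, 16384\n"
for index, (used, total) in enumerate(parse_memory_csv(sample)):
    print(f"GPU {index}: {used} / {total} MiB ({100 * used / total:.1f}% used)")
```

On a machine with an NVIDIA driver installed, call query_gpu_memory() instead of parsing the sample string.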

Q: What are common "out of memory" errors?
A: When an NVIDIA GPU runs out of memory, an error such as "CUDA out of memory. Tried to allocate X GB" is reported. To resolve it, use a GPU with more memory, or reduce memory usage by lowering settings such as the batch size or the sequence length.