gpu memory, large language model, llm, fine-tuning, inference, calculator - Platform For AI

The GPU memory required to deploy and fine-tune a Large Language Model (LLM) depends on factors such as the number of parameters, precision, and sequence length. You can use the calculator in this topic to quickly estimate your GPU memory needs and select the right GPU specification.

1. Simple GPU memory calculator

Note

This topic estimates the GPU memory required for LLM deployment and fine-tuning based on common calculation methods. The actual GPU memory usage may differ due to variations in model network structures and algorithms.
For Mixture-of-Experts (MoE) models, such as DeepSeek-R1-671B, all 671B model parameters must be loaded. However, only 37B parameters are activated during inference. Therefore, you must calculate the GPU memory for activation values based on the 37B parameter count.
During model fine-tuning, model parameters, activation values, and gradients are typically stored in 16-bit precision. The Adam/AdamW optimizer is used, and its state is stored in 32-bit precision.

The interactive calculator on the original page has two tabs. Its inputs and outputs are summarized below.

Inference

Inputs: model type, total model parameters (B), inference concurrency, and sequence length.
Outputs: required GPU memory (GB) for 16-bit, 8-bit, and 4-bit inference.

Fine-tuning

Inputs: model type, total model parameters (B), parameter precision (bits), optimizer, training batch size, and sequence length.
Outputs: required GPU memory (GB) for full fine-tuning, LoRA fine-tuning, QLoRA (8-bit) fine-tuning, and QLoRA (4-bit) fine-tuning.

2. Factors that affect GPU memory for model inference

The GPU memory required for model inference consists of the following main components:

2.1 Model parameters

During model inference, the model parameters must be stored in GPU memory. The memory they occupy is: Number of parameters × Bytes per parameter, where the bytes per parameter are determined by the parameter precision. Common precisions are FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For LLMs, parameters are typically stored in FP16 or BF16. For example, for a 7B model with FP16 precision, the required GPU memory is:

$\frac{7 \times 10^{9} \times 2}{10^{9}} = 14 GB$
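The calculation above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual implementation: the function name `param_memory_gb` is hypothetical, and 1 GB is taken as 10^9 bytes, as in the formula above.

```python
def param_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """GPU memory for model weights in GB, with 1 GB = 1e9 bytes."""
    return num_params * bytes_per_param / 1e9


# Bytes per parameter for the precisions mentioned in this topic.
PRECISIONS = {"FP32": 4, "FP16/BF16": 2, "8-bit": 1, "4-bit": 0.5}

for label, nbytes in PRECISIONS.items():
    print(f"7B model at {label}: {param_memory_gb(7e9, nbytes):.1f} GB")
```

For a 7B model this prints 28.0, 14.0, 7.0, and 3.5 GB for the four precisions.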

2.2 Activation values

During LLM inference, the activations (intermediate outputs) of each layer must be computed and held in GPU memory. The memory they occupy grows with the batch size, the sequence length, and the size of the model architecture, such as the number of layers and the hidden layer size. The relationship can be expressed as: $\text{Activation value memory} \propto b \times s \times h \times L \times \text{param\_bytes}$

Where:

b (batch size): The batch size for a single request. This is typically 1 for online services but can be greater than 1 for batch processing.
s (sequence length): The total sequence length, including input and output (number of tokens).
h (hidden size): The hidden layer dimension of the model.
L (Layers): The number of Transformer layers in the model.
param_bytes: The precision for storing activation values, typically 2 bytes.

Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for activation values can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The calculation is: $14 GB \times 0.1 = 1.4 GB$ .

2.3 KV cache

To accelerate LLM inference, the key (K) and value (V) matrices computed in each Transformer layer are typically cached. This avoids recomputing the keys and values of all previous tokens at each decoding step. Using a KV cache reduces the attention computation for each generated token from $O(n^{2})$ to $O(n)$, where $n$ is the current sequence length, which significantly improves inference speed. Similar to activation values, the GPU memory occupied by the KV cache grows with the batch size, sequence length, concurrency, and the size of the model architecture, such as the number of layers and the hidden layer size. The relationship can be expressed as: $\text{KV cache memory} \propto 2 \times b \times s \times h \times L \times C \times \text{param\_bytes}$

Where:

2: Represents the two matrices that need to be stored: K (Key) and V (Value).
b (batch size): The batch size for a single request. This is typically 1 for online services but can be greater than 1 for batch processing.
s (sequence length): The total sequence length, including input and output (number of tokens).
h (hidden size): The hidden layer dimension of the model.
L (Layers): The number of Transformer layers in the model.
C (Concurrency): The concurrency of service interface requests.
param_bytes: The precision for storing the cached keys and values, typically 2 bytes.

Based on these factors and practical experience, you can simplify the estimation. For a 7B model with a concurrency (C) of 1, a batch size (b) of 1, a sequence length (s) of 2048, and param_bytes of 2, the memory for the KV cache can be roughly estimated as 10% of the model's memory footprint. This provides a safe margin. The value is: $1.4 GB$ .
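To see how the 10% rule of thumb compares with the formula, the sketch below evaluates the KV cache expression directly. It treats the proportionality as an equality, which is an assumption that holds for standard multi-head attention where K and V each have width h; models that use grouped-query attention cache less. The values h = 4096 and L = 32 are assumed as typical of a 7B-class model and do not come from this topic, and the function name `kv_cache_gb` is hypothetical.

```python
def kv_cache_gb(batch: int, seq_len: int, hidden: int, layers: int,
                concurrency: int = 1, param_bytes: int = 2) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes): 2 * b * s * h * L * C * param_bytes."""
    total_bytes = 2 * batch * seq_len * hidden * layers * concurrency * param_bytes
    return total_bytes / 1e9


# Assumed 7B-class architecture: hidden size 4096, 32 layers (not from this topic).
size = kv_cache_gb(batch=1, seq_len=2048, hidden=4096, layers=32)
print(f"KV cache: {size:.2f} GB")  # about 1.07 GB, below the 1.4 GB estimate
```

Under these assumptions the direct calculation stays below the 1.4 GB estimate, which is consistent with the estimate being described as a safe margin. Because the formula is linear in s and C, doubling either one doubles the result.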

2.4 Other factors

In addition to the factors above, the input data for the current batch, the CUDA context, and the deep learning framework itself, such as PyTorch or TensorFlow, also consume some GPU memory, typically 1 GB to 2 GB in total.

Based on this analysis, the minimum GPU memory required to deploy a 7B LLM for inference is approximately:

$14 GB + 1.4 GB + 1.4 GB + 2 GB = 18.8 GB$
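The components above can be combined into a small estimator. The following is a rough sketch under stated assumptions, not the calculator's actual logic: the function and parameter names are hypothetical, the 10% ratios are the rule of thumb calibrated above for a batch size of 1, a sequence length of 2048, and a concurrency of 1, and the activation and KV cache terms are assumed to be based on the 16-bit footprint of the activated parameters even when the weights are quantized. Applying the activated parameter count of an MoE model to the KV cache term is also an assumption of this sketch.

```python
from typing import Optional


def inference_memory_gb(total_params_b: float, weight_bits: int = 16,
                        active_params_b: Optional[float] = None,
                        ratio: float = 0.10, other_gb: float = 2.0) -> float:
    """Rough inference memory in GB; parameter counts are given in billions."""
    active = total_params_b if active_params_b is None else active_params_b
    weights = total_params_b * weight_bits / 8   # all parameters must be loaded
    base_16bit = active * 2                      # 16-bit footprint of activated params
    activations = ratio * base_16bit             # rule-of-thumb estimate
    kv_cache = ratio * base_16bit                # rule-of-thumb estimate
    return weights + activations + kv_cache + other_gb


for bits in (16, 8, 4):
    print(f"7B, {bits}-bit weights: {inference_memory_gb(7, bits):.1f} GB")
```

With 16-bit weights the sketch reproduces the 18.8 GB figure above. For an MoE model, pass the activated parameter count as `active_params_b`, for example `inference_memory_gb(671, active_params_b=37)`.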

3. Factors that affect GPU memory for model fine-tuning

The GPU memory required for model fine-tuning consists of the following main components:

3.1 Model parameters

During fine-tuning, the model parameters must be stored in GPU memory. The memory they occupy is: Number of parameters × Bytes per parameter, where the bytes per parameter are determined by the parameter precision. Common precisions are FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). For LLM fine-tuning, parameters are typically stored in FP16 or BF16. For example, for a 7B model with FP16 precision, the required GPU memory is:

$\frac{7 \times 10^{9} \times 2}{10^{9}} = 14 GB$

3.2 Gradient parameters

During the backward propagation phase of model training, gradients must be calculated for the model parameters. The number of gradients is the same as the number of parameters being trained. In LLMs, gradients are typically stored with 2-byte precision. Therefore, for a 7B model, the GPU memory required for gradients varies based on the fine-tuning method:

Fine-tuning method	Training mechanism	Scenarios	GPU memory for gradients (7B model, 2-byte storage; LoRA and QLoRA assume 1% trainable parameters)
Full-parameter fine-tuning	The number of trainable parameters is the same as the number of model parameters.	High-precision requirements with sufficient computing power	14 GB
LoRA (low-rank adaptation)	LoRA fine-tuning freezes the original model parameters and trains only the low-rank matrices. The number of trainable parameters depends on the model structure and the size of the low-rank matrices, typically accounting for 0.1% to 1% of the total model parameters.	Adapting to specific tasks with limited resources	0.14 GB
QLoRA (Quantization + LoRA)	The pre-trained model is compressed to 4-bit or 8-bit. The model is then fine-tuned using LoRA. Double quantization and paged optimizers are introduced to further reduce GPU memory usage. The number of trainable parameters is typically 0.1% to 1% of the total model parameters.	Fine-tuning very large-scale models	0.14 GB

3.3 Optimizer state

During training, the state of the optimizer must also be saved. The number of state values is proportional to the number of trainable parameters, with a factor that depends on the optimizer. Additionally, models often use mixed-precision training. This means model parameters and gradients are stored in 2-byte precision, while the optimizer state is stored in 4-byte precision. This practice ensures high precision during parameter updates and prevents numerical instability or overflow caused by the limited precision and dynamic range of FP16/BF16. Storing each state value in 4 bytes instead of 2 doubles the GPU memory that the optimizer state requires. The following table describes common optimizers:

The last three columns give the GPU memory for the optimizer state of a 7B model with 4-byte storage.

Optimizer type	Parameter update mechanism	Additional storage requirement (each trainable parameter)	Scenarios	Full-parameter fine-tuning	LoRA fine-tuning (1% parameters)	QLoRA fine-tuning (1% parameters)
SGD	Uses only the current gradient	0 (no additional state)	Small models or experiments	0	0	0
SGD + Momentum	Includes momentum term	1 floating-point number (momentum)	Better stability	28 GB	0.28 GB	0.28 GB
RMSProp	Adaptive learning rate	1 floating-point number (second moment)	Non-convex optimization	28 GB	0.28 GB	0.28 GB
Adam/AdamW	Momentum + Adaptive learning rate	2 floating-point numbers (first and second moments)	Common for LLMs	56 GB	0.56 GB	0.56 GB
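The optimizer state figures in the table follow from one multiplication: trainable parameters × state values per parameter × 4 bytes. The sketch below reproduces them. The names `STATE_VALUES_PER_PARAM` and `optimizer_state_gb` are hypothetical, and the 1% trainable fraction for LoRA and QLoRA is the same assumption the table uses.

```python
# Number of extra values each optimizer keeps per trainable parameter (from the table).
STATE_VALUES_PER_PARAM = {
    "sgd": 0,
    "sgd_momentum": 1,
    "rmsprop": 1,
    "adam": 2,
    "adamw": 2,
}


def optimizer_state_gb(total_params_b: float, optimizer: str = "adamw",
                       trainable_fraction: float = 1.0,
                       state_bytes: int = 4) -> float:
    """Optimizer state memory in GB; total_params_b is in billions."""
    trainable_b = total_params_b * trainable_fraction
    return trainable_b * STATE_VALUES_PER_PARAM[optimizer] * state_bytes


for name in STATE_VALUES_PER_PARAM:
    full = optimizer_state_gb(7, name)
    lora = optimizer_state_gb(7, name, trainable_fraction=0.01)
    print(f"{name:>12}: full = {full:.2f} GB, LoRA/QLoRA (1%) = {lora:.2f} GB")
```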

3.4 Activation values

During training, the intermediate activation values from the forward pass must be stored so that gradients can be calculated during the backward pass. This memory consumption grows with the batch size, the sequence length, and the size of the model architecture, such as the number of layers and the hidden layer size. The relationship can be expressed as: $\text{Activation value memory} \propto b \times s \times h \times L \times \text{param\_bytes}$

Where:

b (batch size): The batch size.
s (sequence length): The total sequence length, including input and output (number of tokens).
h (hidden size): The hidden layer dimension of the model.
L (Layers): The number of Transformer layers in the model.
param_bytes: The precision for storing activation values, typically 2 bytes.

3.5 Other factors

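As with inference (see section 2.4), the input data for the current batch, CUDA, and the deep learning framework itself also consume some GPU memory, typically 1 GB to 2 GB.

Based on this analysis, the approximate GPU memory required for fine-tuning a 7B LLM is shown in the following table. The activation value column uses the same rough estimate as section 2.2 (10% of the 16-bit model footprint, or 1.4 GB), and the other factors are counted as 2 GB.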

Fine-tuning method	Model parameters	Gradients	Adam optimizer state	Activation values	Other factors	Total
Full-parameter fine-tuning	14 GB	14 GB	56 GB	1.4 GB	2 GB	87.4 GB
LoRA (low-rank adaptation)	14 GB	0.14 GB	0.56 GB	1.4 GB	2 GB	18.1 GB
QLoRA (8-bit quantization + LoRA)	7 GB	0.14 GB	0.56 GB	1.4 GB	2 GB	11.1 GB
QLoRA (4-bit quantization + LoRA)	3.5 GB	0.14 GB	0.56 GB	1.4 GB	2 GB	7.6 GB
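The rows of the table can be reproduced with the sketch below. It is an illustration under the table's own assumptions (2-byte gradients, Adam with two 4-byte state values per trainable parameter, 1% trainable parameters for LoRA and QLoRA, 1.4 GB of activations per 7B, and 2 GB of other overhead), not the calculator's actual implementation. The function and parameter names are hypothetical.

```python
def finetune_memory_gb(total_params_b: float, full: bool = True,
                       weight_bits: int = 16, trainable_fraction: float = 0.01,
                       other_gb: float = 2.0) -> dict:
    """Rough fine-tuning memory breakdown in GB; total_params_b is in billions."""
    trainable_b = total_params_b if full else total_params_b * trainable_fraction
    parts = {
        "weights": total_params_b * weight_bits / 8,  # frozen or trainable weights
        "gradients": trainable_b * 2,                 # 2 bytes per trainable parameter
        "optimizer": trainable_b * 2 * 4,             # Adam: 2 states x 4 bytes each
        "activations": 0.10 * total_params_b * 2,     # 10% of the 16-bit footprint
        "other": other_gb,
    }
    parts["total"] = sum(parts.values())
    return parts


SCENARIOS = {
    "Full-parameter": dict(full=True),
    "LoRA": dict(full=False),
    "QLoRA (8-bit)": dict(full=False, weight_bits=8),
    "QLoRA (4-bit)": dict(full=False, weight_bits=4),
}

for name, kwargs in SCENARIOS.items():
    print(f"{name:>15}: {finetune_memory_gb(7, **kwargs)['total']:.1f} GB")
```

For a 7B model the totals match the table: 87.4, 18.1, 11.1, and 7.6 GB. Because the activation term is fixed at the rule-of-thumb value, the sketch does not capture the effect of larger batch sizes or longer sequences.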

Note

LLMs typically use the Adam/AdamW optimizer.
In the table, QLoRA model parameters are stored in 4-bit or 8-bit precision, and the optimizer state is stored in 32-bit (4-byte) precision. All other parameters are stored in 16-bit (2-byte) precision.

4. FAQ

Q: How do I check the number of parameters in an LLM?

A: For open source LLMs, the number of parameters is usually indicated in the model name. For example, Qwen-7B has $7 \times 10^{9}$ parameters. Qwen3-235B-A22B has a total of $235 \times 10^{9}$ parameters, with $22 \times 10^{9}$ parameters activated during inference. For models that do not specify the parameter count in their name, you can search for and review the model's documentation to find this information.

Q: How do I check the parameter precision of an LLM?

A: Unless otherwise specified, LLMs typically use 16-bit (2-byte) storage. For quantized models, 8-bit or 4-bit storage may be used. For more information, see the model's documentation. For example, if you use a model from the PAI Model Gallery, the product page usually describes the parameter precision, as in the training instructions for Qwen2.5-7B-Instruct.

Q: How do I check the optimizer and its state precision for LLM fine-tuning?

A: LLM training typically uses the Adam/AdamW optimizer, with the optimizer state stored in 32-bit (4-byte) precision. For more detailed configurations, you can check the launch command or the training code.

Q: How do I check GPU memory usage?

A: You can view GPU memory usage on the graphical monitoring pages of PAI-DSW, PAI-EAS, or PAI-DLC.

Alternatively, you can run the nvidia-smi command in the container's terminal to check GPU usage.
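If you want to read the same numbers from a script, one option is to call nvidia-smi with its CSV query flags and parse the result. The sketch below assumes that nvidia-smi is on the PATH inside the container. The helper names `parse_gpu_memory` and `gpu_memory` are hypothetical.

```python
import subprocess

# With "nounits", nvidia-smi reports the memory fields as plain numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Parse lines such as '0, 1024, 81920' into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"gpu": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def gpu_memory() -> list:
    """Run nvidia-smi and return the current memory usage of every visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

On a GPU host, `gpu_memory()` returns a list with one entry per device. Comparing `used_mib` with `total_mib` before and after loading a model shows how close the workload is to the limit.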

Q: What are common "out of memory" errors?

A: When a process runs out of GPU memory on an NVIDIA GPU, an error such as "CUDA out of memory. Tried to allocate X GB" occurs. When this happens, you must use a GPU specification with more memory or reduce memory-related settings, such as the batch size or sequence length.
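One common way to act on this advice automatically is to retry with a smaller batch size when the error appears. The sketch below is framework-agnostic and makes two assumptions: the out-of-memory error surfaces as a RuntimeError whose message contains "out of memory" (PyTorch's CUDA out-of-memory error is a RuntimeError subclass), and `step_fn` is a hypothetical callable that runs one training or inference step for a given batch size. The `fake_step` function only simulates a memory limit for demonstration.

```python
def run_with_backoff(step_fn, batch_size: int, min_batch_size: int = 1):
    """Call step_fn(batch_size), halving the batch size after each OOM error."""
    while True:
        try:
            return batch_size, step_fn(batch_size)
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise  # not an OOM error, or nothing left to reduce
            # In a real PyTorch job you would typically free cached memory here.
            batch_size = max(min_batch_size, batch_size // 2)


def fake_step(batch_size: int) -> str:
    """Stand-in for a real step: pretends that batches above 4 do not fit."""
    if batch_size > 4:
        raise RuntimeError("CUDA out of memory. Tried to allocate X GB")
    return "ok"


final_batch, result = run_with_backoff(fake_step, batch_size=32)
print(final_batch, result)  # 4 ok
```

Halving is only one possible policy. If the batch size is already 1, reducing the sequence length or moving to a GPU with more memory are the remaining options.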