Learn what factors determine GPU memory requirements for LLM deployment and fine-tuning, and how to estimate the memory needed.
Simple GPU memory estimator
- These estimates use general calculation methods. Actual GPU memory usage may differ due to model-specific network structures and algorithms.
- For MoE models like DeepSeek-R1-671B, all 671B parameters must be loaded, but only 37B are activated during inference. Use the activated parameter count when calculating activation memory.
- During fine-tuning, parameters, activations, and gradients typically use 16-bit storage. The Adam/AdamW optimizer stores states in 32-bit format.
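As a rough illustration of the MoE note, the sketch below keeps the two parameter counts separate for DeepSeek-R1-671B. It assumes 16-bit (2-byte) storage and a rule of thumb that activations take about 10% of the 16-bit size of the activated parameters; the function name `moe_memory_gb` is hypothetical.

```python
def moe_memory_gb(total_params_b, active_params_b, param_bytes=2, activation_ratio=0.10):
    """Return (parameter_gb, activation_gb) for an MoE model.

    Counts are in billions of parameters; 1 billion x 1 byte = 1 GB (10^9 bytes).
    """
    # Every expert must be resident in GPU memory, so use the total count.
    parameter_gb = total_params_b * param_bytes
    # Only the activated parameters produce activations, so use the active count.
    # The 10% ratio is an assumed rule of thumb, not a measured value.
    activation_gb = active_params_b * param_bytes * activation_ratio
    return parameter_gb, activation_gb


params_gb, act_gb = moe_memory_gb(total_params_b=671, active_params_b=37)
print(f"parameters: {params_gb:.0f} GB, activations: {act_gb:.1f} GB")
```

Under these assumptions the parameter term dominates by a wide margin, which is why an MoE model still needs memory for its full parameter count.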
Inference
| Scenario | Required GPU Memory (GB) |
|---|---|
| Inference (16-bit) | - |
| Inference (8-bit) | - |
| Inference (4-bit) | - |
Fine-tuning
| Scenario | Required GPU Memory (GB) |
|---|---|
| Full-parameter fine-tuning | - |
| LoRA fine-tuning | - |
| QLoRA (8-bit) fine-tuning | - |
| QLoRA (4-bit) fine-tuning | - |
GPU memory components for inference
GPU memory for inference consists of:
Model parameters
Model parameters must reside in GPU memory during inference. The formula is: parameter count × parameter precision (bytes per parameter). Common precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). LLMs typically use FP16 or BF16. For example, a 7B FP16 model requires:

$$7 \times 10^{9} \times 2\ \text{bytes} = 14 \times 10^{9}\ \text{bytes} = 14\ \text{GB}$$
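The same multiplication extends to other precisions. The following is a minimal sketch, assuming that 8-bit and 4-bit weights take 1 and 0.5 bytes per parameter; the names `BYTES_PER_PARAM` and `parameter_memory_gb` are illustrative and do not come from any library.

```python
# Bytes per parameter for common storage precisions.
# The INT8 and INT4 labels are assumptions used here for 8-bit and 4-bit weights.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}


def parameter_memory_gb(params_billion, precision):
    """Parameter memory in GB: 1 billion parameters x 1 byte = 1 GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: {parameter_memory_gb(7, precision):g} GB")
```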
Activation values
During inference, activation values for each layer must be calculated. Memory usage scales with batch size, sequence length, and model architecture (layer count and hidden size):

$$\text{Activation memory} = b \times s \times h \times L \times \text{param\_bytes}$$
Where:
- b (batch size): Batch size per request. Typically 1 for online services, higher for batch processing.
- s (sequence length): Total token count (input + output).
- h (hidden size): Hidden layer dimension.
- L (Layers): Number of Transformer layers.
- param_bytes: Storage precision for activations, typically 2 bytes.
In practice, for a 7B model with b=1, s=2048, and param_bytes=2, activation memory is approximately 10% of the model's parameter memory:

$$14\ \text{GB} \times 10\% = 1.4\ \text{GB}$$
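For comparison, the terms listed above can be multiplied out directly. The sketch below assumes the activation formula is the product b × s × h × L × param_bytes, and it assumes h=4096 and L=32, which are common for 7B-class models but are not stated in this document.

```python
def activation_memory_gb(b, s, h, L, param_bytes=2):
    """Activation memory in GB (10^9 bytes): b * s * h * L * param_bytes."""
    return b * s * h * L * param_bytes / 1e9


# h=4096 and L=32 are assumed values for a 7B-class model.
formula_gb = activation_memory_gb(b=1, s=2048, h=4096, L=32)
# Rule of thumb: 10% of the 14 GB needed for 7B parameters at 2 bytes each.
rule_of_thumb_gb = 0.10 * 14
print(f"formula: {formula_gb:.2f} GB, 10% rule: {rule_of_thumb_gb:.1f} GB")
```

Under these assumed dimensions the product comes out below the 10% figure, so the rule of thumb leaves some headroom.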
KV cache
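LLMs cache the Key (K) and Value (V) matrices for each Transformer layer to avoid recomputing attention for all historical tokens. This reduces the computation for each newly generated token from O(n²) to O(n), where n is the number of tokens so far. The formula is:

$$\text{KV cache memory} = 2 \times b \times s \times h \times L \times C \times \text{param\_bytes}$$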
Where:
- 2: Two matrices stored (Key and Value).
- b (batch size): Batch size per request. Typically 1 for online services, higher for batch processing.
- s (sequence length): Total token count (input + output).
- h (hidden size): Hidden layer dimension.
- L (Layers): Number of Transformer layers.
- C (Concurrency): Number of concurrent requests.
- param_bytes: Storage precision, typically 2 bytes.
In practice, for a 7B model with C=1, b=1, s=2048, and param_bytes=2, KV cache memory is approximately 10% of the model's parameter memory:

$$14\ \text{GB} \times 10\% = 1.4\ \text{GB}$$
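To see how the KV cache grows with concurrency, the terms listed above can be multiplied out. This is a sketch only: it assumes the formula is the product 2 × b × s × h × L × C × param_bytes, and it assumes h=4096 and L=32 for a 7B-class model, values that are not given in this document.

```python
def kv_cache_memory_gb(b, s, h, L, C=1, param_bytes=2):
    """KV cache memory in GB (10^9 bytes): 2 * b * s * h * L * C * param_bytes."""
    return 2 * b * s * h * L * C * param_bytes / 1e9


# h=4096 and L=32 are assumed values for a 7B-class model.
single = kv_cache_memory_gb(b=1, s=2048, h=4096, L=32, C=1)
eight = kv_cache_memory_gb(b=1, s=2048, h=4096, L=32, C=8)
print(f"C=1: {single:.2f} GB, C=8: {eight:.2f} GB")
```

Because every term is a plain factor, doubling the concurrency or the sequence length doubles the cache.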
Other factors
Input data, the CUDA context and kernels, and the deep learning framework (PyTorch or TensorFlow) also consume 1-2 GB of GPU memory.
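Minimum GPU memory for 7B LLM inference, taking 2 GB for other factors:

$$14\ \text{GB} + 1.4\ \text{GB} + 1.4\ \text{GB} + 2\ \text{GB} = 18.8\ \text{GB}$$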
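The inference components can be added up in a small helper. The following is a minimal sketch with a hypothetical function name, `inference_memory_gb`. It assumes that activations and the KV cache each take 10% of the 16-bit parameter size even when the weights are quantized (activations are typically stored in 2 bytes), and it uses the upper 2 GB figure for other overhead.

```python
def inference_memory_gb(params_billion, weight_bytes=2, overhead_gb=2.0, ratio=0.10):
    """Rough inference memory in GB: weights + activations + KV cache + other."""
    weights = params_billion * weight_bytes
    # Assumption: the 10% rule is applied to the 16-bit (2-byte) parameter size,
    # regardless of the precision used to store the weights.
    fp16_size = params_billion * 2
    activations = ratio * fp16_size
    kv_cache = ratio * fp16_size
    return weights + activations + kv_cache + overhead_gb


for label, weight_bytes in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"7B inference ({label}): {inference_memory_gb(7, weight_bytes):.1f} GB")
```

Quantizing the weights shrinks only the first term in this sketch, so the savings are smaller than the drop in weight precision alone would suggest.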
GPU memory components for fine-tuning
GPU memory for fine-tuning consists of:
Model parameters
Model parameters must reside in GPU memory during fine-tuning. The formula is: parameter count × parameter precision (bytes per parameter). Common precisions include FP32 (4 bytes), FP16 (2 bytes), and BF16 (2 bytes). LLMs typically use FP16 or BF16. For example, a 7B FP16 model requires:

$$7 \times 10^{9} \times 2\ \text{bytes} = 14 \times 10^{9}\ \text{bytes} = 14\ \text{GB}$$
Gradient parameters
During backpropagation, gradients are computed for each trainable parameter. LLMs typically store gradients in 2-byte precision. GPU memory for gradients varies by fine-tuning method:
| Fine-tuning method | Training mechanism | Scenarios | Gradient memory for 7B model fine-tuning (2-byte storage; LoRA and QLoRA calculated with 1% parameters) |
|---|---|---|---|
| Full-parameter fine-tuning | Parameters to train are the same as the model's own parameters | High-precision requirements with sufficient computing power | 14 GB |
| LoRA (low-rank adaptation) | LoRA fine-tuning freezes original model parameters and only trains low-rank matrices. The parameters to be trained depend on the model structure and the size of the low-rank matrices, typically accounting for about 0.1% to 1% of the total model parameters. | Low-resource adaptation for specific tasks | 0.14 GB |
| QLoRA (quantization + LoRA) | Compresses the pre-trained model to 4-bit or 8-bit, uses LoRA to fine-tune the model, and introduces double quantization and paged optimizers, further reducing GPU memory usage. The parameters to be trained typically account for about 0.1% to 1% of the total model parameters. | Fine-tuning for ultra-large models | 0.14 GB |
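The gradient figures follow from multiplying the trainable parameter count by the gradient precision. A minimal sketch, with `gradient_memory_gb` as a hypothetical helper name and the trainable fraction passed in explicitly:

```python
def gradient_memory_gb(params_billion, trainable_fraction=1.0, grad_bytes=2):
    """Gradient memory in GB: trainable parameters (billions) x bytes per gradient."""
    return params_billion * trainable_fraction * grad_bytes


full = gradient_memory_gb(7)                                 # all parameters trainable
lora_high = gradient_memory_gb(7, trainable_fraction=0.01)   # 1% trainable
lora_low = gradient_memory_gb(7, trainable_fraction=0.001)   # 0.1% trainable
print(f"full: {full:g} GB, LoRA 1%: {lora_high:.2f} GB, LoRA 0.1%: {lora_low:.3f} GB")
```

The 0.1% case shows the lower end of the range quoted for LoRA and QLoRA; either way, gradients are a small term once the base model is frozen.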
Optimizer states
Optimizer states must be stored during training. Mixed-precision training stores parameters and gradients in 2 bytes but optimizer states in 4 bytes to prevent numerical instability from FP16/BF16's limited dynamic range. Common optimizers:
| Optimizer type | Parameter update mechanism | Additional storage requirement (per parameter to be trained) | Scenarios | 7B optimizer states, full-parameter fine-tuning (4-byte storage) | 7B optimizer states, LoRA fine-tuning (1% parameters, 4-byte storage) | 7B optimizer states, QLoRA fine-tuning (1% parameters, 4-byte storage) |
|---|---|---|---|---|---|---|
| SGD | Uses only current gradients | 0 (no additional states) | Small models or experiments | 0 | 0 | 0 |
| SGD + Momentum | With momentum term | 1 floating-point number (momentum) | Better stability | 28 GB | 0.28 GB | 0.28 GB |
| RMSProp | Adaptive learning rate | 1 floating-point number (second moment) | Non-convex optimization | 28 GB | 0.28 GB | 0.28 GB |
| Adam/AdamW | Momentum + adaptive learning rate | 2 floating-point numbers (first + second moment) | Common for Large Language Models (LLMs) | 56 GB | 0.56 GB | 0.56 GB |
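The optimizer figures can be reproduced from the number of extra values each optimizer keeps per trainable parameter. This is a sketch with illustrative names (`STATES_PER_PARAM`, `optimizer_state_memory_gb`), assuming 4-byte states as described above.

```python
# Extra floating-point values stored per trainable parameter.
STATES_PER_PARAM = {"sgd": 0, "sgd_momentum": 1, "rmsprop": 1, "adam": 2}


def optimizer_state_memory_gb(params_billion, optimizer,
                              trainable_fraction=1.0, state_bytes=4):
    """Optimizer state memory in GB for the trainable share of the parameters."""
    trainable_billion = params_billion * trainable_fraction
    return trainable_billion * STATES_PER_PARAM[optimizer] * state_bytes


for name in STATES_PER_PARAM:
    full = optimizer_state_memory_gb(7, name)
    lora = optimizer_state_memory_gb(7, name, trainable_fraction=0.01)
    print(f"{name}: full {full:g} GB, LoRA/QLoRA (1%) {lora:.2f} GB")
```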
Activation values
Forward propagation generates intermediate activations needed for gradient computation during backpropagation. Memory scales with batch size, sequence length, and model architecture:

$$\text{Activation memory} = b \times s \times h \times L \times \text{param\_bytes}$$
Where:
- b (batch size): Training batch size.
- s (sequence length): Total token count (input + output).
- h (hidden size): Hidden layer dimension.
- L (Layers): Number of Transformer layers.
- param_bytes: Storage precision for activations, typically 2 bytes.
In practice, for a 7B model with b=1, s=2048, and param_bytes=2, activation memory is approximately 10% of the model's 16-bit parameter memory:

$$14\ \text{GB} \times 10\% = 1.4\ \text{GB}$$
Other factors
Input data, the CUDA context and kernels, and the deep learning framework (PyTorch or TensorFlow) also consume 1-2 GB of GPU memory.
Approximate GPU memory for fine-tuning a 7B LLM:
| Fine-tuning method | GPU memory required for the model itself | GPU memory required for gradients | Adam optimizer states | Activation values | Other | Total |
|---|---|---|---|---|---|---|
| Full-parameter fine-tuning | 14 GB | 14 GB | 56 GB | 1.4 GB | 2 GB | 87.4 GB |
| LoRA (low-rank adaptation) | 14 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 18.1 GB |
| QLoRA (8-bit quantization + LoRA) | 7 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 11.1 GB |
| QLoRA (4-bit quantization + LoRA) | 3.5 GB | 0.14 GB | 0.56 GB | 1.4 GB | 2 GB | 7.6 GB |
- LLMs typically use the Adam/AdamW optimizer.
- In the table, all parameters use 16-bit (2-byte) storage, except for QLoRA models that use 4-bit or 8-bit storage and optimizer states that use 32-bit (4-byte) storage.
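The totals in the table can be reproduced by summing the five components. The following is a minimal sketch under the table's own assumptions (1% trainable parameters for LoRA and QLoRA, 2-byte gradients, 4-byte Adam states, activations at 10% of the 16-bit model size, 2 GB of other overhead); the names `METHODS` and `finetune_memory_gb` are hypothetical.

```python
# method -> (bytes per stored base-model weight, fraction of parameters trained)
METHODS = {
    "full": (2, 1.0),
    "lora": (2, 0.01),
    "qlora_8bit": (1, 0.01),
    "qlora_4bit": (0.5, 0.01),
}


def finetune_memory_gb(params_billion, method, overhead_gb=2.0):
    """Rough fine-tuning memory in GB with the Adam/AdamW optimizer."""
    weight_bytes, trainable_fraction = METHODS[method]
    trainable_billion = params_billion * trainable_fraction
    weights = params_billion * weight_bytes
    gradients = trainable_billion * 2          # 2-byte gradients
    adam_states = trainable_billion * 2 * 4    # two 4-byte moments per parameter
    activations = 0.10 * params_billion * 2    # 10% of the 16-bit model size
    return weights + gradients + adam_states + activations + overhead_gb


for method in METHODS:
    print(f"7B {method}: {finetune_memory_gb(7, method):.1f} GB")
```

Changing `params_billion` gives a first-pass estimate for other model sizes, subject to the same simplifications.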
FAQ
Q: How do I find an LLM's parameter count?
For open-source LLMs, the parameter count is usually in the model name. For example, Qwen-7B has a parameter count of 7B (7 billion).
Q: How do I find an LLM's parameter precision?
LLMs typically use 16-bit (2-byte) storage unless otherwise specified. Quantized models may use 8-bit or 4-bit. Check the model documentation for details. For PAI Model Gallery models, the product page shows the parameter precision, for example in the training instructions for Qwen2.5-7B-Instruct.

Q: How do I find the optimizer and state precision for fine-tuning?
LLM training typically uses the Adam/AdamW optimizer, with optimizer states stored in 32-bit (4-byte) precision. Check the start command or training code for details.
Q: How do I view GPU memory usage?
View GPU memory usage on the monitoring page of PAI-DSW, PAI-EAS, or PAI-DLC.

Alternatively, run nvidia-smi in the container terminal.
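For scripted checks, the nvidia-smi query mode can be called from Python. This is a sketch that assumes nvidia-smi is on the PATH and reports plain integers in MiB; the helper names and the sample text are made up for illustration, and the sample only demonstrates the parsing step on a machine without a GPU.

```python
import subprocess

# Query used and total memory per GPU as bare numbers (MiB), one line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse 'used, total' lines into a list of (used_mib, total_mib) tuples."""
    usage = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def gpu_memory_usage():
    """Run nvidia-smi and return per-GPU (used_mib, total_mib); needs a GPU host."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Made-up sample text in the same shape as the query output.
sample = "1024, 16384\n2048, 16384\n"
print(parse_memory_csv(sample))
```

On a GPU host, `gpu_memory_usage()` returns one tuple per device, which can be logged periodically during inference or training.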

Q: What are common out-of-GPU-memory errors?
Insufficient GPU memory triggers an error such as "CUDA out of memory. Tried to allocate X GB". To resolve it, use a GPU with more memory, or reduce the batch size, sequence length, or other parameters that affect memory usage.