DeepSeek-R1 excels at math, coding, and reasoning tasks. DeepSeek also open-sourced six dense models, based on Llama and Qwen, that were distilled from DeepSeek-R1. This topic demonstrates how to fine-tune DeepSeek-R1-Distill-Qwen-7B in PAI Model Gallery.
Supported models
Model Gallery supports LoRA supervised fine-tuning (SFT) for all six distill models. The following table lists the minimum configurations when training with the default parameters:

| Distill model | Base model | Training method | Minimum configuration |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | LoRA supervised fine-tuning | 1 x A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | LoRA supervised fine-tuning | 1 x A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | LoRA supervised fine-tuning | 1 x A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | LoRA supervised fine-tuning | 1 x GU8IS (48 GB video memory) |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | LoRA supervised fine-tuning | 2 x GU8IS (48 GB video memory) |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | LoRA supervised fine-tuning | 8 x GU100 (80 GB video memory) |
Train the model
1. Go to the Model Gallery page.
   1. Log on to the PAI console.
   2. In the upper-left corner, select a region.
   3. In the left pane, click Workspaces. On the Workspaces page, click the name of a workspace.
   4. In the left pane, choose QuickStart > Model Gallery.
2. On the Model Gallery page, click the DeepSeek-R1-Distill-Qwen-7B model card to go to the model details page. This page displays deployment and training details, the SFT data format, and invocation methods.

3. Click Train in the upper-right corner and configure the following key parameters:
   - Dataset Configuration: Upload your prepared data to an OSS bucket and select it here. A minimal example of the data format follows the hyperparameter table below.
   - Computing Resources: The minimum configurations are listed in Supported models. Adjusting hyperparameters may require more memory.
   - Hyperparameters: Adjust the following LoRA SFT hyperparameters based on your data and resources. For details, see Fine-tune LLMs.
| Hyperparameter | Type | Default value (7B model) | Description |
| --- | --- | --- | --- |
| learning_rate | float | 5e-6 | Controls the magnitude of weight updates at each optimization step. |
| num_train_epochs | int | 6 | Number of training epochs (full passes over the dataset). |
| per_device_train_batch_size | int | 2 | Samples per GPU per iteration. Higher values increase efficiency and memory usage. |
| gradient_accumulation_steps | int | 2 | Number of steps over which gradients are accumulated before a weight update. The effective batch size equals per_device_train_batch_size × gradient_accumulation_steps × GPU count (2 × 2 × 1 = 4 with the defaults on a single GPU). |
| max_length | int | 1024 | Maximum number of tokens per sample. Samples exceeding this limit are discarded (see FAQ). |
| lora_rank | int | 8 | Rank (inner dimension) of the LoRA adapter matrices. |
| lora_alpha | int | 32 | LoRA scaling factor. |
| lora_dropout | float | 0 | Dropout rate applied in the LoRA layers. Randomly drops neurons during training to prevent overfitting. |
| lorap_lr_ratio | float | 16 | LoRA+ learning rate ratio (λ = ηB/ηA), which applies different learning rates to adapter matrices A and B. Set to 0 for standard LoRA. |
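The following is a minimal data-preparation sketch. The instruction/output JSON Lines schema is an assumption based on common PAI SFT formats; confirm the exact fields in the SFT data format section of the model card before uploading.

```python
# Hypothetical data-preparation sketch -- the instruction/output schema is
# an assumption; check the model card's SFT data format for the real fields.
import json

samples = [
    {
        "instruction": "What is 17 * 24? Show your reasoning.",
        "output": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    },
]

# Write one JSON object per line (JSON Lines).
with open("train.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Upload the file to OSS, for example with the ossutil CLI:
#   ossutil cp train.json oss://your-bucket/sft-data/train.json
```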
4. Click Train to submit the job. The training job page shows the job status and logs.

   - If training succeeds, the model is registered in AI Asset Management - Models, from which it can be deployed. See Register and manage models.
   - If training fails, click the icon next to Status or check the Task log tab. For common errors, see the FAQ below and Model Gallery FAQ.
   - The Metric Curve tab shows the loss progression.

5. After training completes, click Deploy to create an EAS service for the fine-tuned model. Invocation is the same as for the original distill model; see the model details page or Deploy DeepSeek-V3 and DeepSeek-R1 models.
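The following invocation sketch assumes the service exposes an OpenAI-compatible chat endpoint, as described in Deploy DeepSeek-V3 and DeepSeek-R1 models. Copy the actual endpoint URL and token from the service's invocation information in the EAS console; the values below are placeholders.

```python
# Hedged invocation sketch: endpoint and token are placeholders from the
# EAS console; the /v1/chat/completions path assumes an OpenAI-compatible
# serving engine (for example, vLLM).
import requests

EAS_ENDPOINT = "http://<service>.<region>.pai-eas.aliyuncs.com/api/predict/<service_name>"
EAS_TOKEN = "<your-service-token>"

resp = requests.post(
    f"{EAS_ENDPOINT}/v1/chat/completions",
    headers={"Authorization": EAS_TOKEN, "Content-Type": "application/json"},
    json={
        "model": "DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 1024,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```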

Billing
Model Gallery training runs on DLC and is billed by job duration. Resources are released automatically when a job ends. See Billing of DLC.
FAQ
Why does my Model Gallery training job fail?
- Cause: `max_length` is too small. Data that exceeds this limit is discarded.
  Solution: Increase `max_length`. If too much data is discarded, the training or validation dataset may become empty, causing the job to fail. The sketch after this list shows how to estimate a suitable value.
- Error: `failed to compose dlc job specs, resource limiting triggered, you are trying to use more GPU resources than the threshold`
  Solution: Training is limited to 2 simultaneous GPUs. Wait for ongoing jobs to finish, or submit a ticket to increase the quota.
- Error: `the specified vswitch vsw-**** cannot create the required resource ecs.gn7i-c32g1.8xlarge, zone not match`
  Solution: The requested instance type is unavailable in the current zone. Try one of the following:
  - Leave vSwitch empty. DLC auto-selects one based on inventory.
  - Switch to a different instance type.
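To estimate a suitable `max_length`, you can count how many samples would exceed a candidate limit before training. This sketch assumes the instruction/output JSON Lines format from Dataset Configuration; the file name is a placeholder.

```python
# Count training samples that would exceed a candidate max_length.
# Assumes instruction/output JSON Lines; train.json is a placeholder name.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
max_length = 1024

total, too_long = 0, 0
with open("train.json", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        n_tokens = len(tokenizer.encode(sample["instruction"] + sample["output"]))
        total += 1
        too_long += n_tokens > max_length

print(f"{too_long}/{total} samples exceed max_length={max_length}")
```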
How do I download the trained model from Model Gallery?
Set the model output path to an OSS directory when creating the training job, then download the model from OSS.
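For example, the following sketch downloads the output directory with the OSS Python SDK (oss2). The endpoint, bucket name, credentials, and output path are placeholders; use the model output path you set when creating the training job.

```python
# Download every object under the model output prefix to a local folder.
# All names below are placeholders -- use your own bucket and output path.
import os
import oss2

auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")

prefix = "model-output-path/"
for obj in oss2.ObjectIterator(bucket, prefix=prefix):
    if obj.key.endswith("/"):
        continue  # skip directory placeholder objects
    local_path = os.path.join("model", os.path.relpath(obj.key, prefix))
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    bucket.get_object_to_file(obj.key, local_path)
    print("downloaded", obj.key)
```

You can also download the directory with a single ossutil command, for example `ossutil cp -r oss://your-bucket/model-output-path/ ./model/`.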

How can I improve poor model performance after fine-tuning?
Try the following approaches:
- Use a larger model with better baseline performance, such as a DeepSeek or Qwen3 series model with a higher parameter count.
- Refine your prompts.
- Increase max_tokens.
- Break complex tasks into smaller subtasks for the model to handle separately.
What is the required OSS path structure for custom model uploads?
When uploading a fine-tuned model to OSS for deployment, the model directory must contain all required configuration files. Missing files like generation_config.json can cause deployment failures.
Required OSS path structure:
oss://your-bucket/model-path/
├── config.json # Model architecture configuration
├── generation_config.json # Generation parameters (required)
├── tokenizer_config.json # Tokenizer configuration
├── tokenizer.json # Tokenizer vocabulary
├── special_tokens_map.json # Special tokens mapping
├── pytorch_model.bin # Model weights (or .safetensors)
├── adapter_config.json # LoRA adapter config (for LoRA models)
└── adapter_model.bin # LoRA adapter weights (for LoRA models)
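Before deploying, you can check for missing files directly in OSS. This sketch uses the oss2 SDK; the bucket, endpoint, and path are placeholders, and for LoRA models you would extend REQUIRED with adapter_config.json and adapter_model.bin.

```python
# Pre-deployment sanity check: verify required files exist under the model path.
import oss2

auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")

REQUIRED = [
    "config.json",
    "generation_config.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "special_tokens_map.json",
]

missing = [name for name in REQUIRED if not bucket.object_exists(f"model-path/{name}")]
print("missing files:", missing or "none")
```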
Example generation_config.json:
{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.0,
  "temperature": 0.3,
  "top_k": 20,
  "top_p": 0.8,
  "transformers_version": "4.37.0"
}
Solutions if files are missing:
- Copy from the base model: If `generation_config.json` is missing, copy it from the original base model directory used for fine-tuning.
- Create it manually: Create a `generation_config.json` file with default parameters (see the example above). Adjust `bos_token_id`, `eos_token_id`, and `pad_token_id` to match your tokenizer.
- Upload the complete model directory: When setting the model output path during training (see How do I download the trained model from Model Gallery?), ensure the training job saves all configuration files, not just the model weights.
For LoRA fine-tuned models, keep both the base model files and the adapter files in the same directory, or configure separate paths for the base model and the adapter in your deployment configuration. Alternatively, merge the adapter into the base model before uploading, as sketched below.
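The following optional sketch merges a LoRA adapter into its base model so the result can be uploaded as a single self-contained directory. It assumes the adapter is PEFT-compatible; the adapter and output paths are placeholders.

```python
# Merge a LoRA adapter into its base model and save a complete directory,
# including config.json and tokenizer files. Paths are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base_model, "path/to/adapter").merge_and_unload()

merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(BASE).save_pretrained("merged-model")
# After merging, verify the output directory against the required structure above.
```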