
Platform for AI: One-click fine-tuning of DeepSeek-R1 distill models

Last Updated: Mar 24, 2025

DeepSeek-R1 is the first-generation reasoning model developed by DeepSeek, excelling in mathematical, coding, and reasoning tasks. DeepSeek has open-sourced the DeepSeek-R1 model and six dense models distilled from DeepSeek-R1 based on Llama and Qwen, all of which have shown impressive performance in various benchmarks. This topic takes DeepSeek-R1-Distill-Qwen-7B as an example to describe how to fine-tune these models in Model Gallery of Platform for AI (PAI).

Supported models

PAI Model Gallery supports LoRA supervised fine-tuning (SFT) for the six distill models. The following table lists the recommended minimum configuration for each model under the default hyperparameters and datasets:

| Distill model | Base model | Training method | Minimum configuration |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | LoRA supervised fine-tuning | 1 x A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | LoRA supervised fine-tuning | 1 x A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | LoRA supervised fine-tuning | 1 x A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | LoRA supervised fine-tuning | 1 x GU8IS (48 GB video memory) |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | LoRA supervised fine-tuning | 2 x GU8IS (48 GB video memory) |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | LoRA supervised fine-tuning | 8 x GU100 (80 GB video memory) |

Train the model

  1. Go to the Model Gallery page.

    1. Log on to the PAI console.

    2. In the upper-left corner, select a region based on your business requirements.

    3. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.

    4. In the left-side navigation pane, choose QuickStart > Model Gallery.

  2. On the Model Gallery page, click the DeepSeek-R1-Distill-Qwen-7B model card to go to the details page.

    This page provides detailed information on model deployment and training, such as the SFT data format and the invocation method.

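    The exact data format is described on this details page. As a rough illustration only, the following Python sketch writes an instruction-tuning dataset as JSON Lines; the "instruction" and "output" field names and the <think> tags are assumptions for illustration, so confirm the schema on the details page before preparing your data.

    ```python
    # Illustrative sketch: write a small SFT dataset as JSON Lines.
    # The field names ("instruction"/"output") are assumptions; confirm the
    # exact schema on the model details page before training.
    import json

    samples = [
        {
            "instruction": "What is 17 * 24?",
            "output": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think> The answer is 408.",
        },
        {
            "instruction": "Write a Python function that reverses a string.",
            "output": "<think>Slicing with step -1 reverses a sequence.</think> Use s[::-1].",
        },
    ]

    with open("train.json", "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    ```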

  3. Click Train in the upper-right corner and configure the following key parameters:

    • Dataset Configuration: After you prepare the data, upload it to an Object Storage Service (OSS) bucket. A minimal upload sketch follows the hyperparameter table below.

    • Computing Resources: Choose suitable resources. The minimum configuration required under the default setup is listed in Supported models. If you increase hyperparameters such as max_length or per_device_train_batch_size, more video memory may be required.

    • Hyperparameters: The following table describes the hyperparameters supported by LoRA SFT. Adjust these based on your data and computing resources. For more information, see Guide to fine-tuning LLMs.

      | Hyperparameter | Type | Default value (7B model) | Description |
      | --- | --- | --- | --- |
      | learning_rate | float | 5e-6 | The learning rate, which controls the magnitude of model weight adjustments. |
      | num_train_epochs | int | 6 | The number of times the training dataset is reused during training. |
      | per_device_train_batch_size | int | 2 | The number of samples processed by each GPU in one training iteration. A higher value increases both training efficiency and memory usage. |
      | gradient_accumulation_steps | int | 2 | The number of gradient accumulation steps. The effective batch size equals per_device_train_batch_size × gradient_accumulation_steps × the number of GPUs (2 × 2 × 1 = 4 with the defaults on a single A10). |
      | max_length | int | 1024 | The maximum number of tokens per training sample. Samples that exceed this length are deleted (see Usage notes). |
      | lora_rank | int | 8 | The LoRA rank (the dimension of the adapter matrices). |
      | lora_alpha | int | 32 | The LoRA scaling factor for the adapter weights. |
      | lora_dropout | float | 0 | The LoRA dropout rate. Randomly dropping neurons during training helps prevent overfitting. |
      | lorap_lr_ratio | float | 16 | The LoRA+ learning rate ratio λ = η_B / η_A, where η_A and η_B are the learning rates of the adapter matrices A and B. Compared with standard LoRA, LoRA+ applies different learning rates to the two matrices, which yields better performance and faster fine-tuning without increasing computational demands. When lorap_lr_ratio is set to 0, standard LoRA is used instead of LoRA+. |
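
    For the dataset upload described under Dataset Configuration, the following minimal sketch uses the oss2 Python SDK. The endpoint, bucket name, and object key are placeholders, and the credentials are read from environment variables; substitute your own values.

    ```python
    # Minimal sketch: upload the prepared training file to an OSS bucket.
    # Endpoint, bucket, and object key are placeholders for illustration.
    import os
    import oss2

    auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-training-bucket")

    # After the upload, reference oss://my-training-bucket/datasets/train.json
    # in the job's Dataset Configuration.
    bucket.put_object_from_file("datasets/train.json", "train.json")
    ```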

  4. Click Train. You will be redirected to the model training page, and the training will begin. Here, you can view the status and logs of the training job.


    • If the training is successful, the model will be automatically registered in AI Asset Management - Models, where you can view or deploy it. For more information, see Register and manage models.

    • If the training fails, click the icon next to Status to find the cause, or go to the Task log tab for more information. For common training errors and solutions, see Usage notes and FAQ about Model Gallery.

    • The Metric Curve section at the bottom of the training page displays the loss progression during training.


  5. After the training succeeds, click Deploy in the upper-right corner to deploy the trained model as an EAS service. The invocation method of the deployed model is the same as that of the original distill model. For details, see the model details page or One-click deployment of DeepSeek-V3 and DeepSeek-R1; an illustrative invocation sketch follows.

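    As a rough illustration: with an accelerated deployment, the service typically exposes an OpenAI-compatible chat endpoint, as described in One-click deployment of DeepSeek-V3 and DeepSeek-R1. The endpoint URL, token, and model name below are placeholders; take the real values from the service's invocation information.

    ```python
    # Illustrative sketch: call the deployed EAS service, assuming an
    # OpenAI-compatible chat endpoint. Replace the placeholders with the
    # endpoint and token shown on the service's invocation information page.
    import requests

    ENDPOINT = "https://<your-eas-endpoint>"  # placeholder
    TOKEN = "<your-eas-token>"                # placeholder

    resp = requests.post(
        f"{ENDPOINT}/v1/chat/completions",
        headers={"Authorization": TOKEN, "Content-Type": "application/json"},
        json={
            "model": "DeepSeek-R1-Distill-Qwen-7B",
            "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
    ```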

Billing

Training in Model Gallery runs on the computing resources of Deep Learning Containers (DLC), which bills based on the duration of training jobs. After your training job ends, resource consumption stops automatically; you do not need to stop it manually. For more information, see Billing of DLC.

Usage notes

Troubleshooting job failure

  • When training, set an appropriate max_length (a hyperparameter in the training configuration). The training algorithm deletes any sample that exceeds max_length and notes the deletion in the task log.

    Excessive data deletion may leave the training or validation dataset empty, which causes the training job to fail. You can check sample lengths in advance, as in the sketch below.

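    To estimate how many samples would be dropped before you start a job, you can count token lengths with the model's tokenizer. This is an illustrative sketch: it assumes the JSON Lines instruction/output format sketched earlier and loads the tokenizer from the public Hugging Face model ID.

    ```python
    # Sketch: count samples whose token length exceeds max_length. Assumes a
    # JSON Lines dataset with "instruction"/"output" fields (match your schema).
    import json
    from transformers import AutoTokenizer

    MAX_LENGTH = 1024
    tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

    total = too_long = 0
    with open("train.json", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            n_tokens = len(tokenizer(sample["instruction"] + sample["output"])["input_ids"])
            total += 1
            if n_tokens > MAX_LENGTH:
                too_long += 1

    print(f"{too_long}/{total} samples exceed max_length={MAX_LENGTH}")
    ```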

  • You may encounter the following error log: failed to compose dlc job specs, resource limiting triggered, you are trying to use more GPU resources than the threshold. This indicates that training jobs are limited to 2 GPUs in use at the same time; exceeding this limit triggers the resource restriction. Wait for ongoing jobs to complete before starting a new one, or submit a ticket to request a quota increase.

  • You may encounter the following error log: the specified vswitch vsw-**** cannot create the required resource ecs.gn7i-c32g1.8xlarge, zone not match. This indicates that the requested instance type is out of stock in the current zone. You can try the following solutions:

    • Do not select a vSwitch. DLC will automatically choose a vSwitch based on inventory.

    • Use other specifications.

How do I download the trained model?

When you create the training job, you can set the model output path to an OSS path. After training completes, you can download the trained model from that path, as in the sketch below.

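A minimal download sketch using the same oss2 SDK as the upload example; the bucket name and output prefix are placeholders for the model output path you configured.

```python
# Sketch: download the trained model files from the configured OSS output path.
# Bucket name and prefix are placeholders for illustration.
import os
import oss2

auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-training-bucket")

# Iterate over every object under the model output prefix and save it locally.
for obj in oss2.ObjectIterator(bucket, prefix="models/dsr1-distill-qwen-7b-sft/"):
    filename = os.path.basename(obj.key)
    if filename:  # skip directory placeholder keys
        bucket.get_object_to_file(obj.key, filename)
```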
