PAI-EasyDistill Overview: Lightweight Model Distillation for AI - Platform For AI - Alibaba Cloud

Use cases

Suitable scenarios:

On-device or edge deployment: Compress large models into lightweight models for resource-constrained environments such as mobile phones and IoT devices.
Cost optimization: When inference costs for online services are too high, distill a smaller model to reduce costs.
Inference acceleration: For latency-sensitive applications, reduce latency and GPU resource consumption while maintaining accuracy. This improves service throughput.
Domain knowledge inheritance: Transfer domain-specific knowledge (such as healthcare or law) from a large model to a smaller, more cost-effective model.

Scenarios that may not be suitable:

Need for 100% performance fidelity: Distillation always involves some performance loss. Do not use distillation if any degradation is unacceptable.
Overly simple tasks: For simple tasks such as classification or text matching, training a small model directly may be more cost-effective than distillation.
Lack of high-quality seed data: Distillation effectiveness depends heavily on seed data quality. If seed data deviates from the actual business scenario, results may be poor.

How it works

PAI Model Gallery uses black box distillation, which is essentially generative data augmentation. The teacher model generates high-quality labeled data, which is then used for Supervised Fine-Tuning (SFT) on the student model.

The workflow is as follows:

Distillation data construction:
1. Prepare base data: Use a public dataset or a custom dataset.
2. (Optional) Data synthesis: PAI-EasyDistill provides data synthesis and augmentation capabilities to expand or optimize data. For non-reasoning models, perform instruction augmentation (expand and rewrite instructions for greater diversity) and instruction optimization (refine phrasing for clarity). For reasoning models, perform chain-of-thought abbreviation and chain-of-thought expansion.
3. Teacher model inference: The teacher model performs inference on the data to generate distillation data for training the student model.
Student model training: Use the generated distillation dataset to train the student model through SFT.
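
The data construction step above can be sketched in a few lines. This is a minimal illustration, not PAI's implementation: `build_distillation_data` and `fake_teacher` are hypothetical names, and `fake_teacher` stands in for a real teacher model call.

```python
from typing import Callable, Dict, List


def build_distillation_data(
    seed_data: List[Dict[str, str]],
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Ask the teacher to answer every seed instruction and keep the pairs."""
    distilled = []
    for record in seed_data:
        instruction = record["instruction"]
        distilled.append(
            {"instruction": instruction, "output": teacher_generate(instruction)}
        )
    return distilled


def fake_teacher(instruction: str) -> str:
    """Hypothetical stand-in for a call to the teacher model."""
    return f"[teacher answer to: {instruction}]"


seed = [{"instruction": "What is the capital of China?"}]
sft_dataset = build_distillation_data(seed, fake_teacher)
print(sft_dataset)
```

The resulting instruction/output pairs are what the student model is then fine-tuned on. Only the teacher's text output is used, which is why the method is called black box distillation.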

Quick start

Complete a model distillation task with a public dataset and default configurations.

Log on to the PAI console and, in the left-side navigation pane, choose QuickStart > Model Gallery.
On the Models page, use the filter to find a model that supports distillation, such as Qwen3-32B.

In the Supported Operations filter, select Distill. The system displays all teacher models that support distillation.
Click the model card to open the model details page. In the upper-right corner, click Distill to open the task creation page. Configure the following key parameters and keep defaults for other parameters.
1. Basic configuration: Set Model output path to an OSS path that you can access, such as oss://mybucket.oss-cn-hangzhou-internal.aliyuncs.com/model-distillation/model.
  
  Note
  Make sure to create a separate output directory for each distillation task to prevent model files from being overwritten. The OSS Bucket must be in the same region as the PAI service.
2. Build distillation data:
  - Dataset: Select Public Dataset and choose Chinese-medical-dialogue-data from the drop-down list.
  - Distillation output path: Set the path to an OSS path that you can access, such as oss://mybucket.oss-cn-hangzhou-internal.aliyuncs.com/model-distillation/dist-data.
  - Computing Resources: Set Source to Public Resources. Keep the default recommended Instance Type for Job Resource.
3. Student Model Training:
  - Student model config: Select Public Model and choose a smaller model from the drop-down list, such as Qwen3-4B.
  - Training Mode: Keep the default LoRA fine-tuning.
  - Computing Resources: Set Source to Public Resources. Keep the default recommended Instance Type for Job Resource.
Click Distill. In the billing reminder dialog box that appears, click OK. The page automatically redirects to the task details page, where you can track the task status.
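
If you script the path names you paste into the console, one way to follow the note about a separate output directory per task is to append a task name and timestamp. This is a sketch only: `task_output_path` is a hypothetical helper, and the bucket name is the placeholder from the example above.

```python
from datetime import datetime, timezone
from typing import Optional


def task_output_path(base: str, task_name: str, now: Optional[datetime] = None) -> str:
    """Return base/task_name-YYYYmmdd-HHMMSS/ so each task gets its own directory."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%d-%H%M%S")
    return f"{base.rstrip('/')}/{task_name}-{stamp}/"


# Placeholder bucket and prefix taken from the example in this section.
base = "oss://mybucket.oss-cn-hangzhou-internal.aliyuncs.com/model-distillation"
print(task_output_path(base, "qwen3-4b-medical"))
```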

The task takes approximately 20 to 40 minutes for a small dataset (100 to 1,000 entries).

Limitations

Model limits:
- Teacher model: Select from the models that support distillation in Model Gallery. The supported models are listed in the console.
- Student model: Use a public model from Model Gallery or a custom model. Custom models are limited to LLMs fine-tuned in Model Gallery. The student model must have fewer parameters than the teacher model.
Distillation method: Only black box distillation (based on SFT) is currently supported.
Dataset: Public distillation datasets and custom datasets are supported. The data must be in JSON format and must include an instruction field (the question column). An output field (the output column) is not required.
```
[
    {
        "instruction": "What is the capital of China?"
    },
    {
        "instruction": "Please explain what artificial intelligence is."
    }
]
```
Student model training method: Only SFT is supported, including LoRA, QLoRA, and full-parameter fine-tuning.
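
Before uploading a custom dataset, you can check it locally against the format described above. The following sketch assumes the file is a JSON array of objects, as in the example; `validate_dataset` is a hypothetical helper, not part of PAI.

```python
import json
from typing import List


def validate_dataset(text: str) -> List[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value must be a JSON array of records"]
    problems = []
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not a JSON object")
            continue
        instruction = record.get("instruction")
        if not isinstance(instruction, str) or not instruction.strip():
            problems.append(f"record {i}: missing or empty 'instruction'")
    return problems


sample = '[{"instruction": "What is the capital of China?"}, {"output": "no question"}]'
print(validate_dataset(sample))  # ["record 1: missing or empty 'instruction'"]
```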

Configuration details

Distillation data construction

Dataset configuration

Data synthesis is an optional feature that transforms the original dataset to improve diversity and quality. Select a data augmentation strategy based on the teacher model type.

Synthesize instruction data (for general instruction models):

Enhance instructions: Expands and rewrites original instructions to produce more diverse styles.

Input example:
{"instruction": "Create a two-day travel guide for Hangzhou for me."}
Output examples:
{"instruction": "Create a three-day travel guide for Beijing for me."}
{"instruction": "I want to visit Shanghai. Recommend a travel itinerary for me."}

Optimize instructions: Optimizes the wording of instructions to make them clearer and easier for the model to understand.

Input example:
{"instruction": "Create a two-day travel guide for Hangzhou for me."}
Output example:
{"instruction": "Please create a two-day travel guide for Hangzhou for me. It should include the itinerary, food recommendations, accommodation suggestions, and the best time to travel."}

Synthesize chain-of-thought data (for reasoning models):

Chain-of-thought expansion: Adds detailed reasoning steps (Chain of Thought) to the original Q&A pair to enhance the student model's reasoning ability.

Input example:
{"instruction": "John has 3 apples and eats 1. How many apples are left?", "output": "<think>short chain of thought</think> <output>John has 2 apples left</output>"}
Output example:
{"instruction": "John has 3 apples and eats 1. How many apples are left?", "output": "<think>long chain of thought</think> <output>John has 2 apples left</output>"}

Chain-of-thought abbreviation: Simplifies lengthy chain-of-thought processes to improve inference efficiency.

Input example:
{"instruction": "John has 3 apples and eats 1. How many apples are left?", "output": "<think>long chain of thought</think> <output>John has 2 apples left</output>"}
Output example:
{"instruction": "John has 3 apples and eats 1. How many apples are left?", "output": "<think>short chain of thought</think> <output>John has 2 apples left</output>"}

When performing distillation for the first time, do not enable data synthesis. This helps establish a baseline. After you are familiar with the process, you can try enabling different synthesis options and evaluate their impact on the final results.
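
When you evaluate chain-of-thought synthesis, it helps to confirm that only the reasoning changed and the final answer did not. The sketch below assumes the `<think>`/`<output>` tag layout shown in the examples above; `split_cot` and `answer_preserved` are hypothetical helpers.

```python
import re
from typing import Optional, Tuple

_PATTERN = re.compile(r"<think>(.*?)</think>\s*<output>(.*?)</output>", re.DOTALL)


def split_cot(output: str) -> Optional[Tuple[str, str]]:
    """Return (chain_of_thought, answer), or None if the tags are missing."""
    match = _PATTERN.search(output)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def answer_preserved(before: str, after: str) -> bool:
    """True when both records parse and carry the same final answer."""
    a, b = split_cot(before), split_cot(after)
    return a is not None and b is not None and a[1] == b[1]


short = "<think>3 - 1 = 2</think> <output>John has 2 apples left</output>"
expanded = (
    "<think>John starts with 3 apples. Eating 1 removes it. 3 - 1 = 2.</think> "
    "<output>John has 2 apples left</output>"
)
print(split_cot(short))
print(answer_preserved(short, expanded))
```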

Hyperparameter configuration

These hyperparameters control the generation behavior of the teacher model during the data construction stage.

Parameter	Description	Tuning suggestions
Inference parameters (applied to teacher model inference)
`temperature`	Controls the randomness of the generated text. Range: [0, 2]. Default value: 0.8.	• Increase (e.g., > 1.0): For more diverse and creative output. Suitable for scenarios that require varied response styles. • Decrease (e.g., < 0.5): For more deterministic and conservative output. Suitable for scenarios that require factual accuracy and a single correct answer, such as mathematical calculations. • The default value of 0.8 provides a balance between diversity and accuracy and is suitable for most general use cases.
`max_length`	The maximum number of tokens that can be input to the teacher model, including the `instruction`. Any excess is truncated. Default value: 512.	• Ensure this value is greater than the length of the longest input in the dataset to prevent information loss. • Note: This parameter affects only the data construction stage and is different from the `seq_length` parameter used in the subsequent student model training.
`max_new_tokens`	The maximum number of new tokens (the `output`) that the teacher model can generate. Default value: 128.	• Set this value to be greater than the average length of the answers you expect the teacher model to generate. Otherwise, the model truncates the answers. • For tasks that require detailed reasoning (such as chain of thought) or long text generation, consider increasing this value to 512 or 1024. • Cost Warning: Increasing this value significantly increases the computation time and cost of the data construction stage.
Feature control parameters (displayed only when instruction augmentation is enabled)
`num_augment_samples`	The number of samples generated for each piece of data during instruction augmentation. Default value: 0.	• Increase (e.g., > 5): To generate more diverse instructions. This significantly increases data volume and computation costs. • Decrease (e.g., 1-2): To generate fewer augmented samples. This is suitable for large datasets or limited computing resources. • The default value of 0 means no instruction augmentation is performed. We recommend that you keep the default value for your first run to establish a baseline before making adjustments.
`num_in_context_samples`	The number of in-context samples used during instruction augmentation. Default value: 3.	• Increase (e.g., > 5): To generate instructions that are more semantically aligned with the context, but this may reduce diversity and increase computational overhead. • Decrease (e.g., 1-2): To generate more random instruction variations, which increases diversity but may sacrifice relevance. • The default value of 3 provides a balance between semantic relevance and diversity and is suitable for most general use cases.
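
To see why a lower `temperature` gives more deterministic output, consider the standard temperature-scaled softmax, in which logits are divided by the temperature before normalization. This is a generic illustration with made-up logits. How the teacher model service handles a temperature of exactly 0 is not stated here, so treating it as greedy selection is an assumption.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Assumption: treat 0 as greedy decoding (all mass on the largest logit).
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values for three candidate tokens
for t in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability moves from the top token to the alternatives, which is the source of the extra diversity described in the table.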

Distillation data confirmation

After data construction is complete, the distillation data must be confirmed before student model training starts. Two confirmation modes are available:

Auto-confirm (Recommended): After data construction is complete, the system automatically validates and approves the data, and then starts the student model training. This is suitable for scenarios where you use a public dataset or are confident in the data quality.
Manually confirm: After data construction is complete, you need to manually review the data quality. This is suitable for scenarios with extremely high data quality requirements or when using a custom dataset for the first time. The procedure is as follows:
1. After the distillation data construction is complete, the task status in the distillation task list changes to Waiting for distillation data confirmation.
2. In the Actions column, click the Confirm distillation data button. The system displays a confirmation dialog box.
3. In the dialog box, click Click to view to see the details of the distillation dataset.
  - If the data quality meets your expectations, click Confirm to proceed with student model training.
  - If the data does not meet your expectations, click Cancel. The task will stop at the current stage.
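
For manual confirmation of a large dataset, reading a random sample and flagging blank answers is usually more practical than reading everything. The sketch below assumes you have downloaded the distillation data and that each record carries `instruction` and `output` fields; both helper names are hypothetical.

```python
import random
from typing import Dict, List


def review_sample(records: List[Dict[str, str]], k: int = 5, seed: int = 0) -> List[Dict[str, str]]:
    """Pick up to k records at random for manual reading."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def empty_outputs(records: List[Dict[str, str]]) -> List[int]:
    """Indices of records whose teacher output is missing or blank."""
    return [i for i, r in enumerate(records) if not r.get("output", "").strip()]


records = [
    {"instruction": "a", "output": "x"},
    {"instruction": "b", "output": " "},
    {"instruction": "c"},
]
print(empty_outputs(records))  # [1, 2]
print(len(review_sample(records)))  # 3
```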

Student model training

Training method

Select the specific method for performing SFT on the student model.

LoRA (Recommended): A parameter-efficient fine-tuning method that significantly reduces the required GPU memory while maintaining good performance.
QLoRA: A quantized version of LoRA that further reduces GPU memory usage. It is suitable for training in more resource-constrained environments.
Full-parameter fine-tuning: Updates all model parameters. This method theoretically yields the best results but requires enormous compute resources and time, making it extremely costly.
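
A rough parameter count shows why LoRA needs less memory. For one weight matrix of shape d_out x d_in, LoRA trains two low-rank factors with rank r, which adds r * (d_out + d_in) trainable parameters instead of d_out * d_in. The dimensions and rank below are illustrative only; the rank that PAI uses is not stated in this document.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole matrix is fine-tuned."""
    return d_out * d_in


d = 4096  # hypothetical hidden size
r = 8     # hypothetical LoRA rank
lora = lora_trainable_params(d, d, r)
full = full_params(d, d)
print(lora, full, f"{lora / full:.4%}")  # 65536 16777216 0.3906%
```

Fewer trainable parameters means fewer gradients and optimizer states to hold in GPU memory, which is where most of the savings come from.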

Validation set configuration

Select a validation dataset to evaluate the model's performance during training.

Don't configure: No validation is performed during training. This is suitable for quick experiments or scenarios with small data volumes.
Auto-split distillation data (Recommended): The system automatically splits the generated distillation dataset into a training set and a validation set. By default, 5% of the data is used for the validation set.
Add validation dataset: Specify a separate OSS dataset as the validation set. This is suitable for scenarios with a standard validation set.
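
The auto-split option can be pictured as a shuffled hold-out. The sketch below uses the 5% default from above; whether PAI shuffles before splitting and how it rounds are not documented here, so both are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_val(
    records: Sequence[T], val_ratio: float = 0.05, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the records and hold out val_ratio of them."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]


train, val = split_train_val(list(range(1000)))
print(len(train), len(val))  # 950 50
```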

Hyperparameter configuration

These hyperparameters control the SFT process for the student model.

Parameter	Description	Tuning suggestions
`learning_rate`	The learning rate, which controls the step size of model parameter updates. Default value: 5e-5.	• If the training loss decreases slowly, try increasing the learning rate. • If the loss fluctuates sharply or does not converge, decrease the learning rate.
`num_train_epochs`	The number of training epochs, which is the number of times the training process iterates over the entire dataset. Default value: 1.	• For large datasets, one to three epochs are usually sufficient. • Too many epochs may lead to overfitting.
`per_device_train_batch_size`	The number of samples processed in a single training step on each GPU. Default value: 1.	• If GPU memory allows, increasing this value can speed up and potentially stabilize the training process. • If you encounter an out of memory (OOM) error, decrease this value first.
`seq_length`	The maximum sequence length (number of tokens) for the input to the student model. Default value: 128.	• Set this value based on your data characteristics and application scenario. Longer text requires a larger `seq_length`. • Increasing this value significantly increases GPU memory consumption.
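
To estimate how batch size and epochs affect run length, you can count optimizer steps. This sketch assumes plain data parallelism with no gradient accumulation, which this document does not confirm; `training_steps` and `num_gpus` are hypothetical names.

```python
import math


def training_steps(
    num_samples: int,
    per_device_train_batch_size: int,
    num_gpus: int,
    num_train_epochs: int,
) -> int:
    """Steps per epoch = ceil(samples / global batch), multiplied by epochs."""
    global_batch = per_device_train_batch_size * num_gpus
    return math.ceil(num_samples / global_batch) * num_train_epochs


print(training_steps(1000, 1, 1, 1))  # 1000
print(training_steps(1000, 4, 2, 3))  # 375
```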

Best practices and tuning

Cost optimization strategies

Resource selection:
- Testing/Small tasks: Use Public Resources for pay-as-you-go flexibility.
- Production/High-priority tasks: Use a resource quota to ensure resource stability and cost control.
- Cost-sensitive/Fault-tolerant tasks: Try preemptible resources to obtain computing power at a lower price, but you must accept the risk that the task may be interrupted.
Start with a small dataset: Before performing large-scale distillation, run the process with a small portion of data (100 to 1,000 entries). This verifies that the configuration is correct and avoids wasting compute resources.
Optimize the dataset: A smaller but higher-quality seed dataset reduces teacher model inference overhead and student model training time.
Choose QLoRA for training: QLoRA significantly reduces GPU memory requirements by training on a quantized model. This lets you train on lower-spec (and cheaper) GPUs.
Set a reasonable runtime: Set a reasonable Maximum Running Time for the task to prevent it from running for too long and incurring unnecessary costs due to unexpected issues.

Performance tuning guide

Data quality is key: The teacher model sets the upper limit of distillation, but data quality sets the lower limit. Ensure the raw dataset covers your business scenarios, has clear instructions, and contains diverse content.
Hyperparameter tuning:
- If the student model's training loss does not decrease or diverges, prioritize decreasing the learning rate (`learning_rate`).
- If you encounter an out-of-memory (OOM) error, prioritize decreasing the batch size (`per_device_train_batch_size`) or sequence length (`seq_length`).
Monitor GPU utilization: In Task Monitoring, check the GPU utilization. If utilization stays low, the GPU is underused. Increase `per_device_train_batch_size` to improve efficiency.
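
The learning-rate guidance above can be turned into a simple check on logged loss values. This is a heuristic sketch: the window size, the 1% tolerance, and the `diagnose_loss` name are assumptions chosen for illustration, not values from PAI.

```python
from typing import List


def diagnose_loss(losses: List[float], window: int = 3, tolerance: float = 0.01) -> str:
    """Compare the mean of the first and last `window` loss values."""
    if len(losses) < 2 * window:
        return "not enough data"
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    if start == 0:
        return "not enough data"
    change = (end - start) / start
    if change > -tolerance:
        # Flat or rising loss: the guidance is to lower learning_rate first.
        return "not decreasing: lower learning_rate"
    return "decreasing: keep current settings"


print(diagnose_loss([2.0, 1.9, 1.8, 1.2, 1.1, 1.0]))
print(diagnose_loss([1.0, 1.0, 1.0, 1.5, 2.0, 2.5]))
```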

Performance evaluation

After distillation, evaluate the student model's performance.

Deploy the model: On the distillation task details page, click the Deploy button in the upper-right corner to deploy the distilled student model as an online inference service.
Comparative evaluation: Use PAI model evaluation to compare performance metrics (such as accuracy and BLEU) of the teacher and student models on the same evaluation set to quantify performance loss.
Production validation: Before official release, perform a small-scale grayscale test to compare the inference speed, resource usage, and business impact of the models before and after distillation in a real business scenario.
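
As a minimal picture of comparative evaluation, the sketch below scores teacher and student predictions against the same references and reports how much of the teacher's accuracy the student retains. The data is a toy example and exact match is only one possible metric; for real comparisons, use PAI model evaluation as described above.

```python
from typing import List


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Toy evaluation set, for illustration only.
references = ["Beijing", "2", "Paris", "4"]
teacher_preds = ["Beijing", "2", "Paris", "4"]
student_preds = ["Beijing", "2", "Paris", "5"]

teacher_acc = accuracy(teacher_preds, references)
student_acc = accuracy(student_preds, references)
print(f"teacher={teacher_acc:.2f} student={student_acc:.2f} retained={student_acc / teacher_acc:.0%}")
```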

FAQ

Q: Teacher and student model selection

Teacher model: Choose a large-parameter model that performs well in your business scenario. More parameters generally means more knowledge and higher distillation potential.
Student model: Choose a model from the same series as the teacher but with fewer parameters. For example, distill from Qwen-30B to Qwen-8B. This ensures architectural compatibility and maximizes knowledge transfer efficiency.

Q: Distillation task duration

Duration depends on dataset size, model scale, training method, and compute resources. A task with a small dataset (a few thousand entries) and a medium-sized model (7B-14B) typically takes tens of minutes to a few hours.

Q: Resuming an interrupted task

A task that is interrupted because it reached the configured Maximum Running Time fails and cannot be resumed. Analyze the task logs to determine whether the slow run was caused by insufficient resources or by a time limit that was too short. Then increase the Maximum Running Time or use higher-spec resources, and resubmit the task.

Q: Resolving the `CUDA out of memory` error

This is a typical GPU out-of-memory error. Troubleshoot as follows:

Decrease the batch size: In the Student model training configuration, find the Hyperparameters section. Halve the value of per_device_train_batch_size (for example, from 2 to 1) and retry.
Switch the training method: If decreasing the batch size does not work, return to the configuration page and switch the Training Method from LoRA or Full-parameter fine-tuning to QLoRA. QLoRA can significantly reduce GPU memory usage.
Upgrade the GPU specification: If the preceding methods do not work, the model and data scale require higher-spec hardware. In the Computing Resources section, select a GPU with more memory.
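
The troubleshooting order above can be written as a small escalation function. This is a sketch of the decision order only: the dictionary keys `training_method` and `needs_larger_gpu` are hypothetical names, not PAI configuration fields.

```python
from typing import Dict


def next_oom_step(config: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of config with the next OOM mitigation applied."""
    updated = dict(config)
    batch = int(updated["per_device_train_batch_size"])
    if batch > 1:
        # Step 1: halve the batch size.
        updated["per_device_train_batch_size"] = batch // 2
    elif updated["training_method"] != "qlora":
        # Step 2: switch to QLoRA to cut GPU memory usage.
        updated["training_method"] = "qlora"
    else:
        # Step 3: nothing left to tune; a GPU with more memory is needed.
        updated["needs_larger_gpu"] = True
    return updated


cfg = {"per_device_train_batch_size": 2, "training_method": "lora"}
step1 = next_oom_step(cfg)
step2 = next_oom_step(step1)
step3 = next_oom_step(step2)
print(step1, step2, step3, sep="\n")
```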
