Training and inference for large language models (LLMs) consume significant energy and result in long response times. These limitations hinder LLM deployment in resource-constrained environments. To address them, PAI provides model distillation—a technique that transfers knowledge from a large teacher model to a smaller student model. This preserves most of the original performance while substantially reducing model size and compute resource requirements, enabling broader real-world adoption. This topic walks you through the end-to-end development workflow for LLM data augmentation and model distillation using the Qwen2 LLM.
Workflow
Follow this complete development workflow:
- Prepare your training dataset according to the required data format and recommended data preparation strategies.
- Optional: Use an instruction augmentation model. In the PAI-Model Gallery, use a prebuilt instruction augmentation model—Qwen2-1.5B-Instruct-Exp or Qwen2-7B-Instruct-Exp. These models automatically generate additional instructions semantically similar to those in your dataset. Instruction augmentation improves generalization during LLM distillation training.
- Optional: Use an instruction optimization model. In the PAI-Model Gallery, use a prebuilt instruction optimization model—Qwen2-1.5B-Instruct-Refine or Qwen2-7B-Instruct-Refine. These models refine and enrich instructions from your dataset—including augmented ones—to improve LLM text generation quality.
- Deploy a teacher LLM to generate responses. In the PAI-Model Gallery, use a prebuilt teacher LLM—Qwen2-72B-Instruct—to generate responses for the instructions in your training dataset. This step distills knowledge from the teacher model into your dataset.
- Distill-train a smaller student model. In the PAI-Model Gallery, use your completed instruction-response dataset to distill-train a smaller student model suitable for production deployment.
Prerequisites
Before you begin, confirm you have completed these steps:
- You have activated the pay-as-you-go billing method for PAI (including DLC and EAS) and created a default workspace. For more information, see Activate PAI and create a default workspace.
- You have created an Object Storage Service (OSS) bucket to store training data and resulting model files. For more information about how to create a bucket, see Quick Start.
Prepare instruction data
Prepare instruction data according to the following data preparation strategy and data format requirements:
Data preparation strategy
To improve the effectiveness and stability of model distillation, you can use the following strategies to prepare your data:
- Prepare at least several hundred data points. Preparing more data improves model performance.
- The seed dataset should have a broad and balanced distribution. For example, task scenarios should be diverse, data inputs and outputs should include both short and long examples, and if the dataset contains multiple languages—such as Chinese and English—the language distribution should be relatively balanced.
- Process abnormal data. Even small amounts can significantly impact fine-tuning results. Use rule-based methods to clean the data and filter out invalid entries.
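The rule-based cleaning mentioned above can be automated. The following is a minimal sketch, not a PAI utility; the length thresholds are illustrative assumptions that you should tune for your own data:

```python
import json


def clean_instructions(records, min_len=5, max_len=2000):
    """Rule-based cleaning: drop empty, too-short, too-long, and duplicate instructions."""
    seen = set()
    cleaned = []
    for r in records:
        text = str(r.get("instruction", "")).strip()
        if not (min_len <= len(text) <= max_len):
            continue  # filter out empty, too-short, or too-long entries
        if text in seen:
            continue  # drop exact duplicates
        seen.add(text)
        cleaned.append({"instruction": text})
    return cleaned


raw = [
    {"instruction": "What is model distillation?"},
    {"instruction": ""},                              # invalid: empty
    {"instruction": "What is model distillation?"},   # duplicate
]
print(json.dumps(clean_instructions(raw), ensure_ascii=False))
# → [{"instruction": "What is model distillation?"}]
```

Stronger cleaning—language detection, toxicity filtering, near-duplicate removal—follows the same pattern of filtering the list before writing it back to JSON.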
Data format requirements
Your training dataset must be a JSON file containing a list of objects, each with a single field, instruction, which holds the input instruction. Example:
```json
[
    {
        "instruction": "What major measures did governments take to stabilize financial markets during the 2008 financial crisis?"
    },
    {
        "instruction": "What important actions have governments taken to promote sustainable development amid worsening climate change?"
    },
    {
        "instruction": "What major measures did governments take to support economic recovery during the 2001 tech bubble burst?"
    }
]
```
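Before uploading, you can sanity-check that your file matches this format. A minimal validation sketch; the file name is an assumption:

```python
import json


def validate_dataset(path):
    """Check that the file is a JSON list of objects, each with a non-empty 'instruction' string."""
    with open(path) as fp:
        data = json.load(fp)
    assert isinstance(data, list) and data, "dataset must be a non-empty JSON list"
    for i, item in enumerate(data):
        assert isinstance(item, dict), f"entry {i} is not an object"
        assert isinstance(item.get("instruction"), str) and item["instruction"].strip(), \
            f"entry {i} is missing a non-empty 'instruction' field"
    return len(data)


# Example: print(validate_dataset("input.json"))
```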
Optional: Use an instruction augmentation model
Instruction augmentation is a common prompt engineering technique for LLMs. It automatically expands user-provided instruction datasets to increase diversity and coverage.
- For example, given this input:
  - How do I cook fish-flavored shredded pork?
  - How do I prepare for the GRE exam?
  - What should I do if a friend misunderstands me?
- The model outputs something like this:
  - Teach me how to cook mapo tofu.
  - Provide a detailed guide for preparing for the TOEFL exam.
  - If you face setbacks at work, how do you adjust your mindset?
Instruction diversity directly affects LLM generalization. Augmenting instructions improves final student model performance. Based on the Qwen2 base model, PAI provides two proprietary instruction augmentation models: Qwen2-1.5B-Instruct-Exp and Qwen2-7B-Instruct-Exp. You can deploy either as an EAS online service with one click. Follow these steps:
Deploy the model service
Deploy the instruction augmentation model as an EAS online service:
- Go to the Model Gallery page:
  - Log on to the PAI console.
  - In the top-left corner, select your region.
  - In the navigation pane on the left, choose Workspaces, then click your workspace name to open it.
  - In the navigation pane on the left, choose Model Gallery.
- In the model list on the right side of the Model Gallery page, search for Qwen2-1.5B-Instruct-Exp or Qwen2-7B-Instruct-Exp. Click Deploy on the corresponding card.
- In the Deploy configuration panel, the system sets default values for Model Service Information and Resource Deployment Information. Modify them as needed. Then click Deploy.
- In the Billing Notice dialog box, click OK.
  The system opens the Deployment Task page. When the Status shows Running, the deployment has succeeded.
Call the model service
After successful deployment, use the API to run inference. For full usage, see Deploy large language models. Below is an example client request:
- Get the service endpoint and token:
  - On the Service Details page, click Basic Information, then click View Endpoint Information.
  - In the Endpoint Information dialog box, find the endpoint and token. Save them locally.
- In your terminal, create and run this Python script:
```python
import argparse
import json
import requests
from typing import List


def post_http_request(prompt: str, system_prompt: str, host: str,
                      authorization: str, max_new_tokens: int,
                      temperature: float, top_k: int,
                      top_p: float) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "eos_token_id": 151645
    }
    response = requests.post(host, headers=headers, json=pload)
    return response


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    output = data["response"]
    return output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=50)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=1)
    parser.add_argument("--prompt", type=str, default="Sing me a song.")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    temperature = args.temperature
    max_new_tokens = args.max_new_tokens
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    print(f" --- input: {prompt}\n", flush=True)
    system_prompt = "You are an instruction creator. Your goal is to create a new instruction inspired by the [given instruction]."
    response = post_http_request(
        prompt, system_prompt, host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    print(f" --- output: {output}\n", flush=True)
```

Where:
- host: Set to your service endpoint.
- authorization: Set to your service token.
Batch instruction augmentation
You can use the EAS online service above to batch-process instructions. The following example reads a custom JSON dataset and calls the model API to augment instructions. Create and run this Python script in your terminal:
```python
import requests
import json
import random
from tqdm import tqdm
from typing import List

input_file_path = "input.json"  # Input filename
with open(input_file_path) as fp:
    data = json.load(fp)

total_size = 10  # Target total number of samples after expansion
pbar = tqdm(total=total_size)
while len(data) < total_size:
    prompt = random.sample(data, 1)[0]["instruction"]
    system_prompt = "You are an instruction creator. Your goal is to create a new instruction inspired by the [given instruction]."
    top_k = 50
    top_p = 0.95
    temperature = 1
    max_new_tokens = 2048
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    temp = {
        "instruction": output
    }
    data.append(temp)
    pbar.update(1)
pbar.close()

output_file_path = "output.json"  # Output filename
with open(output_file_path, 'w') as f:
    json.dump(data, f, ensure_ascii=False)
```
Where:
- host: Set to your service endpoint.
- authorization: Set to your service token.
- input_file_path and output_file_path: Replace with the local paths to your input and output dataset files.
- The post_http_request and get_response functions match those defined in the Call the model service Python script.
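Long batch runs can hit transient network errors. A small retry wrapper around the post_http_request function from the Call the model service script can make them more robust. This is a sketch, not part of the PAI scripts; post_with_retry is a helper name introduced here:

```python
import time


def post_with_retry(post_fn, *args, retries=3, backoff=2.0):
    """Retry a request function on exceptions or non-200 responses, with exponential backoff."""
    for attempt in range(retries):
        try:
            resp = post_fn(*args)
            if resp.status_code == 200:
                return resp
        except Exception:
            pass  # transient failure; fall through to the backoff sleep
        time.sleep(backoff * (2 ** attempt))  # wait longer after each failed attempt
    raise RuntimeError(f"request failed after {retries} attempts")


# Usage (same positional arguments as post_http_request):
# response = post_with_retry(post_http_request, prompt, system_prompt,
#                            host, authorization, max_new_tokens,
#                            temperature, top_k, top_p)
```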
You can also use the LLM-Instruction Expansion (DLC) component in PAI-Designer to achieve this without code. For details, see Custom pipelines. 
Optional: Use an instruction optimization model
Instruction optimization is another common prompt engineering technique for LLMs. It automatically refines user-provided instruction datasets to generate more detailed, structured instructions—leading to richer LLM responses.
- For example, given this input to the instruction optimization model:
  - How do I cook fish-flavored shredded pork?
  - How do I prepare for the GRE exam?
  - What should I do if a friend misunderstands me?
- The model outputs something like this:
  - Provide a detailed Sichuan-style recipe for fish-flavored shredded pork. Include a specific ingredient list—vegetables, pork, and seasonings—along with step-by-step cooking instructions. Also recommend suitable side dishes and staple foods to serve with it.
  - Provide a comprehensive guide covering GRE registration, required documents, study strategies, and recommended review materials. Also suggest effective practice questions and mock exams to help me prepare.
  - Provide a detailed guide on staying calm and rational when misunderstood by a friend—and communicating effectively to resolve it. Include practical advice—for example, how to express your thoughts and feelings, how to avoid escalating misunderstandings, and specific dialogue scenarios and situations for practice.
Instruction detail directly affects LLM output quality. Optimizing instructions improves final student model performance. Based on the Qwen2 base model, PAI provides two proprietary instruction optimization models: Qwen2-1.5B-Instruct-Refine and Qwen2-7B-Instruct-Refine. You can deploy either as an EAS online service with one click. Follow these steps:
Deploy the model service
- Go to the Model Gallery page:
  - Log on to the PAI console.
  - In the top-left corner, select your region.
  - In the navigation pane on the left, choose Workspaces, then click your workspace name to open it.
  - In the navigation pane on the left, choose Model Gallery.
- In the model list on the right side of the Model Gallery page, search for Qwen2-1.5B-Instruct-Refine or Qwen2-7B-Instruct-Refine. Click Deploy on the corresponding card.
- In the Deploy configuration panel, the system sets default values for Model Service Information and Resource Deployment Information. Modify them as needed. Then click Deploy.
- In the Billing Notice dialog box, click OK.
  The system opens the Deployment Task page. When the Status shows Running, the deployment has succeeded.
Call the model service
After successful deployment, use the API to run inference. For full usage, see Deploy large language models. Below is an example client request:
- Get the service endpoint and token:
  - On the Service Details page, click Basic Information, then click View Endpoint Information.
  - In the Endpoint Information dialog box, find the endpoint and token. Save them locally.
- In your terminal, create and run this Python script:
```python
import argparse
import json
import requests
from typing import List


def post_http_request(prompt: str, system_prompt: str, host: str,
                      authorization: str, max_new_tokens: int,
                      temperature: float, top_k: int,
                      top_p: float) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "eos_token_id": 151645
    }
    response = requests.post(host, headers=headers, json=pload)
    return response


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    output = data["response"]
    return output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=2)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--max-new-tokens", type=int, default=256)
    parser.add_argument("--temperature", type=float, default=0.5)
    parser.add_argument("--prompt", type=str, default="Sing me a song.")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    temperature = args.temperature
    max_new_tokens = args.max_new_tokens
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    print(f" --- input: {prompt}\n", flush=True)
    system_prompt = "Optimize this instruction to make it more detailed and specific."
    response = post_http_request(
        prompt, system_prompt, host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    print(f" --- output: {output}\n", flush=True)
```

Where:
- host: Set to your service endpoint.
- authorization: Set to your service token.
Batch instruction optimization
You can use the EAS online service above to batch-process instructions. The following example reads a custom JSON dataset and calls the model API to optimize instructions. Create and run this Python script in your terminal:
```python
import requests
import json
import random
from tqdm import tqdm
from typing import List

input_file_path = "input.json"  # Input filename
with open(input_file_path) as fp:
    data = json.load(fp)

pbar = tqdm(total=len(data))
new_data = []
for d in data:
    prompt = d["instruction"]
    system_prompt = "Optimize the following instruction."
    top_k = 50
    top_p = 0.95
    temperature = 1
    max_new_tokens = 2048
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    temp = {
        "instruction": output
    }
    new_data.append(temp)
    pbar.update(1)
pbar.close()

output_file_path = "output.json"  # Output filename
with open(output_file_path, 'w') as f:
    json.dump(new_data, f, ensure_ascii=False)
```
Where:
- host: Set to your service endpoint.
- authorization: Set to your service token.
- input_file_path and output_file_path: Replace with the local paths to your input and output dataset files.
- The post_http_request and get_response functions match those defined in the Call the model service Python script.
You can also use the LLM-Instruction Optimization (DLC) component in PAI-Designer to achieve this without code. For details, see Custom pipelines. 
Deploy a teacher LLM to generate responses
Deploy the model service
After optimizing your instruction dataset, deploy a teacher LLM to generate responses. Follow these steps:
- Go to the Model Gallery page:
  - Log on to the PAI console.
  - In the top-left corner, select your region.
  - In the navigation pane on the left, choose Workspaces, then click your workspace name to open it.
  - In the navigation pane on the left, choose Model Gallery.
- In the model list on the right side of the Model Gallery page, search for Qwen2-72B-Instruct. Click Deploy on the corresponding card.
- In the Deploy configuration panel, the system sets default values for Model Service Information and Resource Deployment Information. Modify them as needed. Then click Deploy.
- In the Billing Notice dialog box, click OK.
  The system opens the Deployment Task page. When the Status shows Running, the deployment has succeeded.
Call the model service
After successful deployment, use the API to run inference. For full usage, see Deploy large language models. Below is an example client request:
- Get the service endpoint and token:
  - On the Service Details page, click Basic Information, then click View Endpoint Information.
  - In the Endpoint Information dialog box, find the endpoint and token. Save them locally.
- In your terminal, create and run this Python script:
```python
import argparse
import json
import requests
from typing import List


def post_http_request(prompt: str, system_prompt: str, host: str,
                      authorization: str, max_new_tokens: int,
                      temperature: float, top_k: int,
                      top_p: float) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
    }
    response = requests.post(host, headers=headers, json=pload)
    return response


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    output = data["response"]
    return output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=50)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.5)
    parser.add_argument("--prompt", type=str)
    parser.add_argument("--system_prompt", type=str)
    args = parser.parse_args()

    prompt = args.prompt
    system_prompt = args.system_prompt
    top_k = args.top_k
    top_p = args.top_p
    temperature = args.temperature
    max_new_tokens = args.max_new_tokens
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    print(f" --- input: {prompt}\n", flush=True)
    response = post_http_request(
        prompt, system_prompt, host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    print(f" --- output: {output}\n", flush=True)
```

Where:
- host: Set to your service endpoint.
- authorization: Set to your service token.
Batch teacher model instruction annotation
The following example reads a custom JSON dataset and calls the model API to annotate instructions using the teacher model. Create and run this Python script in your terminal:
```python
import json
from tqdm import tqdm
import requests
from typing import List

input_file_path = "input.json"  # Input filename
with open(input_file_path) as fp:
    data = json.load(fp)

pbar = tqdm(total=len(data))
new_data = []
for d in data:
    system_prompt = "You are a helpful assistant."
    prompt = d["instruction"]
    print(prompt)
    top_k = 50
    top_p = 0.95
    temperature = 0.5
    max_new_tokens = 2048
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    temp = {
        "instruction": prompt,
        "output": output
    }
    new_data.append(temp)
    pbar.update(1)
pbar.close()

output_file_path = "output.json"  # Output filename
with open(output_file_path, 'w') as f:
    json.dump(new_data, f, ensure_ascii=False)
```
Where:
- host: Set to your service endpoint.
- authorization: Set to your service token.
- input_file_path and output_file_path: Replace with the local paths to your input and output dataset files.
- The post_http_request and get_response functions match those defined in the Call the model service script.
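After batch annotation, every record should contain both the instruction and the teacher's response before you start distillation training. A quick check you can run on the annotated file; the function name and file path are illustrative, and the field names follow the batch script above:

```python
import json


def check_pairs(path):
    """Return (total record count, indices of records missing a non-empty
    'instruction' or 'output' field) for an annotated JSON dataset."""
    with open(path) as fp:
        data = json.load(fp)
    bad = [i for i, d in enumerate(data)
           if not (str(d.get("instruction", "")).strip()
                   and str(d.get("output", "")).strip())]
    return len(data), bad


# Example: total, bad = check_pairs("output.json")
# Records listed in `bad` should be re-annotated or removed before training.
```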
Distill-train a smaller student model
Train the model
After obtaining teacher model responses, train a student model in the PAI-Model Gallery—no coding required. This greatly simplifies model development. This solution uses the Qwen2-7B-Instruct model as an example. Follow these steps to train a model using your prepared dataset in the PAI-Model Gallery:
- Go to the Model Gallery page:
  - Log on to the PAI console.
  - In the top-left corner, select your region.
  - In the navigation pane on the left, choose Workspaces, then click your workspace name to open it.
  - In the navigation pane on the left, choose Model Gallery.
- In the model list on the right side of the Model Gallery page, search for and click the Qwen2-7B-Instruct model card to open its details page.
- On the model details page, click Fine-tune in the upper-right corner.
- In the Fine-tune configuration panel, set these key parameters. Leave others at default.
| Parameter | Description | Default value |
| --- | --- | --- |
| **Dataset configuration** | | |
| Training dataset | Select OSS file or directory from the dropdown, then select your dataset's OSS storage path: click the icon and select your OSS bucket, click Upload file to upload your dataset file to the OSS directory using the console guidance, and click OK. | None |
| **Training output configuration** | | |
| model | Click the icon, then select your OSS storage directory. | None |
| tensorboard | Click the icon, then select your OSS storage directory. | None |
| **Compute resource configuration** | | |
| Job resources | Select a resource specification. The system recommends suitable options. | None |
| **Hyperparameter configuration** | | |
| learning_rate | Learning rate for model training. Type: Float. | 5e-5 |
| num_train_epochs | Number of training epochs. Type: Int. | 1 |
| per_device_train_batch_size | Number of training samples per GPU per iteration. Type: Int. | 1 |
| seq_length | Text sequence length. Type: Int. | 128 |
| lora_dim | LoRA dimension. Type: Int. If lora_dim > 0, LoRA or QLoRA lightweight training is used. | 32 |
| lora_alpha | LoRA weight. Type: Int. Takes effect only if lora_dim > 0 and LoRA or QLoRA lightweight training is used. | 32 |
| load_in_4bit | Whether to load the model in 4-bit mode. Type: Bool. Valid values: true, false. If lora_dim > 0, load_in_4bit is true, and load_in_8bit is false, 4-bit QLoRA lightweight training is used. | true |
| load_in_8bit | Whether to load the model in 8-bit mode. Type: Bool. Valid values: true, false. If lora_dim > 0, load_in_4bit is false, and load_in_8bit is true, 8-bit QLoRA lightweight training is used. | false |
| gradient_accumulation_steps | Number of gradient accumulation steps. Type: Int. | 8 |
| apply_chat_template | Whether to combine training data with the default chat template to optimize model output. Type: Bool. Valid values: true, false. See the note below for the Qwen2 format. | true |
| system_prompt | System prompt used during model training. Type: String. | You are a helpful assistant |

For Qwen2 series models, apply_chat_template formats the data as follows:
- Question: <|im_end|>\n<|im_start|>user\n + instruction + <|im_end|>\n
- Answer: <|im_start|>assistant\n + output + <|im_end|>\n
After setting parameters, click Train.
-
In the Billing Notice dialog box, click OK.
The system opens the training task page.
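A note on the hyperparameters above: the per-GPU batch size and the gradient accumulation steps jointly determine the effective batch size per optimizer update. A small arithmetic sketch with the default values; num_gpus is an assumption that depends on the job resources you selected:

```python
# Effective batch size = samples per GPU per step x accumulation steps x GPU count.
per_device_train_batch_size = 1   # default from the table above
gradient_accumulation_steps = 8   # default from the table above
num_gpus = 1                      # assumption: single-GPU job

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # → 8
```

If you raise per_device_train_batch_size, you can lower gradient_accumulation_steps proportionally to keep the same effective batch size while using more GPU memory per step.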
Deploy the model service
After training completes, deploy the model as an EAS online service:
- On the training task page, click Deploy on the right side.
- In the deployment configuration panel, the system sets default values for Model Service Information and Resource Deployment Information. Modify them as needed. Then click Deploy.
- In the Billing Notice dialog box, click OK.
  The system opens the Deployment Task page. When the Status shows Running, the deployment has succeeded.
Call the model service
After successful deployment, use the API to run inference. For full usage, see Deploy large language models.
References
- For more information about EAS, see EAS overview.
- Using PAI-Model Gallery, you can easily deploy and fine-tune models for many scenarios—including Llama-3, Qwen1.5, and Stable Diffusion V1.5. For details, see Model Gallery use case collection.