Prefill-decode disaggregation is an important architectural design pattern for the deployment and management of large language models (LLMs). It separates the prefill phase from the decode phase to enhance deployment efficiency. This topic describes how to implement prefill-decode disaggregation deployment.
Limits
Prefill-decode disaggregation supports only images of blade-llm:0.10.0 and later versions.
Prefill-decode disaggregation supports only Lingjun resources of the H20 or GP7V type.
Note: Currently, Lingjun resources are available in the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions, and only to whitelisted users. To use Lingjun resources, contact your account manager.
Prefill-decode disaggregation supports only Qwen models, such as QwQ-32B and Qwen2.5-72B-Instruct.
Concepts
Prefill
The prefill phase primarily handles the initial encoding of the input text and generates the initial hidden state. It often requires significant computational effort, because it encodes the entire input sequence. Caching prefill results can enhance the response speed for future requests.
Decode
The decode phase gradually produces output text from the generated hidden state. Although this phase generates tokens one by one, it can handle multiple requests in parallel. It can also dynamically adjust the token generation length and strategy according to your requirements.
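The two phases therefore have different profiles: prefill does one compute-heavy pass over the whole prompt, while decode runs an iterative loop that emits one token per step from the cached state. A toy sketch of this division of labor is shown below (illustrative Python only, not BladeLLM code; all names are made up for the example):

```python
# Toy illustration only, not BladeLLM internals: prefill encodes the whole prompt
# once and builds the cached state; decode reuses that cache to emit one token per step.

def prefill(prompt_tokens):
    # Single pass over the entire prompt. In a real LLM this produces the KV cache
    # (the "hidden state" that is handed over to the decode phase).
    return [f"state({tok})" for tok in prompt_tokens]

def decode(kv_cache, max_new_tokens):
    output = []
    for step in range(max_new_tokens):
        # Each decode step reads the cache plus previously generated tokens
        # and appends exactly one new token (sampling is faked here).
        token = f"<tok{step}>"
        kv_cache.append(f"state({token})")
        output.append(token)
    return output

cache = prefill(["Hello", ",", "world", "!"])   # prefill phase
print(decode(cache, max_new_tokens=3))          # decode phase: ['<tok0>', '<tok1>', '<tok2>']
```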
Deploy prefill and decode services
Use one of the following methods:
Scenario-based model deployment
Take the following steps to deploy the prefill and decode services separately, but within the same group. The deployment process for both services is similar, but take note of the following:
Choose service type: When setting the Prefill-Decode Separation parameter, select the appropriate Service Type (prefill or decode).
Configure the run command: In Advanced Settings, specify the startup command that corresponds to the service type.
Follow these steps:
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section of the Deploy Service page, select LLM Deployment.
On the LLM Deployment page, configure the following key parameters and click Deploy.
Basic Information:
Parameter
Description
Service Name
Enter a name for the service. For example:
Prefill: qwq_32b_p
Decode: qwq_32b_d
Version
Select High-performance Deployment.
Image Version
Select an image of blade-llm:0.10.0 or a later version, such as blade-llm:0.10.0-rc12.
Model Settings
Select a model. For example, select the Public Model QwQ-32B-Preview. You can also select other public or custom models.
Resource Deployment
Parameter
Description
Resource Type
Select Resource Quota.
Resource Quota
You need to purchase Lingjun resources and create a resource quota in advance. If no quota is available, click Associate Resource Quota to associate the created quota with the workspace.
Deployment Resources
Set the resources for the service. For example, to deploy QwQ-32B-Preview:
vCPUs: 16
Memory (GB): 125
GPUs: 1
For other models, adjust according to the memory requirements.
Features
Turn on Prefill-decode Separation and set the following parameters:
Parameter
Description
Group
Select New Group or Join. Make sure that the prefill service and decode service are in the same group. Example of the group name: qwq_32b_pd.
Service Type
Select the corresponding service type:
Prefill.
Decode.
RDMA Network
By default, RDMA Network is enabled to ensure efficient network connectivity between machines.
Note: Currently, only services deployed on Lingjun resources support the RDMA network.
Environment Variables
Use the default value ENABLE_MESSAGE_BUS: on.
Advanced Settings
Click Switch to Free Edit Mode and configure Command Preview based on the service type. The prefill and decode commands below are identical except for the --disagg_pd.inst_role flag:
Prefill
blade_llm_server --disable_prompt_cache --disable_cuda_graph --ragged_flash_max_batch_tokens=8000 --metric_export_interval_sec=5 --port 8001 -tp 1 --model /mnt/bladellm/model --enable_disagg --metric_exporters logger eas --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role prefill --naming_url eas:http://127.0.0.1:9900
Decode
blade_llm_server --disable_prompt_cache --disable_cuda_graph --ragged_flash_max_batch_tokens=8000 --metric_export_interval_sec=5 --port 8001 -tp 1 --model /mnt/bladellm/model --enable_disagg --metric_exporters logger eas --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role decode --naming_url eas:http://127.0.0.1:9900
VPC
Select the virtual private cloud (VPC), vSwitch, and security group that align with the Lingjun resources. To view these, go to the Resource Quota page, click the desired quota name, and find the details in the Network Information section.
JSON deployment
Take the following steps to deploy the prefill and decode services separately, but within the same group.
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
On the Model Online Service (EAS) page, click Deploy Service. Then, click JSON Deployment in the Custom Model Deployment section.
On the JSON Deployment page, update the relevant parameters based on the type of service. Paste the corresponding JSON content into the editor, and then click Deploy.
Here are the key parameters:
Parameter
Description
cloud
networking
Set to the VPC (vpc_id), vSwitch (vswitch_id), and security group (security_group_id) that align with the Lingjun resources. To view these, go to the Resource Quota page, click the desired quota name, and find the details in the Network Information section.
containers
image
The image address used for the service. Replace the region information in the image address with the actual region ID. For example, China (Ulanqab) is cn-wulanchabu.
script
The service startup command. Set the command according to the service type:
Prefill
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role prefill --naming_url eas:http://127.0.0.1:9900
Decode
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role decode --naming_url eas:http://127.0.0.1:9900
metadata
group
The name of the service group. Make sure that the prefill service and decode service are in the same group.
name
The name of the service.
quota_id
The Lingjun resource quota ID. Go to the Resource Quota page to view it.
rpc.infer_mode
Set to the service type:
prefill
decode
workspace_id
The workspace ID. Go to the Workspace Details page to view it.
cpu, gpu, memory
The resources used for the service. For example, to deploy QwQ-32B-Preview:
cpu: 16
memory (MB): 128000
gpu: 1
For other models, adjust according to the memory requirements.
storage
oss.path
The OSS path where the model is located. For example, set to oss://pai-quickstart-cn-wulanchabu/modelscope/models/QwQ-32B/ for public models. You can also use other public models or your own fine-tuned models.
oss.endpoint
The endpoint of the OSS bucket. For example, oss-cn-wulanchabu-internal.aliyuncs.com.
Here are the samples for the prefill and decode services. You need to update the relevant parameters as described in the table above.
Prefill
{ "metadata": { "resource_burstable": false, "instance": 1, "rdma": 1, "rpc": { "infer_mode": "prefill" }, "name": "qwq_32b_p", "group": "qwq_32b_pd", "quota_id": "quota1s47n5z****", "quota_type": "Lingjun", "workspace_id": "3**", "cpu": 16, "memory": 128000, "gpu": 1 }, "containers": [ { "image": "eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/blade-llm:0.10.0rc12", "port": 8001, "env": [ { "name": "ENABLE_MESSAGE_BUS", "value": "on" } ], "script": "gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role prefill --naming_url eas:http://127.0.0.1:9900" } ], "storage": [ { "mount_path": "/model_dir/", "properties": { "resource_type": "model", "resource_use": "base" }, "oss": { "path": "oss://pai-quickstart-cn-wulanchabu/modelscope/models/QwQ-32B/", "endpoint": "oss-cn-wulanchabu-internal.aliyuncs.com" } } ], "options": { "priority": 9 }, "cloud": { "networking": { "vpc_id": "vpc-0jl65jioii2v72bh9****", "vswitch_id": "vsw-0jlh0drtahzsooq3q****", "security_group_id": "sg-0jlcs30nnyf50o2x****" } } }
Decode
{ "metadata": { "resource_burstable": false, "instance": 1, "rdma": 1, "rpc": { "infer_mode": "decode" }, "name": "qwq_32b_d", "group": "qwq_32b_pd", "quota_id": "quota1s47n5z****", "quota_type": "Lingjun", "workspace_id": "3***", "cpu": 16, "memory": 128000, "gpu": 1 }, "containers": [ { "image": "eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/blade-llm:0.10.0rc12", "port": 8001, "env": [ { "name": "ENABLE_MESSAGE_BUS", "value": "on" } ], "script": "gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role decode --naming_url eas:http://127.0.0.1:9900" } ], "storage": [ { "mount_path": "/model_dir/", "properties": { "resource_type": "model", "resource_use": "base" }, "oss": { "path": "oss://pai-quickstart-cn-wulanchabu/modelscope/models/QwQ-32B/", "endpoint": "oss-cn-wulanchabu-internal.aliyuncs.com" } } ], "options": { "priority": 9 }, "cloud": { "networking": { "vpc_id": "vpc-0jl65jioii2v72bh9****", "vswitch_id": "vsw-0jlh0drtahzsooq3q****", "security_group_id": "sg-0jlcs30nnyf50o2x****" } } }
Model gallery one-click deployment
Deploy the prefill and decode services separately in Model Gallery. Take QwQ-32B as an example:
| Parameter | | Description |
| --- | --- | --- |
| Deployment Method | | Select Bladellm Accelerated Deployment and choose the prefill or decode type according to the service that you are deploying. |
| Basic Information | Service Name | Enter a name for the service. For example: qwq_32b_p for the prefill service and qwq_32b_d for the decode service. |
| | Group | Select New Group or Join. Make sure that the prefill service and decode service are in the same group. Example of group name: qwq_32b_pd. |
| Resource Deployment | Resource Type | Select Resource Quota. |
| | Resource Quota | You need to purchase Lingjun resources and create a resource quota in advance. If no quota is available, click Associate Resource Quota to associate the created quota with the workspace. |
| | Deployment Resources | Set the resources required for the service. For example, to deploy QwQ-32B, use the same specifications as above (vCPUs: 16, Memory (GB): 125, GPUs: 1). For other models, adjust according to the memory requirements. |
| VPC | VPC, vSwitch, Security Group Name | Select the virtual private cloud (VPC), vSwitch, and security group that align with the Lingjun resources. To view these, go to the Resource Quota page, click the desired quota name, and find the details in the Network Information section. |
Send requests
To send requests, you only need to call the prefill service. On the Elastic Algorithm Service (EAS) page, find the prefill service and click Invocation Method in the Service Type column to view its endpoint and token.
You can choose to use the public endpoint or the VPC endpoint. If using the VPC endpoint, make sure that your client is within the same VPC as the service.
Sample request:
curl -v <service_url>/v1/chat/completions \
  -H "Authorization: <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "max_tokens": 10,
    "stream": true
  }'
Where:
<service_url>: Replace with the endpoint of the prefill service. Example: http://**********.cn-wulanchabu.pai-eas.aliyuncs.com/api/predict/qwq_32b_pd.qwq_32b_p.
<token>: Replace with the token of the prefill service.
Sample response:
data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"role":"assistant","content":"Hello"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":1,"total_tokens":22},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" there"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":2,"total_tokens":23},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":"!"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":3,"total_tokens":24},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" How"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":4,"total_tokens":25},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" can"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":5,"total_tokens":26},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" I"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":6,"total_tokens":27},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" assist"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":7,"total_tokens":28},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" you"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":8,"total_tokens":29},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" today"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":9,"total_tokens":30},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":"?"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":10,"total_tokens":31},"error_info":null} data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"length","index":0,"logprobs":null,"delta":{"content":""}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":11,"total_tokens":32},"error_info":null} data: [DONE]
Performance testing
The following results are for reference only. Your actual results may vary.
Take the evaluation of Qwen2.5-72B-Instruct as an example. The model is deployed on the Lingjun multi-tenant GP7V instance type, with an average input-to-output token ratio of 1100:140. Here are the performance improvements:
| Configuration | QPS improvement | TPS improvement |
| --- | --- | --- |
| 2 Prefill + 1 Decode | 25.5% | 7.1% |
| 3 Prefill + 2 Decode | 20.7% | 14.3% |
Based on different business attributes and TPS targets, the optimal ratio of prefill to decode instances is as follows:
| TPS target | PD ratio | Example configuration |
| --- | --- | --- |
| 20 | Prefill : Decode = 1 : 2.3 | 10 Prefill and 23 Decode |
| 15 | Prefill : Decode = 1.4 : 1 | 14 Prefill and 10 Decode |
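The example configurations are simply the ratios scaled up to whole instance counts. A small illustrative check follows (the scale factor of 10 is only what these particular examples happen to use, not a recommendation):

```python
# Illustrative arithmetic only: turn a prefill:decode ratio into whole instance counts.
from math import ceil

def instance_counts(prefill_ratio: float, decode_ratio: float, scale: int):
    return ceil(prefill_ratio * scale), ceil(decode_ratio * scale)

print(instance_counts(1, 2.3, 10))    # (10, 23) -> matches the TPS-20 row
print(instance_counts(1.4, 1, 10))    # (14, 10) -> matches the TPS-15 row
```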