Platform For AI: Prefill-decode disaggregation deployment of LLM services

Last Updated: Mar 18, 2025

Prefill-decode disaggregation is an important architectural design pattern for the deployment and management of large language models (LLMs). It separates the prefill phase from the decode phase to enhance deployment efficiency. This topic describes how to implement prefill-decode disaggregation deployment.

Limits

  • Prefill-decode disaggregation supports only blade-llm images of version 0.10.0 and later.

  • Prefill-decode disaggregation supports only Lingjun resources of the H20 or GP7V type.

    Note

    Currently, Lingjun resources are available in China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions, exclusively for whitelist users. To use Lingjun resources, contact your account manager.

  • Prefill-decode disaggregation supports only Qwen models, such as QwQ-32B and Qwen2.5-72B-Instruct.

Concepts

  • Prefill

    The prefill phase primarily handles the initial encoding of the input text and generates the initial hidden state. It often requires significant computational effort, because it encodes the entire input sequence. Caching prefill results can enhance the response speed for future requests.

  • Decode

    The decode phase gradually produces output text from the generated hidden state. Although this phase generates tokens one by one, it can handle multiple requests in parallel. It can also dynamically adjust the token generation length and strategy according to your requirements.
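
The two phases can be illustrated with a minimal, self-contained sketch. The code below is a toy simulation written for this topic, not the BladeLLM implementation; the TinyModel class and its fake token arithmetic are hypothetical and only mimic the control flow: prefill processes the whole prompt in one pass and fills the KV cache, while decode extends the cache one token at a time.

    # Toy illustration of the prefill and decode phases (no real model involved).
    from dataclasses import dataclass, field


    @dataclass
    class TinyModel:
        kv_cache: list = field(default_factory=list)  # stands in for per-layer KV tensors

        def prefill(self, prompt_tokens):
            """Encode the whole prompt in one pass and return the first output token."""
            # In a real engine this is a single batched forward pass over all prompt tokens.
            self.kv_cache.extend(prompt_tokens)
            return sum(prompt_tokens) % 100  # fake "next token"

        def decode_step(self, last_token):
            """Generate one token using the cached context plus the last token."""
            # In a real engine only the new token is processed; the KV cache supplies the rest.
            self.kv_cache.append(last_token)
            return (last_token + len(self.kv_cache)) % 100  # fake "next token"


    model = TinyModel()
    prompt = [11, 42, 7]              # pretend token IDs
    token = model.prefill(prompt)     # compute-heavy: the full prompt is processed at once
    output = [token]
    for _ in range(4):                # decode: one token per step, reusing the KV cache
        token = model.decode_step(token)
        output.append(token)
    print(output)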

Deploy prefill and decode services

Use one of the following methods:

Scenario-based model deployment

Take the following steps to deploy the prefill and decode services separately, but within the same group. The deployment process for both services is similar, but take note of the following:

  • Choose service type: When setting the Prefill-Decode Separation parameter, select the appropriate Service Type (prefill or decode).

  • Configure the run command: In Advanced Settings, specify the command that matches the service type.

Follow these steps:

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section of the Deploy Service page, select LLM Deployment.

  3. On the LLM Deployment page, configure the following key parameters and click Deploy.

    • Basic Information:

      • Service Name: Enter a name for the service. For example:

        • Prefill: qwq_32b_p

        • Decode: qwq_32b_d

      • Version: Select High-performance Deployment.

      • Image Version: Select an image of blade-llm:0.10.0 or later versions, such as blade-llm:0.10.0-rc12.

      • Model Settings: Select a model. For example, select the Public Model QwQ-32B-Preview. You can also select other public or custom models.

    • Resource Deployment

      • Resource Type: Select Resource Quota.

      • Resource Quota: You must purchase Lingjun resources and create a resource quota in advance. If no quota is available, click Associate Resource Quota to associate the created quota with the workspace.

      • Deployment Resources: Set the resources for the service. For example, to deploy QwQ-32B-Preview:

        • vCPUs: 16

        • Memory (GB): 125

        • GPUs: 1

        For other models, adjust according to the memory requirements.

    • Features

      Turn on Prefill-decode Separation and set the following parameters:

      • Group: Select New Group or Join. Make sure that the prefill service and decode service are in the same group. Example group name: qwq_32b_pd.

      • Service Type: Select the corresponding service type: Prefill or Decode.

      • RDMA Network: Enabled by default to ensure efficient network connectivity between machines.

        Note

        Currently, only services deployed using Lingjun resources support the RDMA network.

      • Environment Variables: Use the default value, ENABLE_MESSAGE_BUS:on.

    • Advanced Settings

      Click Switch to Free Edit Mode and configure Command Preview based on the type of service:

      • Prefill

        blade_llm_server --disable_prompt_cache --disable_cuda_graph --ragged_flash_max_batch_tokens=8000 --metric_export_interval_sec=5 --port 8001 -tp 1 --model /mnt/bladellm/model --enable_disagg --metric_exporters logger eas --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role prefill  --naming_url eas:http://127.0.0.1:9900
      • Decode

        blade_llm_server --disable_prompt_cache --disable_cuda_graph --ragged_flash_max_batch_tokens=8000 --metric_export_interval_sec=5 --port 8001 -tp 1 --model /mnt/bladellm/model --enable_disagg --metric_exporters logger eas --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role decode --naming_url eas:http://127.0.0.1:9900
    • VPC

      Select the virtual private cloud (VPC), vSwitch, and security group that align with the Lingjun resources. To view these, go to the Resource Quota page, click the desired quota name, and find the details in the Network Information section.

JSON deployment

Take the following steps to deploy the prefill and decode services separately, but within the same group.

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. Then, click JSON Deployment in the Custom Model Deployment section.

  3. On the JSON Deployment page, update the relevant parameters based on the type of service. Paste the corresponding JSON content into the editor, and then click Deploy.

    Here are the key parameters:

    Parameter

    Description

    cloud

    networking

    Set to the VPC (vpc_id), vSwitch (vswitch_id), and security group (security_group_id) that aligns with the Lingjun resources. To view these, go to the Resource Quota page, click the desired quota name, and find the details in the Network Information section.

    containers

    image

    The image address used for the service. Replace the region information in the image address with the actual region ID. For example, China (Ulanqab) is cn-wulanchabu.

    script

    The service startup command. Set the command according to the service type:

    • Prefill

      gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role prefill --naming_url eas:http://127.0.0.1:9900
    • Decode

      gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role decode --naming_url eas:http://127.0.0.1:9900

    metadata

    group

    The name of the service group. Make sure that the prefill service and decode service are in the same group.

    name

    The name of the service.

    quota_id

    The Lingjun resource quota ID. Go to the Resource Quota page to view it.

    rpc.infer_mode

    Set to the service type:

    • prefill

    • decode

    workspace_id

    The workspace ID. Go to the Workspace Details page to view it.

    cpu

    The resources used for the service. For example, to deploy QwQ-32B-Preview:

    • cpu: 16

    • memory (MB): 128000

    • gpu: 1

    For other models, adjust according to the memory requirements.

    gpu

    memory

    storage

    oss.path

    The OSS path where the model is located. For example, set to oss://pai-quickstart-cn-wulanchabu/modelscope/models/QwQ-32B/ for public models. You can also use other public models or your own fine-tuned models.

    oss.endpoint

    The OSS endpoint address.

    Here are sample JSON configurations for the prefill and decode services. Update the relevant parameters as described above. A small Python helper for generating both files from one template is provided after the decode sample.

    Prefill

    {
        "metadata": {
            "resource_burstable": false,
            "instance": 1,
            "rdma": 1,
            "rpc": {
                "infer_mode": "prefill"
            },
            "name": "qwq_32b_p",
            "group": "qwq_32b_pd",
            "quota_id": "quota1s47n5z****",
            "quota_type": "Lingjun",
            "workspace_id": "3**",
            "cpu": 16,
            "memory": 128000,
            "gpu": 1
        },
        "containers": [
            {
                "image": "eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/blade-llm:0.10.0rc12",
                "port": 8001,
                "env": [
                    {
                        "name": "ENABLE_MESSAGE_BUS",
                        "value": "on"
                    }
                ],
                "script": "gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role prefill --naming_url eas:http://127.0.0.1:9900"
            }
        ],
        "storage": [
            {
                "mount_path": "/model_dir/",
                "properties": {
                    "resource_type": "model",
                    "resource_use": "base"
                },
                "oss": {
                    "path": "oss://pai-quickstart-cn-wulanchabu/modelscope/models/QwQ-32B/",
                    "endpoint": "oss-cn-wulanchabu-internal.aliyuncs.com"
                }
            }
        ],
        "options": {
            "priority": 9
        },
        "cloud": {
            "networking": {
                "vpc_id": "vpc-0jl65jioii2v72bh9****",
                "vswitch_id": "vsw-0jlh0drtahzsooq3q****",
                "security_group_id": "sg-0jlcs30nnyf50o2x****"
            }
        }
    }

    Decode

    {
        "metadata": {
            "resource_burstable": false,
            "instance": 1,
            "rdma": 1,
            "rpc": {
                "infer_mode": "decode"
            },
            "name": "qwq_32b_d",
            "group": "qwq_32b_pd",
            "quota_id": "quota1s47n5z****",
            "quota_type": "Lingjun",
            "workspace_id": "3***",
            "cpu": 16,
            "memory": 128000,
            "gpu": 1
        },
        "containers": [
            {
                "image": "eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/blade-llm:0.10.0rc12",
                "port": 8001,
                "env": [
                    {
                        "name": "ENABLE_MESSAGE_BUS",
                        "value": "on"
                    }
                ],
                "script": "gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l); blade_llm_server --model /model_dir --host 0.0.0.0 --port 8001 -tp $gpu_count --metric_exporters logger eas --enable_disagg --disable_frontend_multiprocessing --disagg_pd.disagg_transfer_type rdma --disagg_pd.select_decode_max_batched 5 --disagg_pd.token_port 10030 --disagg_pd.inst_role decode --naming_url eas:http://127.0.0.1:9900"
            }
        ],
        "storage": [
            {
                "mount_path": "/model_dir/",
                "properties": {
                    "resource_type": "model",
                    "resource_use": "base"
                },
                "oss": {
                    "path": "oss://pai-quickstart-cn-wulanchabu/modelscope/models/QwQ-32B/",
                    "endpoint": "oss-cn-wulanchabu-internal.aliyuncs.com"
                }
            }
        ],
        "options": {
            "priority": 9
        },
        "cloud": {
            "networking": {
                "vpc_id": "vpc-0jl65jioii2v72bh9****",
                "vswitch_id": "vsw-0jlh0drtahzsooq3q****",
                "security_group_id": "sg-0jlcs30nnyf50o2x****"
            }
        }
    }
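
    Because the prefill and decode configurations differ only in metadata.name and metadata.rpc.infer_mode, you can generate both JSON files from one shared template. The following is a minimal sketch in Python; the base dictionary only mirrors the metadata block of the samples above, and the output file names are illustrative.

    # Generate the prefill and decode service descriptions from one shared template.
    # Adjust quota_id, workspace_id, and the omitted sections to your own environment.
    import copy
    import json

    base = {
        "metadata": {
            "resource_burstable": False,
            "instance": 1,
            "rdma": 1,
            "group": "qwq_32b_pd",
            "quota_id": "quota1s47n5z****",
            "quota_type": "Lingjun",
            "workspace_id": "3**",
            "cpu": 16,
            "memory": 128000,
            "gpu": 1,
        },
        # containers, storage, options, and cloud are omitted here for brevity;
        # copy them unchanged from the samples above.
    }

    for role, name in [("prefill", "qwq_32b_p"), ("decode", "qwq_32b_d")]:
        svc = copy.deepcopy(base)
        svc["metadata"]["name"] = name
        svc["metadata"]["rpc"] = {"infer_mode": role}
        with open(f"{name}.json", "w") as f:
            json.dump(svc, f, indent=4)
        print(f"wrote {name}.json")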

Model gallery one-click deployment

Deploy the prefill and decode services separately in Model Gallery. Take QwQ-32B as an example:

  • Deployment Method: Select Bladellm Accelerated Deployment, and choose according to the service type:

    • PD-Disaggregation-Prefill

    • PD-Disaggregation-Decode

  • Basic Information

    • Service Name: Enter a name for the service. For example:

      • Prefill: qwq_32b_p

      • Decode: qwq_32b_d

    • Group: Select New Group or Join. Make sure that the prefill service and decode service are in the same group. Example group name: qwq_32b_pd.

  • Resource Deployment

    • Resource Type: Select Resource Quota.

    • Resource Quota: You must purchase Lingjun resources and create a resource quota in advance. If no quota is available, click Associate Resource Quota to associate the created quota with the workspace.

    • Deployment Resources: Set the resources required for the service. For example, to deploy QwQ-32B:

      • vCPUs: 16

      • Memory (GB): 125

      • GPUs: 1

      For other models, adjust according to the memory requirements.

  • VPC

    • VPC, vSwitch, Security Group Name: Select the virtual private cloud (VPC), vSwitch, and security group that align with the Lingjun resources. To view these, go to the Resource Quota page, click the desired quota name, and find the details in the Network Information section.

Send requests

To send requests, you only need to call the prefill service. On the Elastic Algorithm Service (EAS) page, find the prefill service and click Invocation Method in the Service Type column to view its endpoint and token. A Python client sketch is provided after the sample response below.

Note

You can choose to use the public endpoint or the VPC endpoint. If using the VPC endpoint, make sure that your client is within the same VPC as the service.


  • Sample request:

    curl -v <service_url>/v1/chat/completions \
      -H "Authorization: <token>" \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant."
          },
          {
            "role": "user",
            "content": "Hello!"
          }
        ],
        "max_tokens": 10,
        "stream": true
      }'

    Where:

    • <service_url>: Replace with the endpoint. Example: http://**********.cn-wulanchabu.pai-eas.aliyuncs.com/api/predict/qwq_32b_pd.qwq_32b_p.

    • <token>: Replace with the token.

  • Sample response:

    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"role":"assistant","content":"Hello"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":1,"total_tokens":22},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" there"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":2,"total_tokens":23},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":"!"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":3,"total_tokens":24},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" How"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":4,"total_tokens":25},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" can"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":5,"total_tokens":26},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" I"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":6,"total_tokens":27},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" assist"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":7,"total_tokens":28},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" you"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":8,"total_tokens":29},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":" today"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":9,"total_tokens":30},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"","index":0,"logprobs":null,"delta":{"content":"?"}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":10,"total_tokens":31},"error_info":null}
    
    data: {"id":"2d14****-697b-43f3-9d62-1bb4dde6****","choices":[{"finish_reason":"length","index":0,"logprobs":null,"delta":{"content":""}}],"object":"chat.completion.chunk","usage":{"prompt_tokens":21,"completion_tokens":11,"total_tokens":32},"error_info":null}
    
    data: [DONE]
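
  • Python example: The following is a minimal client sketch that makes the same streaming call from Python using the requests library. Replace the service_url and token placeholders with the endpoint and token from Invocation Method, as in the curl example above.

    # Minimal Python client for the prefill service (streaming chat completion).
    # service_url and token are placeholders; fill in your own values.
    import json

    import requests

    service_url = "<service_url>"   # the endpoint shown in Invocation Method
    token = "<token>"               # the token shown in Invocation Method

    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "max_tokens": 10,
        "stream": True,
    }

    with requests.post(
        f"{service_url}/v1/chat/completions",
        headers={"Authorization": token, "Content-Type": "application/json"},
        json=payload,
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # Each event is a "data: {...}" line; the stream ends with "data: [DONE]".
            if not line or not line.startswith("data:"):
                continue
            data = line[len("data:"):].strip()
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
    print()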

Performance testing

Note

The following results are for reference only. Your actual results may vary.

Take the evaluation of Qwen2.5-72B-Instruct as an example. The model is deployed on the Lingjun multi-tenant GP7V instance type, with an average input-to-output token ratio of 1100:140. The performance improvements are as follows:

  • 2 Prefill + 1 Decode: QPS improvement 25.5%, TPS improvement 7.1%

  • 3 Prefill + 2 Decode: QPS improvement 20.7%, TPS improvement 14.3%

The optimal prefill-to-decode ratio depends on your business attributes and TPS target:

  • TPS target 20: Prefill : Decode = 1 : 2.3, for example, 10 prefill instances and 23 decode instances.

  • TPS target 15: Prefill : Decode = 1.4 : 1, for example, 14 prefill instances and 10 decode instances.
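
To translate a target ratio and an instance budget into concrete prefill and decode counts, you can use a small arithmetic helper such as the illustrative Python sketch below (it only does the rounding; it does not interact with PAI).

    # Convert a prefill:decode ratio and a total instance budget into instance counts.
    # Purely illustrative arithmetic; it does not query or configure any PAI resources.
    def split_instances(prefill_ratio: float, decode_ratio: float, total: int) -> tuple:
        prefill = round(total * prefill_ratio / (prefill_ratio + decode_ratio))
        prefill = min(max(prefill, 1), total - 1)  # keep at least one instance of each type
        return prefill, total - prefill


    # Example: a 1:2.3 ratio over 33 instances gives 10 prefill and 23 decode instances,
    # matching the example configuration for a TPS target of 20.
    print(split_instances(1, 2.3, 33))   # (10, 23)
    print(split_instances(1.4, 1, 24))   # (14, 10)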