
Platform for AI: Improve inference efficiency with the LLM intelligent router

Last Updated: Dec 18, 2025

In large language model (LLM) applications, traditional load balancing strategies struggle to accurately gauge the real-time load on backend instances: request and response lengths vary widely, the number of tokens processed in the prompt (prefill) and generation phases is unpredictable, and so is the resulting GPU resource consumption. These factors lead to uneven instance loads, which reduces system throughput and response efficiency. To address this issue, Elastic Algorithm Service (EAS) provides the LLM intelligent router component. The router dynamically distributes requests based on LLM-specific metrics to balance the computing power and GPU memory usage across inference instances, improving cluster resource utilization and system stability.

How it works

Architecture overview

The LLM intelligent router consists of three core components that provide intelligent traffic distribution and management for the backend LLM inference instance cluster: LLM Gateway, LLM Scheduler, and LLM Agent.

  • LLM Gateway: Serves as the traffic entry point. It receives all user requests and forwards them to the designated backend inference instance based on decisions from the LLM Scheduler. The gateway supports HTTP (including SSE) and WebSocket protocols and can buffer requests when backend inference instances are under high load.

  • LLM Scheduler: The brain of the intelligent router. It executes the scheduling algorithm by gathering real-time metrics from all LLM Agents. It then calculates the optimal target instance for each incoming request based on a predefined scheduling policy, such as prefix caching.

  • LLM Agent: Deployed as a sidecar container alongside each inference instance. The agent collects performance metrics from the inference engine, maintains a heartbeat with the LLM Scheduler, and reports the health status and load data of the instance.


Implementation flow

The LLM intelligent router is a special type of EAS service that must be deployed in the same service group as the inference service to function correctly. After you deploy the LLM intelligent router and the inference service, the router intelligently schedules requests to the backend inference service. The process is as follows:

  1. Instance registration: After the inference service starts, the LLM Agent waits for the inference engine to be ready. It then registers the instance with the LLM Scheduler and periodically reports its health status and performance metrics.

  2. Traffic ingress: User requests first arrive at the LLM Gateway. The gateway supports HTTP (SSE) and WebSocket protocols.

  3. Scheduling request: The LLM Gateway sends a scheduling request to the LLM Scheduler.

  4. Intelligent scheduling: The LLM Scheduler selects the optimal backend instance based on its scheduling policy and real-time metrics from each LLM Agent.

  5. Request forwarding: The LLM Scheduler returns its decision to the LLM Gateway. The LLM Gateway then forwards the original user request to the target instance for inference.

  6. Request buffering: If all backend instances are under high load, the LLM Gateway temporarily queues new requests and forwards them once the LLM Scheduler finds available capacity, which prevents request failures, as illustrated in the sketch below.
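The selection-and-buffering logic described above can be sketched conceptually. The following Python snippet is not the actual EAS implementation; the instance fields, the KV Cache threshold, and the function names are hypothetical and only illustrate how a scheduler might pick the least-loaded healthy instance and queue requests when every instance is saturated.

from collections import deque
from dataclasses import dataclass

@dataclass
class InstanceMetrics:
    # Hypothetical load snapshot reported by an LLM Agent heartbeat.
    name: str
    healthy: bool
    running_requests: int   # for example, vllm:num_requests_running
    waiting_requests: int   # for example, vllm:num_requests_waiting
    kv_cache_usage: float   # for example, vllm:gpu_cache_usage_perc (0.0-1.0)

def pick_instance(instances, kv_cache_limit=0.9):
    """Return the least-loaded healthy instance, or None if all are saturated."""
    candidates = [m for m in instances if m.healthy and m.kv_cache_usage < kv_cache_limit]
    if not candidates:
        return None  # the gateway should buffer the request and retry later
    return min(candidates, key=lambda m: (m.running_requests, m.waiting_requests))

# Requests that cannot be scheduled immediately wait in a bounded queue,
# similar in spirit to the gateway's max_queue_size behavior.
pending = deque(maxlen=128)

def route(request, instances):
    target = pick_instance(instances)
    if target is None:
        pending.append(request)  # buffered until capacity frees up
        return None
    return target.name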

Failover mechanism

The system is designed with multiple layers of fault tolerance to ensure service stability:

  • LLM Gateway: As a stateless traffic ingress layer, the `LLM Gateway` should be deployed with at least two instances. If one instance fails, traffic is automatically routed to other healthy instances. This ensures service continuity and high availability (HA).

  • LLM Scheduler: As the request scheduling component, the `LLM Scheduler` runs as a single instance to achieve global scheduling. If the LLM Scheduler fails, the LLM Gateway automatically enters a degraded mode after a heartbeat failure. It then reverts to a polling policy to forward requests directly to backend instances. This ensures service availability but sacrifices scheduling performance. After the LLM Scheduler recovers, the LLM Gateway automatically switches back to smart routing mode.

  • Inference instance or LLM Agent: If an inference instance or its associated LLM Agent fails, the heartbeat between the LLM Agent and the LLM Scheduler is interrupted. The LLM Scheduler immediately removes the failed instance from the list of available services and stops assigning new traffic to it. After the instance recovers and resumes sending heartbeats, it is automatically added back to the service list.

Support for multiple inference engines

Different LLM inference engines expose different metrics through their /metrics interfaces. The `LLM Agent` collects these metrics, formats them into a unified structure, and reports them. This design allows the `LLM Scheduler` to focus on scheduling algorithms based on unified metrics without needing to know the implementation details of specific inference engines. The currently supported LLM inference engines and their corresponding collected metrics are as follows:

| LLM inference engine | Metric | Description |
| --- | --- | --- |
| BladeLLM | decode_batch_size_mean | The number of running requests. |
| BladeLLM | wait_queue_size_mean | The number of waiting requests in the queue. |
| BladeLLM | block_usage_gpu_mean | The GPU KV Cache usage. |
| BladeLLM | tps_total | The total number of tokens processed per second. |
| BladeLLM | tps_out | The number of generated tokens per second. |
| vLLM | vllm:num_requests_running | The number of running requests. |
| vLLM | vllm:num_requests_waiting | The number of waiting requests in the queue. |
| vLLM | vllm:gpu_cache_usage_perc | The GPU KV Cache usage. |
| vLLM | vllm:prompt_tokens_total | The total number of prompt tokens. |
| vLLM | vllm:generation_tokens_total | The total number of generated tokens. |
| SGLang | sglang:num_running_reqs | The number of running requests. |
| SGLang | sglang:num_queue_reqs | The number of waiting requests in the queue. |
| SGLang | sglang:token_usage | The KV Cache usage. |
| SGLang | sglang:prompt_tokens_total | The total number of prompt tokens. |
| SGLang | sglang:gen_throughput | The number of generated tokens per second. |
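To illustrate where these numbers come from, the snippet below reads a vLLM instance's Prometheus-format metrics endpoint and prints the counters listed in the table. The host and port are assumptions for a locally running engine; adjust them to match your deployment.

import requests

# Assumed address of a locally running vLLM server; adjust for your deployment.
METRICS_URL = "http://localhost:8000/metrics"
# Metric names collected by the LLM Agent for vLLM (see the table above).
WANTED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
    "vllm:prompt_tokens_total",
    "vllm:generation_tokens_total",
)
text = requests.get(METRICS_URL, timeout=5).text
for line in text.splitlines():
    # Skip Prometheus comment lines and keep only the metrics of interest.
    if not line.startswith("#") and line.startswith(WANTED):
        print(line)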

Limits

  • Not supported for service updates: You can configure the LLM intelligent router only when you create a new service. You cannot add an intelligent router to an existing inference service using the Update Service operation.

  • Inference engine restrictions: Currently, only the PAI-BladeLLM, vLLM, and SGLang inference engines are supported.

  • Multiple inference instances recommended: The LLM intelligent router is most effective when you deploy multiple inference instances.

Deploy services

Step 1: Deploy an LLM intelligent router service

The following two deployment methods are supported:

Method 1: Deploy a service in the console

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service, and then select Scenario-based Model Deployment > Deploy LLM gateway.

  3. On the Deploy LLM gateway page, configure the following key parameters and click Deploy.

    • Service Name (in the Basic Information section): Enter a custom service name, such as llm_gateway.

    • Deployment Resources: The resource configuration of the LLM Gateway. To ensure high availability, the Minimum Number of Instances defaults to 2. We recommend that you keep this setting. The default CPU is 4 cores, and the default Memory is 8 GB.

    • Scheduling Configuration: The resource configuration of the LLM Scheduler. The default CPU is 2 cores, and the default Memory is 4 GB.

    • Scheduling Policy: The system selects the optimal backend inference instance based on the scheduling policy. The following policies are supported:

      • Prefix Cache: A comprehensive KV Cache affinity scheduling policy that makes decisions based on multiple metrics. It forwards requests to instances that have already cached the corresponding KV Cache to maximize request processing efficiency. When you use this policy, you must enable the prefix caching feature of the engine.

      • LLM Metrics: Intelligently allocates service traffic based on various monitoring metrics of the LLM service to maximize resource utilization.

      • Minimum Requests: Preferentially assigns new requests to the instance with the fewest current requests.

      • Minimum Tokens: Preferentially assigns new requests to the instance that is currently processing the fewest tokens.

      • Static PD Disaggregation: In an LLM deployment that separates the Prefill and Decode stages, select this policy to maximize scheduling efficiency. After you select this policy, you must set separate scheduling policies for Prefill and Decode.

      • Dynamic PD Disaggregation: Dynamically switches instances between the Prefill and Decode roles based on key metrics such as actual service load, KV Cache usage, and GPU utilization. In an LLM deployment that uses dynamic PD disaggregation, select this policy to maximize scheduling efficiency.

After a successful deployment, the system automatically creates a group service whose name uses the format group_<LLM intelligent router service name>. You can view this service on the Phased Release tab of the Elastic Algorithm Service (EAS) page.

Note

The intelligent router and the service queue conflict with each other. They cannot coexist in the same service group.

Method 2: Deploy a service using JSON

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click On-Premises Deployment.

  3. The following sections provide a JSON configuration example and parameter descriptions. After you complete the configuration, click Deploy.

    • Configuration example:

      Basic configuration

      You can use the basic configuration to quickly deploy an LLM intelligent router service. Parameters that are not configured use their default values.

      LLM Scheduler is configured with 4 CPU cores and 4 GB of memory by default. The default scheduling policy is Prefix Cache.
      {
          "metadata": {
              "type": "LLMGatewayService",
              "cpu": 4,
              "group": "group_llm_gateway",
              "instance": 2,
              "memory": 8000,
              "name": "llm_gateway"
          }
      }

      Advanced configuration

      {
          "cloud": {
              "computing": {
                  "instance_type": "ecs.c7.large"
              }
          },
          "llm_gateway": {
              "max_queue_size": 128,
              "retry_count": 2,
              "wait_schedule_timeout": 5000,
              "wait_schedule_try_period": 500
          },
          "llm_scheduler": {
              "cpu": 2,
              "memory": 4000,
              "policy": "prefix-cache"
          },
          "metadata": {
              "group": "group_llm_gateway",
              "instance": 2,
              "name": "llm_gateway",
              "type": "LLMGatewayService"
          }
      }
    • Parameter descriptions:

      • metadata

        • type: Set this parameter to LLMGatewayService to deploy an LLM intelligent router service. After the service is deployed, EAS automatically creates a composite service that contains an LLM Gateway and an LLM Scheduler.

        • instance: The number of LLM Gateway instances. We recommend that you set this parameter to 2 or more to prevent a single point of failure.

        • cpu: The CPU of the LLM Gateway.

        • memory: The memory of the LLM Gateway.

        • group: The service group to which the LLM intelligent router service belongs.

      • cloud.computing.instance_type: The instance type used by the LLM Gateway. If you specify this parameter, you do not need to configure metadata.cpu and metadata.memory.

      • llm_gateway

        • max_queue_size: The maximum length of the LLM Gateway cache queue. Default value: 512. If the number of requests exceeds the processing capacity of the backend inference framework, the excess requests are cached in this queue to wait for scheduling.

        • retry_count: The number of retries. Default value: 2. If a backend inference instance is abnormal, the request is retried and forwarded to another instance.

        • wait_schedule_timeout: The timeout period for scheduling a request. Default value: 10 seconds. If the backend engine is fully loaded, the request is periodically retried for scheduling during this period.

        • wait_schedule_try_period: The interval between scheduling retries. Default value: 1 second.

      • llm_scheduler

        • cpu: The CPU of the LLM Scheduler. Default value: 4 cores.

        • memory: The memory of the LLM Scheduler. Default value: 4 GB.

        • policy: The scheduling policy. For more information, see the policy descriptions in the console-based deployment method. Default value: prefix-cache. Valid values:

          • prefix-cache: Prefix Cache.

          • llm-metric-based: LLM Metrics.

          • least-request: Minimum Requests.

          • least-token: Minimum Tokens.

          • pd-split: Static PD Disaggregation.

          • dynamic-pd-split: Dynamic PD Disaggregation.

        • prefill_policy: The scheduling policy for the Prefill stage. Required if you set policy to pd-split. Valid values: prefix-cache, llm-metric-based, least-request, and least-token.

        • decode_policy: The scheduling policy for the Decode stage. Required if you set policy to pd-split. Valid values: prefix-cache, llm-metric-based, least-request, and least-token.
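      For example, to use Static PD Disaggregation, set policy to pd-split and specify prefill_policy and decode_policy. The following Python sketch assembles such a configuration and writes it to a file; the service and group names are placeholders, and the chosen sub-policies are only an example.

      import json

      # Placeholder names; the policy values correspond to the parameters above.
      config = {
          "metadata": {
              "type": "LLMGatewayService",
              "name": "llm_gateway",
              "group": "group_llm_gateway",
              "instance": 2,
              "cpu": 4,
              "memory": 8000,
          },
          "llm_scheduler": {
              "cpu": 2,
              "memory": 4000,
              "policy": "pd-split",              # Static PD Disaggregation
              "prefill_policy": "prefix-cache",  # scheduling policy for the Prefill stage
              "decode_policy": "least-request",  # scheduling policy for the Decode stage
          },
      }
      with open("llm_gateway.json", "w") as f:
          json.dump(config, f, indent=4)

      You can then paste the generated JSON into the configuration editor described in this method.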

Step 2: Deploy a large language model (LLM) service

When you deploy an LLM service, you must associate it with the intelligent router service that you created. This topic uses the EAS scenario-based deployment of an LLM as an example. The procedure is as follows:

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service page, click Deploy Service, and then select LLM Deployment. In the Features section, find and turn on the LLM Intelligent Router switch. From the drop-down list, select the LLM intelligent router service that you deployed in Step 1. For more information about the other parameters, see LLM Deployment.

    Note

    When you use vLLM for accelerated deployment and the Scheduling Policy of the LLM intelligent router service is set to Prefix Cache, you must enable the prefix caching feature of the engine.

  3. After you configure the parameters, click Deploy.

Invoke and test the service

Send all requests to the endpoint of the LLM intelligent router service, not to the specific backend inference service.

Obtain access credentials

  1. On the Elastic Algorithm Service (EAS) page, find the deployed LLM intelligent router service.

  2. Click the service name to go to the Overview page. In the Basic Information section, click View Endpoint Information.

  3. On the Invocation Method page, copy the Internet Endpoint and Token from the Service-specific Traffic Entry section.

Construct and send a request

The final request URL is a combination of the endpoint of the LLM intelligent router and the API path of the model service.

  • URL structure: <LLM intelligent router endpoint>/<LLM service API path>

  • Example: http://********.pai-eas.aliyuncs.com/api/predict/group_llm_gateway.llm_gateway/v1/chat/completions

Request example:

# Replace <YOUR_GATEWAY_URL> and <YOUR_TOKEN> with your actual information.
# Replace <model_name> with the actual model name.
curl -X POST "<YOUR_GATEWAY_URL>/v1/chat/completions" \
     -H "Authorization: Bearer <YOUR_TOKEN>" \
     -H "Content-Type: application/json" \
     -N \
     -d '{
           "model": "<model_name>",
           "messages": [{"role": "user", "content": "Hello"}],
           "stream": true
         }'

The following is an example of the returned result:

data: {"id":"chatcmpl-9a9f8299*****","object":"chat.completion.chunk","created":1762245102,"model":"Qwen3-8B","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-9a9f8299*****","object":"chat.completion.chunk","created":1762245102,"model":"Qwen3-8B","choices":[{"index":0,"delta":{"content":"<think>","tool_calls":[]}}]}

...
data: [DONE]
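Because the request and response above follow the OpenAI-style chat completions format, you can also call the router from Python. The following sketch uses the openai SDK; the gateway URL, token, and model name are placeholders that you replace with the values obtained earlier.

from openai import OpenAI

# Placeholders: use the endpoint and token from the Invocation Method page.
client = OpenAI(
    base_url="<YOUR_GATEWAY_URL>/v1",
    api_key="<YOUR_TOKEN>",
)
stream = client.chat.completions.create(
    model="<model_name>",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries an incremental piece of the reply.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()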

View service monitoring metrics

After you deploy the service, you can view its core performance metrics in the EAS console to evaluate the effectiveness of the intelligent router.

On the Elastic Algorithm Service (EAS) page, click the deployed LLM intelligent router service to go to the service details page. On the Monitoring tab, you can view the following core metrics:

Token Throughput

The throughput of LLM input and output tokens.

  • IN: The throughput of LLM input tokens.

  • OUT: The throughput of LLM output tokens.

GPU Cache Usage

The GPU KV Cache usage of the LLM Engine.


Engine Current Requests

The number of concurrent real-time requests for the LLM Engine.


  • Running: The number of requests that the LLM Engine is currently processing.

  • Waiting: The number of requests in the waiting queue of the LLM Engine.

Gateway Current Requests

The number of real-time requests for the LLM intelligent router.


  • Total: The total number of requests that are received by the LLM intelligent router. This value indicates the total number of real-time concurrent requests.

  • Pending: The number of requests that are cached in the LLM intelligent router and have not been processed by the LLM Engine.

Time To First Token

The latency of the first token of a request.


  • Max: The maximum latency of the first token of a request.

  • Avg: The average latency of the first token of a request.

  • Min: The minimum latency of the first token of a request.

  • TPxx: The percentile values for the latency of the first token of a request.

Time Per Output Token

The latency of each token of a request.


  • Max: The maximum latency of each token of a request.

  • Avg: The average latency of each token of a request.

  • Min: The minimum latency of each token of a request.

  • TPxx: The percentile values for the latency of each token of a request.
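You can also approximate TTFT and TPOT on the client side by timing a streaming request, which is useful for cross-checking the console metrics. The sketch below treats each streamed SSE chunk as roughly one output token, which is an approximation; the URL, token, and model name are placeholders.

import time
import requests

# Placeholders: replace with your gateway endpoint, token, and model name.
URL = "<YOUR_GATEWAY_URL>/v1/chat/completions"
HEADERS = {"Authorization": "Bearer <YOUR_TOKEN>", "Content-Type": "application/json"}
BODY = {
    "model": "<model_name>",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}
start = time.monotonic()
chunk_times = []
with requests.post(URL, headers=HEADERS, json=BODY, stream=True, timeout=300) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        if line[len("data:"):].strip() == "[DONE]":
            break
        chunk_times.append(time.monotonic())
if not chunk_times:
    raise SystemExit("No streamed chunks received.")
ttft_ms = (chunk_times[0] - start) * 1000
# Approximate TPOT as the mean gap between consecutive streamed chunks.
gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
tpot_ms = sum(gaps) / len(gaps) * 1000 if gaps else 0.0
print(f"TTFT: {ttft_ms:.1f} ms, approximate TPOT: {tpot_ms:.1f} ms")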

Appendix: Performance test comparison

Tests on the Distill-Qwen-7B, QwQ-32B, and Qwen2.5-72B models show that the LLM intelligent router provides significant performance improvements in the speed and throughput of inference services. The test environment and results are as follows:

Important

The following test results are for reference only. Your actual performance may vary.

Test environment

| Configuration item | Distill-Qwen-7B | QwQ-32B | Qwen2.5-72B |
| --- | --- | --- | --- |
| Scheduling policy | prefix-cache | prefix-cache | prefix-cache |
| Test data (multi-turn conversation) | ShareGPT_V3_unfiltered_cleaned_split.json | ShareGPT_V3_unfiltered_cleaned_split.json | ShareGPT_V3_unfiltered_cleaned_split.json |
| Inference engine | vLLM (0.7.3) | vLLM (0.7.3) | vLLM (0.7.3) |
| Number of backend instances | 5 | 5 | 5 |
| Instance type | ml.gu8tf.8.40xlarge | ml.gu8tf.8.40xlarge | ml.gu7xf.8xlarge-gu108 |
| Number of concurrent requests | 500 | 100 | 100 |

Test results

Distill-Qwen-7B

| Metric | Without LLM intelligent router | With LLM intelligent router | Improvement |
| --- | --- | --- | --- |
| Successful requests | 3698 | 3612 | - |
| Benchmark duration | 460.79s | 435.70s | - |
| Total input tokens | 6605953 | 6426637 | - |
| Total generated tokens | 4898730 | 4750113 | - |
| Request throughput | 8.03 req/s | 8.29 req/s | +3.2% |
| Output token throughput | 10631.17 tok/s | 10902.30 tok/s | +2.5% |
| Total token throughput | 24967.33 tok/s | 25652.51 tok/s | +2.7% |
| Mean TTFT | 532.79 ms | 508.90 ms | +4.5% |
| Median TTFT | 274.23 ms | 246.30 ms | - |
| P99 TTFT | 3841.49 ms | 3526.62 ms | - |
| Mean TPOT | 40.65 ms | 39.20 ms | +3.5% |
| Median TPOT | 41.14 ms | 39.61 ms | - |
| P99 TPOT | 62.57 ms | 58.71 ms | - |

QwQ-32B

| Metric | Without LLM intelligent router | With LLM intelligent router | Improvement |
| --- | --- | --- | --- |
| Successful requests | 1194 | 1194 | - |
| Benchmark duration | 1418.54s | 1339.04s | - |
| Total input tokens | 2646701 | 2645010 | - |
| Total generated tokens | 1908956 | 1902894 | - |
| Request throughput | 0.84 req/s | 0.89 req/s | +5.95% |
| Output token throughput | 1345.72 tok/s | 1421.08 tok/s | +5.6% |
| Total token throughput | 3211.52 tok/s | 3396.38 tok/s | +5.8% |
| Mean TTFT | 1144.62 ms | 859.42 ms | +25.0% |
| Median TTFT | 749.39 ms | 565.61 ms | - |
| P99 TTFT | 5339.61 ms | 5027.39 ms | - |
| Mean TPOT | 68.78 ms | 65.73 ms | +4.4% |
| Median TPOT | 69.19 ms | 66.33 ms | - |
| P99 TPOT | 100.35 ms | 95.55 ms | - |

Qwen2.5-72B

| Metric | Without LLM intelligent router | With LLM intelligent router | Improvement |
| --- | --- | --- | --- |
| Successful requests | 1194 | 1194 | - |
| Benchmark duration | 479.53s | 456.69s | - |
| Total input tokens | 1336301 | 1337015 | - |
| Total generated tokens | 924856 | 925208 | - |
| Request throughput | 2.49 req/s | 2.61 req/s | +4.8% |
| Output token throughput | 1928.66 tok/s | 2025.92 tok/s | +5.0% |
| Total token throughput | 4715.34 tok/s | 4953.56 tok/s | +5.0% |
| Mean TTFT | 508.55 ms | 389.66 ms | +23.4% |
| Median TTFT | 325.33 ms | 190.04 ms | - |
| P99 TTFT | 2802.26 ms | 2678.70 ms | - |
| Mean TPOT | 46.83 ms | 43.97 ms | +4.4% |
| Median TPOT | 45.37 ms | 43.30 ms | - |
| P99 TPOT | 62.29 ms | 54.79 ms | - |