In large language model (LLM) applications, traditional load balancing strategies struggle to accurately gauge real-time backend load. This is because request and response lengths vary, the number of tokens generated during the Prompt and Generate phases is random, and GPU resource consumption is unpredictable. These factors lead to uneven instance loads, which affects system throughput and response efficiency. To address this issue, Elastic Algorithm Service (EAS) introduces the LLM intelligent router component. The router dynamically distributes requests based on LLM-specific metrics to balance the computing power and GPU memory of each inference instance. This improves cluster resource utilization and system stability.
How it works
Architecture overview
The LLM intelligent router consists of three core components that provide intelligent traffic distribution and management for the backend LLM inference instance cluster: LLM Gateway, LLM Scheduler, and LLM Agent.
LLM Gateway: Serves as the traffic entry point. It receives all user requests and forwards them to the designated backend inference instance based on decisions from the LLM Scheduler. The gateway supports HTTP (SSE) and WebSocket protocols and can cache requests when backend inference instances are under a high load.
LLM Scheduler: The brain of the intelligent router. It executes the scheduling algorithm by gathering real-time metrics from all LLM Agents. It then calculates the optimal target instance for each incoming request based on a predefined scheduling policy, such as prefix caching.
LLM Agent: Deployed as a sidecar container alongside each inference instance. The agent collects performance metrics from the inference engine, maintains a heartbeat with the LLM Scheduler, and reports the health status and load data of the instance.
Implementation flow
The LLM intelligent router is a special type of EAS service that must be deployed in the same service group as the inference service to function correctly. After you deploy the LLM intelligent router and the inference service, the router intelligently schedules requests to the backend inference service. The process is as follows:
Instance registration: After the inference service starts, the LLM Agent waits for the inference engine to be ready. It then registers the instance with the LLM Scheduler and periodically reports its health status and performance metrics.
Traffic ingress: User requests first arrive at the LLM Gateway. The gateway supports HTTP (SSE) and WebSocket protocols.
Scheduling request: The LLM Gateway sends a scheduling request to the LLM Scheduler.
Intelligent scheduling: The LLM Scheduler selects the optimal backend instance based on its scheduling policy and real-time metrics from each LLM Agent.
Request forwarding: The LLM Scheduler returns its decision to the LLM Gateway. The LLM Gateway then forwards the original user request to the target instance for inference.
Request buffering: If all backend instances are under a high load, the LLM Gateway temporarily queues new requests. These requests wait for the LLM Scheduler to find a suitable time to forward them, which prevents request failures.
Failover mechanism
The system is designed with multiple layers of fault tolerance to ensure service stability:
LLM Gateway: As a stateless traffic ingress layer, the `LLM Gateway` should be deployed with at least two instances. If one instance fails, traffic is automatically routed to other healthy instances. This ensures service continuity and high availability (HA).
LLM Scheduler: As the request scheduling component, the `LLM Scheduler` runs as a single instance to achieve global scheduling. If the `LLM Scheduler` fails, the `LLM Gateway` automatically enters a degraded mode after a heartbeat failure and reverts to a round-robin policy that forwards requests directly to backend instances. This keeps the service available but sacrifices scheduling performance. After the `LLM Scheduler` recovers, the `LLM Gateway` automatically switches back to smart routing mode.
Inference instance or LLM Agent: If an inference instance or its associated `LLM Agent` fails, the heartbeat between the `LLM Agent` and the `LLM Scheduler` is interrupted. The `LLM Scheduler` immediately removes the failed instance from the list of available instances and stops assigning new traffic to it. After the instance recovers and resumes sending heartbeats, it is automatically added back to the list.
Support for multiple inference engines
Different LLM inference engines expose different metrics through their /metrics interfaces. The `LLM Agent` collects these metrics, formats them into a unified structure, and reports them. This design allows the `LLM Scheduler` to focus on scheduling algorithms based on unified metrics without needing to know the implementation details of specific inference engines. The currently supported LLM inference engines and their corresponding collected metrics are as follows:
| LLM inference engine | Metric | Note |
| --- | --- | --- |
| Blade_LLM | decode_batch_size_mean | The number of running requests. |
| Blade_LLM | wait_queue_size_mean | The number of waiting requests in the queue. |
| Blade_LLM | block_usage_gpu_mean | The GPU KV Cache usage. |
| Blade_LLM | tps_total | The total number of tokens processed per second. |
| Blade_LLM | tps_out | The number of generated tokens per second. |
| vLLM | vllm:num_requests_running | The number of running requests. |
| vLLM | vllm:num_requests_waiting | The number of waiting requests in the queue. |
| vLLM | vllm:gpu_cache_usage_perc | The GPU KV Cache usage. |
| vLLM | vllm:prompt_tokens_total | The total number of prompt tokens. |
| vLLM | vllm:generation_tokens_total | The total number of generated tokens. |
| SGLang | sglang:num_running_reqs | The number of running requests. |
| SGLang | sglang:num_queue_reqs | The number of waiting requests in the queue. |
| SGLang | sglang:token_usage | The KV Cache usage. |
| SGLang | sglang:prompt_tokens_total | The total number of prompt tokens. |
| SGLang | sglang:gen_throughput | The number of generated tokens per second. |
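The following example is a hypothetical illustration of the unified structure described above. It shows how the vLLM metrics from the preceding table could be normalized into an engine-agnostic report that the `LLM Agent` sends to the `LLM Scheduler`. The field names and values are assumptions made for illustration and do not represent the actual internal reporting format.

```json
{
  "instance": "llm-service-instance-0",
  "engine": "vLLM",
  "running_requests": 12,
  "waiting_requests": 3,
  "kv_cache_usage": 0.64,
  "prompt_tokens_total": 183220,
  "generation_tokens_total": 95411
}
```

In this sketch, running_requests, waiting_requests, and kv_cache_usage map to vllm:num_requests_running, vllm:num_requests_waiting, and vllm:gpu_cache_usage_perc, while the token counters map to vllm:prompt_tokens_total and vllm:generation_tokens_total.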
Limits
Not supported for service updates: You can configure the LLM intelligent router only when you create a new service. You cannot add an intelligent router to an existing inference service using the Update Service operation.
Inference engine restrictions: Currently, only PAI-BladeLLM, vLLM, and SGLang are supported.
Multiple inference instances recommended: The LLM intelligent router is most effective when you deploy multiple inference instances.
Deploy services
Step 1: Deploy an LLM intelligent router service
The following two deployment methods are supported:
Method 1: Deploy a service in the console
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service, and then select Scenario-based Model Deployment > Deploy LLM gateway.
On the Deploy LLM gateway page, configure the following key parameters and click Deploy.
Basic Information

Service Name: Enter a custom service name, such as llm_gateway.

Deployment Resources: Resource configuration for the `LLM Gateway`. To ensure high availability, the Minimum Number of Instances defaults to 2. We recommend that you keep this setting. The default CPU is 4 cores, and the default Memory is 8 GB.

Scheduling Configuration: Resource configuration for the `LLM Scheduler`. The default CPU is 2 cores, and the default Memory is 4 GB.

Scheduling Policy: The system selects the optimal backend inference instance based on the scheduling policy. The following policies are supported:
Prefix Cache: A comprehensive KV Cache affinity scheduling policy that makes decisions based on multiple metrics. It forwards requests to instances that have cached the corresponding KV Cache for processing to maximize request processing efficiency. When you use this policy, you must enable the prefix caching feature of the engine.
LLM Metrics: Intelligently allocates service traffic based on various monitoring metrics of the LLM service to maximize resource utilization.
Minimum Requests: Preferentially assigns new requests to the instance with the fewest current requests.
Minimum Tokens: Preferentially assigns new requests to the instance that is currently processing the fewest tokens.
Static PD Disaggregation: In an LLM deployment that separates Prefill and Decode stages, select this policy to maximize scheduling efficiency. After you select this policy, you must set separate scheduling policies for Prefill and Decode.
Dynamic PD Disaggregation: Intelligently and dynamically switches between Prefill and Decode instances based on key metrics such as actual service load, KV Cache, and GPU utilization. In an LLM deployment that uses dynamic PD disaggregation, select this policy to maximize scheduling efficiency.
After a successful deployment, the system automatically creates a group service whose name uses the format group_<LLM intelligent router service name>. You can view this service on the Phased Release tab of the Elastic Algorithm Service (EAS) page.
The intelligent router and the service queue conflict with each other. They cannot coexist in the same service group.
Method 2: Deploy a service using JSON
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click On-Premises Deployment.
The following sections provide a JSON configuration example and parameter descriptions. After you complete the configuration, click Deploy.
Configuration example:
Basic configuration
You can use the basic configuration to quickly deploy an LLM intelligent router service. Parameters that are not configured use their default values.
LLM Scheduler is configured with 4 CPU cores and 4 GB of memory by default. The default scheduling policy is Prefix Cache.
```json
{
  "metadata": {
    "type": "LLMGatewayService",
    "cpu": 4,
    "group": "group_llm_gateway",
    "instance": 2,
    "memory": 8000,
    "name": "llm_gateway"
  }
}
```
Advanced configuration
```json
{
  "cloud": {
    "computing": {
      "instance_type": "ecs.c7.large"
    }
  },
  "llm_gateway": {
    "max_queue_size": 128,
    "retry_count": 2,
    "wait_schedule_timeout": 5000,
    "wait_schedule_try_period": 500
  },
  "llm_scheduler": {
    "cpu": 2,
    "memory": 4000,
    "policy": "prefix-cache"
  },
  "metadata": {
    "group": "group_llm_gateway",
    "instance": 2,
    "name": "llm_gateway",
    "type": "LLMGatewayService"
  }
}
```
Parameter descriptions:
metadata

type: Set this parameter to LLMGatewayService to deploy an LLM intelligent router service. After the service is deployed, EAS automatically creates a composite service that contains an `LLM Gateway` and an `LLM Scheduler`.

instance: The number of `LLM Gateway` instances. We recommend that you set this parameter to 2 or more to prevent a single point of failure.

cpu: The CPU of the `LLM Gateway`.

memory: The memory of the `LLM Gateway`.

group: The service group to which the LLM intelligent router service belongs.

cloud.computing.instance_type

Specifies the resource specification used by the `LLM Gateway`. If you configure this parameter, you do not need to configure metadata.cpu and metadata.memory.

llm_gateway

max_queue_size: The maximum length of the `LLM Gateway` cache queue. Default value: 512. If the number of requests exceeds the processing capacity of the backend inference framework, the excess requests are cached in this queue to wait for scheduling.

retry_count: The number of retries. Default value: 2. If a backend inference instance is abnormal, the request is retried and forwarded to a new instance.

wait_schedule_timeout: The timeout period for scheduling a request. Default value: 10 seconds. If the backend engine is fully loaded, the request is periodically retried for scheduling.

wait_schedule_try_period: The interval between scheduling retries. Default value: 1 second.

llm_scheduler

cpu: The CPU of the `LLM Scheduler`. Default value: 4 cores.

memory: The memory of the `LLM Scheduler`. Default value: 4 GB.

policy: The scheduling policy. For more information, see the descriptions in the console-based deployment method. Default value: prefix-cache. Valid values:

prefix-cache: Prefix Cache.

llm-metric-based: LLM Metrics.

least-request: Minimum Requests.

least-token: Minimum Tokens.

pd-split: Static PD Disaggregation.

dynamic-pd-split: Dynamic PD Disaggregation.

prefill_policy and decode_policy: If you set policy to pd-split, you must set separate scheduling policies for the Prefill and Decode stages. Valid values: prefix-cache, llm-metric-based, least-request, and least-token.
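For example, a deployment that uses Static PD Disaggregation might combine these parameters as shown below. This is a minimal sketch assembled from the parameters described above; the choice of prefix-cache for the Prefill stage and least-request for the Decode stage is illustrative only, and you should select the policies that best match your workload.

```json
{
  "llm_scheduler": {
    "cpu": 2,
    "memory": 4000,
    "policy": "pd-split",
    "prefill_policy": "prefix-cache",
    "decode_policy": "least-request"
  },
  "metadata": {
    "type": "LLMGatewayService",
    "name": "llm_gateway",
    "group": "group_llm_gateway",
    "instance": 2,
    "cpu": 4,
    "memory": 8000
  }
}
```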
Step 2: Deploy a large language model (LLM) service
When you deploy an LLM service, you must associate it with the intelligent router service that you created. This topic uses the EAS scenario-based deployment of an LLM as an example. The procedure is as follows:
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Inference Service page, click Deploy Service, and then select LLM Deployment. In the Features section, find and turn on the LLM Intelligent Router switch. From the drop-down list, select the LLM intelligent router service that you deployed in Step 1. For more information about the other parameters, see LLM Deployment.
Note: When you use vLLM for accelerated deployment and the Scheduling Policy of the LLM intelligent router service is set to Prefix Cache, you must enable the prefix caching feature of the engine.
After you configure the parameters, click Deploy.
Invoke and test the service
Send all requests to the endpoint of the LLM intelligent router service, not to the specific backend inference service.
Obtain access credentials
On the Elastic Algorithm Service (EAS) page, find the deployed LLM intelligent router service.
Click the service name to go to the Overview page. In the Basic Information section, click View Endpoint Information.
On the Invocation Method page, copy the Internet Endpoint and Token from the Service-specific Traffic Entry section.

Construct and send a request
The final request URL is a combination of the endpoint of the LLM intelligent router and the API path of the model service.
URL structure:
<LLM intelligent router endpoint>/<LLM service API path>

Example:
http://********.pai-eas.aliyuncs.com/api/predict/group_llm_gateway.llm_gateway/v1/chat/completions
Request example:
```bash
# Replace <YOUR_GATEWAY_URL> and <YOUR_TOKEN> with your actual information.
# Replace <model_name> with the actual model name.
curl -X POST "<YOUR_GATEWAY_URL>/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
        "model": "<model_name>",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": true
      }'
```
The following is an example of the returned result:
```
data: {"id":"chatcmpl-9a9f8299*****","object":"chat.completion.chunk","created":1762245102,"model":"Qwen3-8B","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9a9f8299*****","object":"chat.completion.chunk","created":1762245102,"model":"Qwen3-8B","choices":[{"index":0,"delta":{"content":"<think>","tool_calls":[]}}]}

...

data: [DONE]
```
View service monitoring metrics
After you deploy the service, you can view its core performance metrics in the EAS console to evaluate the effectiveness of the intelligent router.
On the Elastic Algorithm Service (EAS) page, click the deployed LLM intelligent router service to go to the service details page. On the Monitoring tab, you can view the following core metrics:
Token Throughput: The throughput of LLM input and output tokens.
GPU Cache Usage: The GPU KV Cache usage of the LLM Engine.
Engine Current Requests: The number of concurrent real-time requests for the LLM Engine.
Gateway Current Requests: The number of real-time requests for the LLM intelligent router.
Time To First Token: The latency of the first token of a request.
Time Per Output Token: The latency of each output token of a request.





