This topic explains how to use Service Mesh (ASM) to enhance load balancing and traffic management for LLM inference services deployed in Kubernetes clusters. Due to the unique characteristics of LLM inference traffic and workloads, traditional load balancing methods may fall short. This topic walks you through the steps to define service pools and routing for vLLM inference services to improve performance and gain insight into inference traffic.
Reading tips
Before you begin, ensure you are familiar with:
To utilize GPU computing capabilities within an ACS cluster, refer to Specify GPU models and driver versions for ACS GPU-accelerated pods for detailed instructions.
To create and use GPU node pools in an ACK cluster or to utilize ACS computing power, refer to Create a GPU node pool or Use the computing power of ACS in ACK Pro clusters for detailed instructions.
By reading this topic, you will learn about:
The background of large language models and vLLM.
Challenges in managing LLM inference services in a cluster using conventional methods.
The concepts and practical steps for managing LLM inference services in a cluster using ASM.
Background
Large language model (LLM)
Large language models (LLMs) are neural network-based language models with billions of parameters, exemplified by GPT, Qwen, and Llama. These models are trained on diverse and extensive pre-training datasets, including web text, professional literature, and code, and are primarily used for text generation tasks such as completion and dialogue.
To leverage LLMs for building applications, you can:
Utilize external LLM API services from platforms like OpenAI, Alibaba Cloud Model Studio, or Moonshot.
Build your own LLM inference services using open-source or proprietary models and frameworks such as vLLM, and deploy them in a Kubernetes cluster. This approach is suitable for scenarios requiring control over the inference service or high customization of LLM inference capabilities.
vLLM
vLLM is a framework designed for efficient and user-friendly construction of LLM inference services. It supports various large language models, including Qwen, and optimizes LLM inference efficiency through techniques like PagedAttention, dynamic batch inference (Continuous Batching), and model quantization.
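As a quick illustration of what this topic deploys, the following sketch shows how a vLLM OpenAI-compatible server is typically started and queried outside Kubernetes. The model path /model/llama2 and port 8000 mirror the deployment used later in this topic and are placeholders; adjust them to your environment.

# Minimal sketch: start a vLLM OpenAI-compatible server (assumes vLLM is installed
# and a model is available at the placeholder path /model/llama2).
python3 -m vllm.entrypoints.openai.api_server \
  --model /model/llama2 \
  --port 8000 \
  --gpu_memory_utilization 0.8

# Query it with the OpenAI-style completions API. Without --served-model-name,
# the model name in requests is the value passed to --model.
curl http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/model/llama2", "prompt": "Hello", "max_tokens": 16}'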
Load balancing and observability
ASM facilitates the management of LLM inference service traffic within a cluster. When deploying an LLM inference service, you can declare the workloads providing the service and the model names through the InferencePool and InferenceModel Custom Resource Definitions (CRDs). ASM then provides load balancing, traffic routing, and observability for the declared LLM inference backends.
Currently, only LLM inference services based on vLLM are supported.
Traditional load balancing: Classic load balancing algorithms distribute HTTP requests evenly across workloads. However, for LLM inference services, the load that each request imposes on the backend is unpredictable, because the inference process consists of two phases, prefill and decode, whose cost varies from request to request.
LLM load balancing: ASM offers a load balancing algorithm tailored for LLM backends. It evaluates each inference server's internal state using multi-dimensional metrics, such as the request queue length and GPU KV cache utilization, and balances the workload across servers. This approach outperforms traditional algorithms by keeping GPU load consistent across inference servers, reducing the time to first token (TTFT) of LLM requests, and increasing throughput.
Traditional observability: LLM inference services typically use the OpenAI request API format, in which most request metadata, such as the model name and maximum token count, is carried in the request body. Traditional routing and observability capabilities are based on request headers and paths; they do not parse the request body and therefore cannot route or observe traffic by model name or token count.
Inference traffic observability: ASM adds LLM-specific fields to access logs and monitoring metrics for inference requests.
Prerequisites
Ensure you have either created an ACK managed cluster with a GPU node pool or selected an ACS cluster in a recommended zone for GPU computing power. For more information, see Create an ACK managed cluster and Create an ACS cluster.
You can install the ACK Virtual Node component in your ACK managed cluster to utilize ACS GPU computing capabilities. For more information, see ACS GPU computing power in ACK.
A cluster is added to the ASM instance of v1.24 or later. For more information, see Add a cluster to an ASM instance.
An ingress gateway is created with the HTTP service enabled on port 8080. For more information, see Create an ingress gateway.
(Optional) A Sidecar is injected into the default namespace. For more information, see Enable automatic sidecar proxy injection.
Note: You may skip sidecar injection if you do not want to try the observability operations in this topic.
Best practices
The following example demonstrates how to manage LLM inference service traffic in a cluster using ASM by deploying a Llama2 large model based on vLLM.
Step 1: Deploy a sample inference service
Create a file named vllm-service.yaml with the following content.
Note: The image discussed in this topic requires a GPU with more than 16 GiB of video memory. The T4 card type, which has only 16 GiB of video memory, does not provide sufficient resources to launch this application. It is recommended to use the A10 card type for ACK clusters and an 8th-generation GPU for ACS clusters. For detailed model information, please submit a ticket for further assistance.
Due to the large size of the LLM image, it is advisable to pre-store it in ACR and use the internal network address for pulling. Pulling directly from the public network may result in long wait times depending on the cluster EIP bandwidth configuration.
ACS cluster
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
    {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
    {% set messages = messages[1:] %}
    {% else %}
    {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
    {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if loop.index0 == 0 %}
    {% set content = system_message + message['content'] %}
    {% else %}
    {% set content = message['content'] %}
    {% endif %}
    {% if message['role'] == 'user' %}
    {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
    {% elif message['role'] == 'assistant' %}
    {{ ' ' + content | trim + ' ' + eos_token }}
    {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
        alibabacloud.com/compute-class: gpu # Specify GPU computing power.
        alibabacloud.com/compute-qos: default
        alibabacloud.com/gpu-model-series: "example-model" # Specify the GPU model as example-model. Fill in according to the actual situation.
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/model/llama2"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - '--gpu_memory_utilization'
            - '0.8'
            - "--enable-lora"
            - "--max-loras"
            - "4"
            - "--max-cpu-loras"
            - "12"
            - "--lora-modules"
            - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
            - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
            - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
            - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
            - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
            - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
            - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
            - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
            - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
            - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
            - '--chat-template'
            - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 16
              memory: 64Gi
              nvidia.com/gpu: 1
            requests:
              cpu: 8
              memory: 30Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template
ACK cluster
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
    {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
    {% set messages = messages[1:] %}
    {% else %}
    {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
    {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if loop.index0 == 0 %}
    {% set content = system_message + message['content'] %}
    {% else %}
    {% set content = message['content'] %}
    {% endif %}
    {% if message['role'] == 'user' %}
    {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
    {% elif message['role'] == 'assistant' %}
    {{ ' ' + content | trim + ' ' + eos_token }}
    {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/model/llama2"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - '--gpu_memory_utilization'
            - '0.8'
            - "--enable-lora"
            - "--max-loras"
            - "4"
            - "--max-cpu-loras"
            - "12"
            - "--lora-modules"
            - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
            - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
            - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
            - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
            - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
            - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
            - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
            - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
            - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
            - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
            - '--chat-template'
            - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template
Deploy the LLM inference service using the kubeconfig file of the cluster on the data plane.
kubectl apply -f vllm-service.yaml
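If you want to confirm that the inference pods have started before moving on, you can check their status and logs with standard kubectl commands. The label and deployment name below come from the manifest above; note that pulling the image and loading the model can take several minutes.

# Check that the three replicas become Ready (readiness is probed on /health).
kubectl get pods -l app=vllm-llama2-7b-pool
# Inspect startup logs of the vLLM server.
kubectl logs deploy/vllm-llama2-7b-pool --tail=20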
Step 2: Configure ASM gateway rules
Deploy gateway rules to enable port 8080 listening on the ASM gateway.
Create a file named gateway.yaml with the following content.
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: llm-inference-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - '*'
      port:
        name: http-service
        number: 8080
        protocol: HTTP
Create a gateway rule.
kubectl apply -f gateway.yaml
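Optionally, confirm that the gateway rule exists and look up the external IP address of the ASM ingress gateway, which you will use as ${ASM Gateway IP} in Step 4. The Service name and namespace below (istio-ingressgateway in istio-system) are a common default; adjust them to match how your ingress gateway was created.

kubectl get gateway llm-inference-gateway -n default
# The EXTERNAL-IP column of the ingress gateway Service is used in Step 4.
kubectl get service istio-ingressgateway -n istio-system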
Step 3: Configure routing and load balancing for LLM inference service
If you want to compare the performance of traditional load balancing with LLM load balancing, complete the steps in (Optional) Compare performance with traditional load balancing using an observability dashboard before proceeding with the following operations.
Enable routing for LLM inference service using the kubeconfig file of ASM.
kubectl patch asmmeshconfig default --type=merge --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'
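If you want to confirm that the patch took effect before continuing, you can read back the field you just set (run this against the ASM kubeconfig; the expected output is true):

kubectl get asmmeshconfig default -o jsonpath='{.spec.gatewayAPIInferenceExtension.enabled}'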
Deploy the InferencePool resource.
The InferencePool resource uses a label selector to define the workloads that provide a set of LLM inference services in the cluster. ASM applies load balancing for vLLM backends based on the created InferencePool.
Create a file named inferencepool.yaml with the following content.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
The following table explains some configuration items:
Configuration item
Description
.spec.targetPortNumber
The port exposed by the Pod that provides inference services.
.spec.selector
The label of the Pods that provide the inference service. The label key must be app, and the value must match the name of the corresponding Service.
Create an InferencePool resource using the kubeconfig file of the cluster on the data plane.
kubectl apply -f inferencepool.yaml
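Optionally, verify that the resource was created and that its selector and port match the deployment from Step 1:

kubectl get inferencepool vllm-llama2-7b-pool -o yaml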
Deploy the InferenceModel resource.
The InferenceModel specifies the traffic distribution policy for specific models within the InferencePool resource.
Create a file named inferencemodel.yaml using the following content.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
    - name: tweet-summary
      weight: 100
The following table explains some configuration items:
Configuration Item
Description
.spec.modelName
Used to match the model parameter in the request.
.spec.targetModels
Configures traffic routing rules. In the above example, requests that specify model: tweet-summary in the request body are 100% routed to the Pods serving the tweet-summary model.
Create an InferenceModel resource.
kubectl apply -f inferencemodel.yaml
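The weight field in .spec.targetModels can also split traffic across several served models, for example the LoRA adapters registered in the sample deployment. The following is a hypothetical sketch (the resource name is made up for illustration) that keeps 90% of tweet-summary requests on the original adapter and sends 10% to the tweet-summary-1 adapter:

apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-canary-example   # hypothetical name, for illustration only
spec:
  modelName: tweet-summary
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
    - name: tweet-summary       # 90% of matching requests use the original adapter
      weight: 90
    - name: tweet-summary-1     # 10% of matching requests use the canary adapter
      weight: 10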
Create the LLMRoute resource.
Set up routing rules for the gateway to forward all requests received on port 8080 to the sample LLM inference service by referencing the InferencePool resource.
Create a file named llmroute.yaml with the following content.
apiVersion: istio.alibabacloud.com/v1
kind: LLMRoute
metadata:
  name: test-llm-route
spec:
  gateways:
    - llm-inference-gateway
  host: test.com
  rules:
    - backendRefs:
        - backendRef:
            group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
Deploy the LLMRoute resource.
kubectl apply -f llmroute.yaml
Step 4: Verify
Run the following command multiple times to perform a test.
curl -H "host: test.com" ${ASM Gateway IP}:8080/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}' -v
Expected output:
{"id":"cmpl-2fc9a351-d866-422b-b561-874a30843a6b","object":"text_completion","created":1736933141,"model":"tweet-summary","choices":[{"index":0,"text":", I'm a newbie to this forum. Write a summary of the article.\nWrite a summary of the article.\nWrite a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":2,"total_tokens":102,"completion_tokens":100,"prompt_tokens_details":null}}(Optional) Step 5: Configure observability metrics and dashboard for LLM services
After declaring LLM inference services in the cluster using InferencePool and InferenceModel resources and setting up routing policies, you can observe the LLM inference services through logs and monitoring metrics.
Enable the LLM traffic observability feature in the ASM console to collect monitoring metrics.
Improve the observability of LLM inference requests by incorporating additional log fields, metrics, and metric dimensions. For detailed configuration instructions, see Traffic observation: Efficiently manage LLM traffic using ASM.
After the configuration is complete, a model dimension is added to the ASM monitoring metrics. You can collect these metrics by using Prometheus within the observability monitoring framework or by integrating a self-managed Prometheus instance for service mesh monitoring.
ASM also introduces two new metrics: asm_llm_proxy_prompt_tokens, the number of input tokens, and asm_llm_proxy_completion_tokens, the number of output tokens of all requests. You can collect these metrics by adding the following rules to Prometheus. For instructions, see Other Prometheus service discovery configurations.
scrape_configs:
  - job_name: asm-envoy-stats-llm
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /stats/prometheus
    scheme: http
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: .*-envoy-prom
      - source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:15090
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels:
          - __meta_kubernetes_namespace
        action: replace
        target_label: namespace
      - source_labels:
          - __meta_kubernetes_pod_name
        action: replace
        target_label: pod_name
    metric_relabel_configs:
      - action: keep
        source_labels:
          - __name__
        regex: asm_llm_.*
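Once these metrics are being scraped, you can check them with a couple of ad-hoc queries against the Prometheus HTTP API. The address below is a placeholder that you need to replace with your own Prometheus endpoint, and the queries assume the token metrics are exposed as counters:

# Placeholder address: replace with the endpoint of your Prometheus instance.
PROM=http://<your-prometheus-address>:9090
# Input token throughput per model over the last minute.
curl -s "$PROM/api/v1/query" --data-urlencode 'query=sum by (model) (rate(asm_llm_proxy_prompt_tokens[1m]))'
# Output token throughput per model over the last minute.
curl -s "$PROM/api/v1/query" --data-urlencode 'query=sum by (model) (rate(asm_llm_proxy_completion_tokens[1m]))'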
Collect monitoring metrics for vLLM service.
The ASM monitoring metrics above mainly cover the throughput of external LLM inference requests. To also monitor the internal state of the vLLM service, add the following Prometheus collector annotations to the vLLM pods so that the metrics exposed by the vLLM service are collected.
...
annotations:
  prometheus.io/path: /metrics  # The HTTP path on which the metrics are exposed.
  prometheus.io/port: "8000"    # The port on which the metrics are exposed, which is the listening port of the vLLM server.
  prometheus.io/scrape: "true"  # Whether to scrape the metrics of the current pod.
...
Retrieve the metrics of the vLLM service using Prometheus's default service discovery mechanism. For detailed instructions, see Default service discovery.
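To quickly confirm that the vLLM server exposes these metrics, you can port-forward one of the pods and read its /metrics endpoint directly. This is just a spot check and is not required for the Prometheus setup:

# In one terminal, forward the vLLM listening port.
kubectl port-forward deploy/vllm-llama2-7b-pool 8000:8000
# In a second terminal, list the vLLM metrics.
curl -s http://127.0.0.1:8000/metrics | grep '^vllm:'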
Key metrics from the vLLM service provide insight into the internal state of the vLLM workload.
Metric name
Description
vllm:gpu_cache_usage_perc
The GPU KV cache usage of vLLM, as a percentage. When vLLM starts, it pre-allocates as much GPU memory as possible for the KV cache. The lower the cache utilization, the more capacity the vLLM server has to accept new requests.
vllm:request_queue_time_seconds_sum
The total time requests spend in the waiting queue. After an LLM inference request arrives at the vLLM server, it may not be processed immediately; it has to wait for the vLLM scheduler to schedule its prefill and decode phases.
vllm:num_requests_running
vllm:num_requests_waiting
vllm:num_requests_swapped
The number of requests that are running inference, waiting, or swapped to host memory, respectively. These metrics can be used to evaluate the current request pressure on the vLLM service.
vllm:avg_generation_throughput_toks_per_s
vllm:avg_prompt_throughput_toks_per_s
The number of tokens generated per second by the decode stage (generation throughput) and processed per second by the prefill stage (prompt throughput), respectively.
vllm:time_to_first_token_seconds_bucket
The distribution of the latency from when a request is sent to the vLLM service until the first token is returned (time to first token, TTFT). This metric represents how long the client waits for the first response after sending a request and is an important factor in the LLM user experience.
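As an illustration, the following queries summarize some of the metrics in the table through the Prometheus HTTP API. They are sketches only: the Prometheus address is a placeholder, and the aggregation label (pod_name here) depends on your service discovery and relabeling configuration.

PROM=http://<your-prometheus-address>:9090   # placeholder, as above
# Average GPU KV cache utilization per pod.
curl -s "$PROM/api/v1/query" --data-urlencode 'query=avg by (pod_name) (vllm:gpu_cache_usage_perc)'
# 90th percentile time to first token over the last 5 minutes.
curl -s "$PROM/api/v1/query" --data-urlencode 'query=histogram_quantile(0.9, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))'
# Requests currently running, waiting, and swapped.
curl -s "$PROM/api/v1/query" --data-urlencode 'query=sum(vllm:num_requests_running)'
curl -s "$PROM/api/v1/query" --data-urlencode 'query=sum(vllm:num_requests_waiting)'
curl -s "$PROM/api/v1/query" --data-urlencode 'query=sum(vllm:num_requests_swapped)'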
Configure a Grafana dashboard to monitor LLM inference services.
Observe LLM inference services deployed with vLLM through the Grafana dashboard:
Monitor the request rate and token throughput using ASM monitoring metrics;
Assess the internal state of the workloads for the LLM inference services with vLLM monitoring metrics.
Create a data source (Prometheus instance) in the Grafana console, and make sure that the ASM and vLLM monitoring metrics have been collected by that Prometheus instance.
To create an observability dashboard for LLM inference services, import the content provided below into Grafana.
You can see the dashboard similar to the following:

(Optional) Compare performance with traditional load balancing using an observability dashboard
Using the observability dashboard, you can directly compare LLM load balancing with traditional load balancing algorithms on metrics such as cache utilization, request queue time, token throughput, and TTFT.
Once you have completed Step 3, you can run the following command to purge the resources you created.
kubectl delete inferencemodel --all
kubectl delete inferencepool --all
kubectl delete llmroute --all
Otherwise, after completing steps 1 and 2, make sure you delete any virtual services you create here before proceeding to Step 3.
Create a virtual service to provide routing and traditional load balancing for the sample LLM inference service by executing the following command.
kubectl apply -f- <<EOF
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: llm-vs
  namespace: default
spec:
  gateways:
    - default/llm-inference-gateway
  hosts:
    - '*'
  http:
    - name: any-host
      route:
        - destination:
            host: vllm-llama2-7b-pool.default.svc.cluster.local
            port:
              number: 8000
EOF
Perform stress testing on the LLM inference service using the llmperf tool.
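For reference, a stress test with llmperf might look like the following sketch. It assumes that test.com resolves to the ASM gateway IP (for example, through an /etc/hosts entry) so that the required Host header is sent, and that you run the token benchmark script from the llmperf repository; check the flags against the llmperf version you install.

# Point llmperf's OpenAI-compatible client at the ASM gateway (host test.com, port 8080).
export OPENAI_API_BASE="http://test.com:8080/v1"
export OPENAI_API_KEY="dummy"   # the sample vLLM setup does not validate the key
python token_benchmark_ray.py \
  --model "tweet-summary" \
  --llm-api openai \
  --num-concurrent-requests 10 \
  --max-num-completed-requests 300 \
  --mean-input-tokens 200 --stddev-input-tokens 50 \
  --mean-output-tokens 100 --stddev-output-tokens 20

Run the same test once with the VirtualService above (traditional load balancing) and once with the Step 3 configuration (LLM load balancing) so that the dashboard comparison in the next step is meaningful.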
Analyze the two routing and load balancing policies through the Grafana dashboard.
The comparison shows that LLM inference service load balancing provides better latency, throughput, and cache utilization.
