When you run multiple LoRA (Low-Rank Adaptation) adapters on a shared base model, you need a way to control how inference traffic is distributed across them. Service Mesh (ASM) lets you assign weights to each LoRA adapter through an InferenceModel resource, so you can run canary releases and A/B tests: gradually shift traffic from one adapter version to another and validate results before a full rollout.
How it works
LoRA and Multi-LoRA
LoRA is a widely adopted technique for fine-tuning large language models (LLMs) cost-effectively. Instead of retraining the entire model, LoRA adds lightweight adapter layers that load alongside the base model at inference time.
Multi-LoRA extends this approach: multiple LoRA adapters share a single base model and GPU, each serving a different fine-tuned variant. The vLLM platform supports loading and serving multiple LoRA adapters simultaneously, routing requests by the model field in each API call.
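For example, a client targets a specific adapter simply by setting the model field of an OpenAI-compatible request. A minimal sketch, using the `sql-lora` adapter name from the deployment later in this guide (the helper function itself is illustrative, not part of any SDK):

```python
import json

def build_completion_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build an OpenAI-compatible /v1/completions payload.

    vLLM routes the request to the LoRA adapter whose registered
    name matches the `model` field; an unregistered name falls
    back to an error rather than the base model.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }

# Target the sql-lora adapter rather than the base Llama-2 model.
payload = build_completion_request("sql-lora", "List the top 5 customers by revenue")
print(json.dumps(payload, indent=2))
```

POSTing this body to the vLLM server's `/v1/completions` endpoint (port 8000 in the deployment below) selects the adapter per request, with no server restart.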
Traffic distribution through ASM
In a Multi-LoRA deployment, ASM routes inference requests to different LoRA adapters based on the model name in each request. You define an InferenceModel resource that maps a virtual model name to a set of target adapters with traffic weights. ASM then distributes incoming requests according to those weights.
A typical canary release workflow looks like this:
1. Route 100% of traffic to the current adapter version.
2. Deploy the new adapter version and shift 10% of traffic to it.
3. Monitor metrics. If the new version performs well, gradually increase its traffic share.
4. After validation, route 100% of traffic to the new version.
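The weighted split behind steps 2 and 3 can be sanity-checked offline. The following is a hypothetical simulation of weight-proportional routing (ASM's actual load-balancing algorithm may differ; this only illustrates the traffic share you should expect, and the adapter names are placeholders):

```python
import random
from collections import Counter

def pick_adapter(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one adapter, with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Step 2 of the canary: 90% to the current version, 10% to the new one.
weights = {"tweet-summary": 90, "tweet-summary-canary": 10}
rng = random.Random(0)  # seeded for reproducibility
counts = Counter(pick_adapter(weights, rng) for _ in range(10_000))

share = counts["tweet-summary-canary"] / 10_000
print(f"canary share: {share:.1%}")  # close to 10%
```

Because the split is probabilistic, short test runs will deviate from the configured ratio; judge the canary on aggregate metrics over many requests.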
Prerequisites
Before you begin, make sure that you have:
An ACK managed cluster with a GPU node pool, or an ACS cluster in a zone that supports GPU computing power. For more information, see Create an ACK managed cluster and Create an ACS cluster. To use ACS GPU resources from an ACK cluster, install the ACK Virtual Node component. For more information, see Use ACS computing power in an ACK cluster.
An ASM instance (v1.24 or later) with the cluster added. For more information, see Add a cluster to an ASM instance.
An ingress gateway with an HTTP service on port 8080. For more information, see Create an ingress gateway.
(Optional) Sidecar proxy injection enabled in the default namespace, required only for observability. For more information, see Enable automatic sidecar proxy injection.
Deploy the vLLM inference service
This step deploys a Llama-2-7b base model on vLLM with 10 LoRA adapters: sql-lora through sql-lora-4 (5 SQL-focused adapters) and tweet-summary through tweet-summary-4 (5 summarization adapters).
Create a file named `vllm-service.yaml` with the following content.

The YAML defines three resources:

| Resource | Purpose |
|---|---|
| Service (`vllm-llama2-7b-pool`) | Exposes the vLLM server on port 8000 as a ClusterIP service |
| ConfigMap (`chat-template`) | Provides the Llama-2 chat prompt template |
| Deployment (`vllm-llama2-7b-pool`) | Runs 3 replicas of the vLLM server with the base model and all 10 LoRA adapters loaded |

Key vLLM parameters:

| Parameter | Value | Description |
|---|---|---|
| `--enable-lora` | - | Enables LoRA adapter support |
| `--max-loras` | 10 | Maximum number of LoRA adapters loaded in GPU memory simultaneously |
| `--max-cpu-loras` | 12 | Maximum number of LoRA adapters stored in CPU memory |
| `--gpu_memory_utilization` | 0.8 | Fraction of GPU memory allocated for the KV cache |
| `--lora-modules` | `<name>=<path>` | Maps adapter names to their weight file paths |

ACK cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
    {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
    {% set messages = messages[1:] %}
    {% else %}
    {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
    {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if loop.index0 == 0 %}
    {% set content = system_message + message['content'] %}
    {% else %}
    {% set content = message['content'] %}
    {% endif %}
    {% if message['role'] == 'user' %}
    {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
    {% elif message['role'] == 'assistant' %}
    {{ ' ' + content | trim + ' ' + eos_token }}
    {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/model/llama2"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - '--gpu_memory_utilization'
            - '0.8'
            - "--enable-lora"
            - "--max-loras"
            - "10"
            - "--max-cpu-loras"
            - "12"
            - "--lora-modules"
            - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
            - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
            - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
            - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
            - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
            - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
            - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
            - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
            - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
            - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
            - '--chat-template'
            - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 16
              memory: 64Gi
              nvidia.com/gpu: 1
            requests:
              cpu: 8
              memory: 30Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template
```

ACS cluster: the YAML is identical except that the pod template adds the following labels to request ACS GPU computing power:

```yaml
      labels:
        app: vllm-llama2-7b-pool
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/compute-qos: default
        alibabacloud.com/gpu-model-series: "example-model" # Replace with your actual GPU model series
```

Deploy the service using the data plane cluster kubeconfig:

```shell
kubectl apply -f vllm-service.yaml
```

Verify that all pods are running:

```shell
kubectl get pods -l app=vllm-llama2-7b-pool
```

Wait until all 3 replicas show `Running` status and `READY` is `1/1` (or `2/2` if sidecar injection is enabled).
Configure ASM gateway rules
Set up the ASM ingress gateway to listen on port 8080 for HTTP traffic.
Create a file named `gateway.yaml`:

```yaml
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: llm-inference-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - '*'
      port:
        name: http-service
        number: 8080
        protocol: HTTP
```

Apply the gateway rule using the ASM kubeconfig:

```shell
kubectl apply -f gateway.yaml
```
Configure routing and traffic distribution
This step creates three resources that work together to route and distribute inference traffic:
| Resource | Role |
|---|---|
| InferencePool | Selects the vLLM pods that serve inference requests |
| InferenceModel | Defines which LoRA adapters receive traffic and at what weights |
| LLMRoute | Connects the ingress gateway to the InferencePool |
Enable the Gateway API inference extension
Run the following command using the ASM kubeconfig:
```shell
kubectl patch asmmeshconfig default --type=merge \
  --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'
```

Create the InferencePool
The InferencePool selects vLLM pods by label and enables ASM to perform inference-aware load balancing across them.
Create a file named `inferencepool.yaml`:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
```

Apply the resource using the data plane cluster kubeconfig:

```shell
kubectl apply -f inferencepool.yaml
```
Create the InferenceModel
The InferenceModel maps a virtual model name to a set of LoRA adapters with traffic weights. When a request specifies model: "lora-request", ASM distributes it to one of the target adapters based on the configured weights.
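Because each targetModels entry carries its own weight, a per-group traffic share must be divided across the adapters in that group. The following illustrative helper (its names and structure are this guide's assumptions, not an ASM API) expands group percentages into per-adapter weights:

```python
def expand_group_weights(groups: dict[str, tuple[list[str], int]]) -> dict[str, int]:
    """Split each group's traffic percentage evenly across its adapters.

    `groups` maps a group label to (adapter names, traffic percent).
    Weights are relative, so reusing the raw percentage works as long
    as every group is expanded the same way.
    """
    weights = {}
    for adapters, percent in groups.values():
        per_adapter = percent // len(adapters)
        for name in adapters:
            weights[name] = per_adapter
    return weights

# The adapter names used by the deployment in this guide.
sql = [f"sql-lora-{i}" if i else "sql-lora" for i in range(5)]
tweet = [f"tweet-summary-{i}" if i else "tweet-summary" for i in range(5)]

# A 90/10 canary split between the two groups.
w = expand_group_weights({"sql": (sql, 90), "tweet": (tweet, 10)})
print(w["sql-lora"], w["tweet-summary"])  # 18 2
```

These are exactly the per-adapter weights (18 and 2) used in the canary example below; generating them programmatically avoids arithmetic slips when the adapter count changes.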
Create a file named `inferencemodel.yaml`. In this example, all 10 adapters have equal weight (10), so traffic splits evenly: 50% to the `tweet-summary` group and 50% to the `sql-lora` group.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: lora-request
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
    - name: tweet-summary
      weight: 10
    - name: tweet-summary-1
      weight: 10
    - name: tweet-summary-2
      weight: 10
    - name: tweet-summary-3
      weight: 10
    - name: tweet-summary-4
      weight: 10
    - name: sql-lora
      weight: 10
    - name: sql-lora-1
      weight: 10
    - name: sql-lora-2
      weight: 10
    - name: sql-lora-3
      weight: 10
    - name: sql-lora-4
      weight: 10
```

Adjust weights for a canary release

To gradually shift traffic between adapter groups, change the weights. For example, to route 90% of traffic to `sql-lora` adapters and 10% to `tweet-summary` adapters:

```yaml
  targetModels:
    - name: sql-lora
      weight: 18 # 90 / 5 adapters = 18 per adapter
    - name: sql-lora-1
      weight: 18
    # ... remaining sql-lora adapters with weight: 18
    - name: tweet-summary
      weight: 2 # 10 / 5 adapters = 2 per adapter
    - name: tweet-summary-1
      weight: 2
    # ... remaining tweet-summary adapters with weight: 2
```

After validating the new adapter group, increase its weight further until you reach 100%.

Apply the resource:

```shell
kubectl apply -f inferencemodel.yaml
```
Create the LLMRoute
The LLMRoute binds the gateway to the InferencePool, routing all requests on port 8080 to the inference service.
Create a file named `llmroute.yaml`:

```yaml
apiVersion: istio.alibabacloud.com/v1
kind: LLMRoute
metadata:
  name: test-llm-route
spec:
  gateways:
    - llm-inference-gateway
  host: test.com
  rules:
    - backendRefs:
        - backendRef:
            group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
```

Apply the resource using the ASM kubeconfig:

```shell
kubectl apply -f llmroute.yaml
```
Verify the traffic distribution
Before sending test traffic, confirm that all resources are ready.
Check that the InferencePool and InferenceModel resources exist:

```shell
kubectl get inferencepool vllm-llama2-7b-pool
kubectl get inferencemodel inferencemodel-sample
```

Send multiple requests through the gateway. Replace `${ASM_GATEWAY_IP}` with the IP address of your ASM ingress gateway.

```shell
curl -H "host: test.com" ${ASM_GATEWAY_IP}:8080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "lora-request",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }' -v
```

Check the response. The `model` field indicates which LoRA adapter served the request:

```json
{
  "id": "cmpl-2fc9a351-d866-422b-b561-874a30843a6b",
  "object": "text_completion",
  "created": 1736933141,
  "model": "tweet-summary-1",
  "choices": [
    {
      "index": 0,
      "text": "...",
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 102,
    "completion_tokens": 100
  }
}
```

After sending multiple requests, the ratio between `tweet-summary` and `sql-lora` responses should be approximately 1:1.
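To quantify the split, tally the model field over many responses. A sketch that computes per-group shares from parsed response bodies (the sample data here is synthetic; in practice, feed it the JSON bodies returned by repeated curl calls):

```python
from collections import Counter

def group_ratio(responses: list[dict]) -> dict[str, float]:
    """Return the share of responses served by each adapter group.

    The group is the adapter name with its trailing `-<n>` suffix
    stripped, e.g. `tweet-summary-1` -> `tweet-summary`.
    """
    groups = Counter()
    for resp in responses:
        name = resp["model"]
        group = name.rsplit("-", 1)[0] if name[-1].isdigit() else name
        groups[group] += 1
    total = sum(groups.values())
    return {g: n / total for g, n in groups.items()}

# Synthetic sample standing in for real response bodies.
sample = [{"model": m} for m in
          ["tweet-summary-1", "sql-lora", "sql-lora-3", "tweet-summary"]]
print(group_ratio(sample))  # {'tweet-summary': 0.5, 'sql-lora': 0.5}
```

With equal weights, expect the two groups to converge toward 0.5 each as the request count grows.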
(Optional) Set up observability
After you configure the InferencePool, InferenceModel, and LLMRoute resources, monitor the LLM inference service through ASM metrics and vLLM metrics.
Collect ASM metrics
Enable LLM traffic observability in the ASM console. This adds model-level dimensions to ASM monitoring metrics, including log fields and metric labels. For detailed instructions, see Efficiently manage LLM traffic using ASM.
Collect the metrics using Prometheus within the observability framework or a self-hosted Prometheus instance.
ASM exposes two LLM-specific metrics:

| Metric | Description |
|---|---|
| `asm_llm_proxy_prompt_tokens` | Number of input tokens per request |
| `asm_llm_proxy_completion_tokens` | Number of output tokens per request |

Add the following scrape configuration to Prometheus. For details, see Other Prometheus service discovery configurations.

```yaml
scrape_configs:
  - job_name: asm-envoy-stats-llm
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /stats/prometheus
    scheme: http
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: .*-envoy-prom
      - source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:15090
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels:
          - __meta_kubernetes_namespace
        action: replace
        target_label: namespace
      - source_labels:
          - __meta_kubernetes_pod_name
        action: replace
        target_label: pod_name
    metric_relabel_configs:
      - action: keep
        source_labels:
          - __name__
        regex: asm_llm_.*
```
Collect vLLM metrics
The deployment YAML already includes Prometheus annotations on the vLLM pods, so metrics are scraped automatically through Prometheus default service discovery. For details, see Default service discovery.

```yaml
annotations:
  prometheus.io/path: /metrics
  prometheus.io/port: "8000"
  prometheus.io/scrape: "true"
```

Key vLLM metrics:

| Metric | Description |
|---|---|
| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization. Lower values mean more capacity for new requests. |
| `vllm:request_queue_time_seconds_sum` | Time requests spend waiting in the queue before inference begins |
| `vllm:num_requests_running` | Number of requests currently running inference |
| `vllm:num_requests_waiting` | Number of requests waiting in the queue |
| `vllm:num_requests_swapped` | Number of requests swapped to CPU memory |
| `vllm:avg_generation_throughput_toks_per_s` | Decode-stage token throughput (tokens/second) |
| `vllm:avg_prompt_throughput_toks_per_s` | Prefill-stage token throughput (tokens/second) |
| `vllm:time_to_first_token_seconds_bucket` | Time from request arrival to first token output (TTFT). A key metric for user-perceived latency. |
| `vllm:e2e_request_latency_seconds_bucket` | End-to-end request latency |
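The `_bucket` metrics are Prometheus histograms, so latency percentiles are estimated from cumulative bucket counts; in PromQL you would use histogram_quantile. A simplified Python sketch of that interpolation, with made-up TTFT bucket values:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Mirrors the linear interpolation PromQL's histogram_quantile
    performs inside the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: no interpolation possible
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Made-up cumulative TTFT buckets: 80 requests <= 0.1s, 95 <= 0.5s, 100 <= 1s.
ttft = [(0.1, 80), (0.5, 95), (1.0, 100)]
print(histogram_quantile(0.95, ttft))  # -> 0.5
```

The equivalent PromQL for a p95 TTFT panel would be `histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))`; the accuracy of either estimate depends on how finely the buckets are spaced.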
Build a Grafana dashboard
Build a Grafana dashboard to visualize both ASM and vLLM metrics:
ASM metrics: Track request rate and token throughput per model.
vLLM metrics: Monitor GPU cache utilization, queue depth, and per-request latency.
To set up the dashboard:
Add your Prometheus instance as a data source in the Grafana console.
Import the dashboard JSON provided below.
The following figure shows an example of the dashboard:
