Deploying large language models (LLMs) fine-tuned with Low-Rank Adaptation (LoRA) in Kubernetes clusters is a flexible and efficient way to offer customized inference capabilities. This topic explains how to deploy Multi-LoRA LLM inference services within Service Mesh (ASM), define traffic distribution policies for different LoRA models, and implement gray releases of LoRA models.
Before you begin
Before reading this topic, you need to understand:
How to use GPU computing power in a Container Compute Service (ACS) cluster.
How to add a GPU-accelerated node pool to a Container Service for Kubernetes (ACK) cluster or use the computing power of ACS in an ACK Pro cluster.
By reading this topic, you can learn about:
Background information on LoRA and Multi-LoRA technologies.
Implementation principles of gray release scenarios for LoRA fine-tuning models.
Procedures for implementing LoRA model gray releases using Multi-LoRA technology.
Background information
LoRA and Multi-LoRA
LoRA is a widely adopted technology for fine-tuning large language models (LLMs) cost-effectively to meet the specific needs of various sectors, including healthcare, finance, and education. It enables the deployment of multiple LoRA model weights on a single base LLM for inference, allowing for efficient GPU resource sharing, known as Multi-LoRA technology. The vLLM platform supports the loading and inference of multiple LoRA models.
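For reference, the following is a minimal sketch of how a vLLM OpenAI-compatible server might be started with several LoRA adapters registered on one base model. The base model name and adapter paths are illustrative assumptions and are not the deployment used later in this topic.

# Sketch: serve one base Llama2 model plus multiple LoRA adapters with vLLM.
# The model name and adapter paths below are hypothetical placeholders.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules \
    sql-lora=/models/loras/sql-lora \
    tweet-summary=/models/loras/tweet-summary \
  --port 8000

Each registered adapter is then addressed by its name in the model field of inference requests, while all adapters share the GPU memory of the single base model.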
Gray release scenarios for LoRA fine-tuning models
In Multi-LoRA scenarios, multiple LoRA models can be loaded into a single LLM inference service. Requests for different models are distinguished by the model name in the request, enabling gray testing between various LoRA models to assess the fine-tuning effects on the base LLM.
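For example, against a vLLM server that has both adapters registered, the same prompt can be sent to two different LoRA models by changing only the model field. The endpoint and prompt below are illustrative placeholders.

# Send the same prompt to two LoRA models to compare fine-tuning effects.
curl http://<vllm-server>:8000/v1/completions -H 'Content-Type: application/json' \
  -d '{"model": "sql-lora", "prompt": "List the names of all employees", "max_tokens": 50}'
# Only the model name changes; the base LLM and GPU resources are shared.
curl http://<vllm-server>:8000/v1/completions -H 'Content-Type: application/json' \
  -d '{"model": "tweet-summary", "prompt": "List the names of all employees", "max_tokens": 50}'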
Prerequisites
Ensure you have either created an ACK managed cluster with a GPU node pool or selected an ACS cluster in a recommended zone for GPU computing power. For more information, see Create an ACK managed cluster and Create an ACS cluster.
You can install the ACK Virtual Node component in your ACK managed cluster to utilize ACS GPU computing capabilities. For more information, see ACS GPU computing power in ACK.
The cluster is added to an ASM instance of version 1.24 or later. For more information, see Add a cluster to an ASM instance.
An ingress gateway is created and the HTTP service on port 8080 is enabled. For more information, see Create an ingress gateway.
(Optional) A Sidecar is injected into the default namespace. For more information, see Enable automatic sidecar proxy injection.
Note: You can skip sidecar injection if you do not want to try out the observability operations in this topic.
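If you enable automatic injection at the namespace level, one common approach is to label the namespace as shown below. This is a sketch that assumes the standard istio-injection namespace label is used; refer to the linked topic for the injection method that applies to your ASM instance.

# Enable automatic sidecar injection for the default namespace.
kubectl label namespace default istio-injection=enabled --overwrite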
Procedures
This practice deploys the Llama2 model as the base model in the cluster using vLLM and registers 10 LoRA models fine-tuned from this base model, named sql-lora through sql-lora-4 and tweet-summary through tweet-summary-4. You can verify the procedure in either an ACK cluster with GPU-accelerated nodes or an ACS cluster, as needed.
Step 1: Deploy the example LLM inference service
Create vllm-service.yaml using the content provided below.
Note: The image discussed in this topic requires a GPU with more than 16 GiB of video memory. The T4 card type, which has only 16 GiB of video memory, does not provide sufficient resources to launch this application. It is recommended to use the A10 card type for ACK clusters and the 8th-generation GPU B card type for ACS clusters. For detailed model information, submit a ticket for further assistance.
Due to the large size of the LLM image, it is advisable to pre-store it in ACR and use the internal network address for pulling. Pulling directly from the public network may result in long wait times depending on the cluster EIP bandwidth configuration.
The manifest content differs between ACK clusters, where the Pod is scheduled to a GPU-accelerated node, and ACS clusters, where the Pod requests ACS GPU computing power.
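The following is only a minimal sketch of what vllm-service.yaml could look like for an ACK cluster with GPU-accelerated nodes. The image address, model and adapter paths, and resource settings are illustrative assumptions; for ACS clusters, request ACS GPU computing power instead of a node GPU resource. The Pod label app: vllm-llama2-7b-pool and container port 8000 must match the InferencePool created in Step 3.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      labels:
        app: vllm-llama2-7b-pool    # Label selected by the InferencePool in Step 3.
    spec:
      containers:
        - name: vllm
          image: <your-acr-address>/vllm-openai:latest   # Hypothetical internal ACR image address.
          command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - --model=/models/llama-2-7b                 # Hypothetical base model path; volume mounts omitted for brevity.
            - --enable-lora
            - --lora-modules
            - sql-lora=/models/loras/sql-lora            # Hypothetical adapter paths; register the remaining
            - tweet-summary=/models/loras/tweet-summary  # sql-lora-* and tweet-summary-* adapters the same way.
            - --port=8000
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"    # For ACK GPU-accelerated nodes; adjust for ACS GPU computing power.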
Deploy the LLM inference service using the kubeconfig file of the cluster on the data plane.
kubectl apply -f vllm-service.yaml
Step 2: Configure ASM gateway rules
Deploy gateway rules to enable port 8080 listening on the ASM gateway.
Create a file named gateway.yaml with the following content.
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: llm-inference-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - '*'
      port:
        name: http-service
        number: 8080
        protocol: HTTP

Create the gateway rule.
kubectl apply -f gateway.yaml
Step 3: Configure LLM inference service routing and load balancing
Enable routing for the LLM inference service using the ASM kubeconfig.
kubectl patch asmmeshconfig default --type=merge --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'

Deploy the InferencePool resource.
The InferencePool resource defines a set of LLM inference service workloads in the cluster through a label selector. ASM will enable vLLM load balancing for the LLM inference services based on the InferencePool you create.
Create inferencepool.yaml using the content provided below.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool

Create the InferencePool resource using the kubeconfig of the data plane cluster.
kubectl apply -f inferencepool.yaml
Deploy the InferenceModel resource.
The InferenceModel specifies traffic distribution policies for specific models within the InferencePool.
Create inferencemodel.yaml using the content provided below.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: lora-request
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
    - name: tweet-summary
      weight: 10
    - name: tweet-summary-1
      weight: 10
    - name: tweet-summary-2
      weight: 10
    - name: tweet-summary-3
      weight: 10
    - name: tweet-summary-4
      weight: 10
    - name: sql-lora
      weight: 10
    - name: sql-lora-1
      weight: 10
    - name: sql-lora-2
      weight: 10
    - name: sql-lora-3
      weight: 10
    - name: sql-lora-4
      weight: 10

Each of the 10 target models has a weight of 10, so each receives about 10% of the requests whose model name is lora-request. In aggregate, 50% of these requests are routed to the tweet-summary series of LoRA models and the remaining 50% to the sql-lora series.

Create the InferenceModel resource.
kubectl apply -f inferencemodel.yaml
Create the LLMRoute resource.
Set up routing rules for the gateway by creating the LLMRoute resource, which directs all requests received on port 8080 to the example LLM inference service, referencing the InferencePool resource.
Create llmroute.yaml using the content provided below.
apiVersion: istio.alibabacloud.com/v1
kind: LLMRoute
metadata:
  name: test-llm-route
spec:
  gateways:
    - llm-inference-gateway
  host: test.com
  rules:
    - backendRefs:
        - backendRef:
            group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool

Create the LLMRoute resource.
kubectl apply -f llmroute.yaml
Step 4: Verify the execution result
Run the following command multiple times to initiate the test:
curl -H "host: test.com" ${ASM gateway IP}:8080/v1/completions -H 'Content-Type: application/json' -d '{
"model": "lora-request",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}' -v

You can see output similar to the following:
{"id":"cmpl-2fc9a351-d866-422b-b561-874a30843a6b","object":"text_completion","created":1736933141,"model":"tweet-summary-1","choices":[{"index":0,"text":", I'm a newbie to this forum. Write a summary of the article.\nWrite a summary of the article.\nWrite a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":2,"total_tokens":102,"completion_tokens":100,"prompt_tokens_details":null}}The model field indicates which model is providing the service. After several requests, you will observe that the request volume ratio between the tweet-summary and sql-lora models is approximately 1:1.
(Optional) Step 5: Configure observability metrics and dashboard for LLM services
After declaring LLM inference services in the cluster using the InferencePool and InferenceModel resources and setting up routing policies, you can observe the LLM inference services through logs and monitoring metrics.
Enable the LLM traffic observability feature in the ASM console to collect monitoring metrics.
Improve the observability of LLM inference requests by incorporating additional log fields, metrics, and metric dimensions. For detailed configuration instructions, see Traffic observation: Efficiently manage LLM traffic using ASM.
Upon configuration completion, a model dimension is added to the ASM monitoring metrics. You can collect these metrics either by using Prometheus within the observability monitoring framework or by integrating a self-hosted Prometheus instance for service mesh monitoring.

ASM introduces two new metrics: asm_llm_proxy_prompt_tokens, which represents the number of input tokens, and asm_llm_proxy_completion_tokens, which represents the number of output tokens for all requests. You can collect these metrics by adding the following rules to Prometheus. For instructions, see Other Prometheus service discovery configurations.

scrape_configs:
  - job_name: asm-envoy-stats-llm
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /stats/prometheus
    scheme: http
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: .*-envoy-prom
      - source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:15090
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels:
          - __meta_kubernetes_namespace
        action: replace
        target_label: namespace
      - source_labels:
          - __meta_kubernetes_pod_name
        action: replace
        target_label: pod_name
    metric_relabel_configs:
      - action: keep
        source_labels:
          - __name__
        regex: asm_llm_.*
Collect monitoring metrics for vLLM service.
The ASM monitoring metrics described above mainly cover the throughput of external LLM inference requests. To monitor the internal state of the vLLM service, add the following Prometheus collection annotations to the vLLM service pod so that the metrics exposed by the vLLM server are scraped.
...
  annotations:
    prometheus.io/path: /metrics   # The HTTP path on which the metrics are exposed.
    prometheus.io/port: "8000"     # The port on which the metrics are exposed, which is the listening port of the vLLM server.
    prometheus.io/scrape: "true"   # Whether to scrape the metrics of the current pod.
...

Retrieve metrics related to the vLLM service using Prometheus's default service discovery mechanism. For detailed instructions, see Default service discovery.
Key metrics from the vLLM service provide insight into the internal state of the vLLM workload.
| Metric name | Description |
| --- | --- |
| vllm:gpu_cache_usage_perc | The percentage of GPU cache used by vLLM. When vLLM starts, it pre-allocates as much GPU video memory as possible for the KV cache. For the vLLM server, the lower the cache utilization, the more space the GPU has available for new requests. |
| vllm:request_queue_time_seconds_sum | The time spent in the waiting queue. After an LLM inference request arrives at the vLLM server, it may not be processed immediately; it must wait for the vLLM scheduler to schedule the prefill and decode stages. |
| vllm:num_requests_running, vllm:num_requests_waiting, vllm:num_requests_swapped | The number of requests that are running inference, waiting, and swapped to memory, respectively. These metrics can be used to evaluate the current request pressure on the vLLM service. |
| vllm:avg_prompt_throughput_toks_per_s, vllm:avg_generation_throughput_toks_per_s | The number of tokens consumed by the prefill stage and generated by the decode stage per second, respectively. |
| vllm:time_to_first_token_seconds_bucket | The latency from the time a request is sent to the vLLM service until the first token is returned. This metric usually represents the time it takes for a client to receive the first response after submitting the request content, and it is an important factor in the LLM user experience. |
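For example, assuming the vLLM metrics above are being scraped, PromQL queries along the following lines could be used to derive latency and queue views. The 5-minute window, the 0.9 quantile, and the pod_name label are illustrative choices and depend on your scrape configuration.

# 90th percentile time-to-first-token over the last 5 minutes.
histogram_quantile(0.9, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

# Current number of queued requests per pod.
sum(vllm:num_requests_waiting) by (pod_name)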
Configure a Grafana dashboard to monitor LLM inference services.
Observe LLM inference services deployed with vLLM through the Grafana dashboard:
Monitor the request rate and token throughput using ASM monitoring metrics;
Assess the internal state of the workloads for the LLM inference services with vLLM monitoring metrics.
You can create a data source (a Prometheus instance) in the Grafana console. Ensure that the ASM and vLLM monitoring metrics have been collected by that Prometheus instance.
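Before importing the dashboard, you can sanity-check the data source with PromQL queries such as the following. This is a sketch; the model dimension and exact label names depend on the observability configuration described above.

# Output-token throughput per LoRA model, based on the ASM metric.
sum(rate(asm_llm_proxy_completion_tokens[5m])) by (model)

# Input-token throughput per LoRA model.
sum(rate(asm_llm_proxy_prompt_tokens[5m])) by (model)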
To create an observability dashboard for LLM inference services, import the content provided below into Grafana.
After the import completes, the dashboard displays the ASM request metrics and the vLLM internal state metrics described above.