Deploy standalone LLM inference services - Container Service for Kubernetes

This topic uses the Qwen3-32B model as an example to demonstrate how to deploy a standalone large language model (LLM) inference service in Container Service for Kubernetes (ACK) clusters using vLLM and SGLang.

Background

Qwen3-32B
Qwen3-32B represents the latest evolution in the Qwen series, featuring a 32.8B-parameter dense architecture optimized for both reasoning efficiency and conversational fluency.
Key features:
- Dual-mode performance: Excels at complex tasks like logical reasoning, math, and code generation, while remaining highly efficient for general text generation.
- Advanced capabilities: Demonstrates excellent performance in instruction following, multi-turn dialog, creative writing, and best-in-class tool use for AI agent tasks.
- Large context window: Natively handles up to 32,000 tokens of context, which can be extended to 131,000 tokens using YaRN technology.
- Multilingual support: Understands and translates over 100 languages, making it ideal for global applications.
For more information, see the blog, GitHub, and documentation.
vLLM
vLLM is a fast and lightweight library designed to optimize LLM inference and serving, significantly increasing throughput and reducing latency.
Core optimizations:
- PagedAttention: An innovative attention algorithm that efficiently manages the Key-Value (KV) cache to minimize memory waste and increase throughput.
- Advanced inference: Improves speed and utilization with continuous batching, speculative decoding, and CUDA/HIP graph acceleration.
- Wide range of parallelism: Supports Tensor Parallelism (TP), Pipeline Parallelism (PP), Data Parallelism (DP), and Expert Parallelism (EP) to scale across multiple GPUs.
- Quantization support: Compatible with popular quantization formats like GPTQ, AWQ, INT4/8, and FP8 to reduce the model's memory footprint.
Broad compatibility:
- Hardware and models: Runs on NVIDIA, AMD, and Intel GPUs and supports mainstream models from Hugging Face and ModelScope (such as Qwen, Llama, Deepseek, and E5-Mistral).
- Standard API: Provides an OpenAI-compatible API, making it easy to integrate into existing applications.
For more information, see vLLM GitHub.
SGLang
SGLang is an inference engine that combines a high-performance backend with a flexible frontend, designed for both LLM and multimodal workloads.
High-performance backend:
- Advanced caching: Features RadixAttention (an efficient prefix cache) and PagedAttention to maximize throughput during complex inference tasks.
- Efficient execution: Uses continuous batching, speculative decoding, PD separation, and multi-LoRA batching to efficiently serve multiple users and fine-tuned models.
- Full parallelism and quantization: Supports TP, PP, DP, and EP parallelism, along with various quantization methods (FP8, INT4, AWQ, GPTQ).
Flexible frontend:
- Powerful programming interface: Enables developers to easily build complex applications with features such as chained generation, control flow, and parallel processing.
- Multimodal and external interaction: Natively supports multimodal inputs (such as text and images) and allows for interaction with external tools, making it ideal for advanced agent workflows.
- Broad model support: Supports generative models (Qwen, DeepSeek, Llama), embedding models (E5-Mistral), and reward models (Skywork).
For more information, see SGLang GitHub.

Prerequisites

A Container Service for Kubernetes (ACK) cluster running Kubernetes 1.22 or later is created, with GPU-accelerated nodes added. For more information, see Create an ACK managed cluster and Add GPU nodes to a cluster.

The process described in this topic requires more than 64 GB of GPU memory. The ecs.gn8is-2x.8xlarge instance type is recommended. For details, see GPU-accelerated compute-optimized instance family gn8is.

Model deployment

Step 1: Prepare the Qwen3-32B model files

Run the following command to download the Qwen3-32B model from ModelScope.
If the git-lfs plugin is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For more installation methods, see Installing Git Large File Storage.
```
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull
```
Log on to the OSS console and record the name of your bucket. If you haven't created one, see Create buckets. Create a directory in Object Storage Service (OSS) and upload the model to it.
For more information about how to install and use ossutil, see Install ossutil.
```
ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B
```

Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) for your cluster. For detailed instructions, see Create a PV and a PVC.

Example using console

Create a PV

Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volumes.
In the upper-right corner of the Persistent Volumes page, click Create.

In the Create PV dialog box, configure the parameters that are described in the following table.

The following table describes the basic configuration of the sample PV:

Parameter	Description
PV Type	In this example, select OSS.
Volume Name	In this example, enter llm-model.
Access Certificate	Configure the AccessKey ID and AccessKey secret used to access the OSS bucket.
Bucket ID	Select the OSS bucket you created in the preceding step.
OSS Path	Enter the path where the model is located, such as `/Qwen3-32B`.

Create a PVC

On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volume Claims.
In the upper-right corner of the Persistent Volume Claims page, click Create.

In the Create PVC dialog box, configure the parameters that are described in the following table.

The following table describes the basic configuration of the sample PVC.

Configuration Item	Description
PVC Type	In this example, select OSS.
Name	In this example, enter llm-model.
Allocation Mode	In this example, select Existing Volumes.
Existing Volumes	Click the Select PV hyperlink and select the PV that you created.

Example using kubectl

Use the following YAML template to create a file named llm-model.yaml, containing configurations for a Secret, a static PV, and a static PVC.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The bucket name.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # In this example, the path is /Qwen3-32B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Create the Secret, static PV, and static PVC.
```
kubectl create -f llm-model.yaml
```

Step 2: Deploy the inference service

Use the following YAML template to deploy a standalone LLM inference service in ACK using the vLLM and SGLang inference engines.

Deploy a standalone inference service with vLLM

Create a file named vllm.yaml.

Expand to view the YAML template

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    # for prometheus to scrape
    alibabacloud.com/inference-workload: vllm-inference
    alibabacloud.com/inference_backend: vllm
  name: vllm-inference
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      alibabacloud.com/inference-workload: vllm-inference
      alibabacloud.com/inference_backend: vllm
  template:
    metadata:
      labels:
        alibabacloud.com/inference-workload: vllm-inference
        alibabacloud.com/inference_backend: vllm
    spec:
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 15Gi
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/Qwen3-32B --port 8000 --trust-remote-code --max-model-len 2048 --gpu-memory-utilization 0.85 --tensor-parallel-size 2
#        image: vllm/vllm-openai:v0.10.0
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
        name: vllm
        ports:
        - containerPort: 8000
          name: http
        readinessProbe:
          initialDelaySeconds: 30
          periodSeconds: 10
          tcpSocket:
            port: 8000
        resources:
          limits:
            nvidia.com/gpu: "2"
            memory: "16Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: "2"
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - mountPath: /models/Qwen3-32B
          name: model
        - mountPath: /dev/shm
          name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  type: ClusterIP
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference

Deploy the standalone LLM inference service using the vLLM framework.
```
kubectl create -f vllm.yaml
```

Deploy a standalone inference service with SGLang

Create a file named sglang.yaml.

Expand to view the YAML template

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    # for prometheus to scrape
    alibabacloud.com/inference-workload: sgl-inference
    alibabacloud.com/inference_backend: sglang
  name: sgl-inference
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      alibabacloud.com/inference-workload: sgl-inference
      alibabacloud.com/inference_backend: sglang
  template:
    metadata:
      labels:
        alibabacloud.com/inference-workload: sgl-inference
        alibabacloud.com/inference_backend: sglang
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
      containers:
        - command:
            - sh
            - -c
            - "python3 -m sglang.launch_server --model-path /models/Qwen3-32B --tp 2 --trust-remote-code  --context-length 2048 --host 0.0.0.0 --port 8000 --enable-metrics"
          image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104
          name: sglang
          ports:
            - containerPort: 8000
              name: http
          readinessProbe:
            initialDelaySeconds: 30
            periodSeconds: 10
            tcpSocket:
              port: 8000
          resources:
            limits:
              nvidia.com/gpu: "2"
              memory: "16Gi"
              cpu: "4"
            requests:
              nvidia.com/gpu: "2"
              memory: "16Gi"
              cpu: "4"
          volumeMounts:
            - mountPath: /models/Qwen3-32B
              name: model
            - mountPath: /dev/shm
              name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  type: ClusterIP
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference

Deploy the standalone LLM inference service using the SGLang framework.
```
kubectl create -f sglang.yaml
```

Step 3: Validate the inference service

Run the following command to establish port forwarding between the inference service and your local environment.
Important
Port forwarding established by kubectl port-forward lacks production-grade reliability, security, and scalability. It is suitable for development and debugging purposes only and should not be used in production environment. For production-ready network solutions in Kubernetes clusters, see Ingress management.
```
kubectl port-forward svc/inference-service 8000:8000
```
Expected output:
```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```

Run the following command to send a sample request to the model inference service:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json"  -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 30, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"chatcmpl-d490443cd4094bdf86a1a49144f77444","object":"chat.completion","created":1753684011,"model":"/models/Qwen3-32B","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user sent \"Test\". I need to confirm their request first. They might be testing my functionality or want to see my reaction.","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":40,"completion_tokens":30,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

The output indicates that the model can generate a response based on the given input, which is a test message in this example.

References

Configure Prometheus monitoring for LLM inference services
In a production environment, monitoring the health and performance of your LLM service is critical for maintaining stability. By integrating with Managed Service for Prometheus, you can collect detailed metrics to:
- Detect failures and performance bottlenecks.
- Troubleshoot issues with real-time data.
- Analyze long-term performance trends to optimize resource allocation.
Configure auto scaling for LLM inference services
LLM workloads often fluctuate, leading to either over-provisioned resources or poor performance during traffic spikes. The Kubernetes Horizontal Pod Autoscaler (HPA), integrated with ack-alibaba-cloud-metrics-adapter, solves this by:
- Automatically scaling your pods based on real-time GPU, CPU, and memory utilization.
- Allowing you to define custom metrics for more sophisticated scaling triggers.
- Ensuring high availability during peak demand while reducing costs during idle periods.
Implement intelligent routing and traffic management by using Gateway with Inference Extension
ACK Gateway with Inference Extension is a powerful ingress controller built on the Kubernetes Gateway API to simplify and optimize routing for AI/ML workloads. Key features include:
- Model-aware load balancing: Provides optimized load balancing policies to ensure efficient distribution of inference requests.
- Intelligent model routing: Routes traffic based on the model name in the request payload. This is ideal for managing multiple fine-tuned models (e.g., different LoRA variants) behind a single endpoint or for implementing traffic splitting for canary releases.
- Request prioritization: Assigns priority levels to different models, ensuring that requests to your most critical models are processed first, guaranteeing quality of service.
Accelerate model loading with Fluid distributed caching
Large model files (>10 GB) stored in services like OSS or File Storage NAS can cause slow pod startups (cold starts) due to long download times. Fluid solves this problem by creating a distributed caching layer across your cluster's nodes. This significantly accelerates model loading in two key ways:
- Accelerated data throughput: Fluid pools the storage capacity and network bandwidth of all nodes in the cluster. This creates a high-speed, parallel data layer that overcomes the bottleneck of pulling large files from a single remote source.
- Reduced I/O latency: By caching model files directly on the compute nodes where they are needed, Fluid provides applications with local, near-instant access to data. This optimized read mechanism eliminates the long delays associated with network I/O.