Container Compute Service (ACS) provides efficient and flexible container management and orchestration capabilities to enable you to deploy large language models (LLMs) and run LLM inference services. This topic describes how to deploy an LLM inference service from a container image in ACS.
Preparations
The information about the container image used to deploy the LLM inference service in ACS is obtained.
In the left-side navigation pane, click Artifact Center.
In the Repository Name search box, enter llm-inference and find the egslingjun/llm-inference or egslingjun/inference-nv-pytorch image.
The egslingjun/llm-inference and egslingjun/inference-nv-pytorch images support the vLLM inference library and the DeepGPU-LLM inference engine. They can help you quickly set up an inference environment for LLMs such as Llama, ChatGLM, Baichuan, and Qwen models. The images are updated every month. The following table describes the container images.

| Image name | Image tag | Component information | Image address |
| --- | --- | --- | --- |
| llm-inference | vllm0.4.2-deepgpu-llm24.5-pytorch2.3.0-cuda12.1-ubuntu22.04 | Base image: Ubuntu 22.04, Python 3.10, Torch 2.3.0, CUDA 12.1, vLLM 0.4.2, deepgpu-llm 24.5+pt2.3cu121 | egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.4.2-deepgpu-llm24.5-pytorch2.3.0-cuda12.1-ubuntu22.04 |
| llm-inference | vllm0.4.3-deepgpu-llm24.6-pytorch2.4.0-cuda12.4-ubuntu22.04 | Base image: Ubuntu 22.04, Python 3.10, Torch 2.4.0, CUDA 12.4.1, vLLM 0.4.3, deepgpu-llm 24.6+pt2.4cu124 | egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.4.3-deepgpu-llm24.6-pytorch2.4.0-cuda12.4-ubuntu22.04 |
| llm-inference | vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04 | Base image: Ubuntu 22.04, Python 3.10, Torch 2.4.0, CUDA 12.4.1, vLLM 0.5.4, deepgpu-llm 24.7.2+pt2.4cu124 | egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04 |
| llm-inference | vllm0.6.3-post1-deepgpu-llm24.9-pytorch2.4.0-cuda12.4-ubuntu22.04 | Base image: Ubuntu 22.04, Python 3.10, Torch 2.4.0, CUDA 12.4.1, vLLM 0.6.3.post1, deepgpu-llm 24.9+pt2.4cu124 | egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.6.3-post1-deepgpu-llm24.9-pytorch2.4.0-cuda12.4-ubuntu22.04 |
| llm-inference | vllm0.6.4.post1-deepgpu-llm24.10-pytorch2.5.1-cuda12.4-ubuntu22.04-201412 | Base image: Ubuntu 22.04, Python 3.10, Torch 2.5.1, CUDA 12.4.1, vLLM 0.6.4.post1, deepgpu-llm 24.10+pt2.5cu124 | egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.6.4.post1-deepgpu-llm24.10-pytorch2.5.1-cuda12.4-ubuntu22.04-201412 |
| inference-nv-pytorch | 25.01-vllm0.6.5-deepgpu-llm24.10-pytorch2.5-cuda12.4-20250121 | Base image: Ubuntu 22.04, Python 3.10, Torch 2.5.1, CUDA 12.4.1, vLLM 0.6.5, deepgpu-llm 24.10+pt2.5cu124 | egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.01-vllm0.6.5-deepgpu-llm24.10-pytorch2.5-cuda12.4-20250121 |
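Optionally, if you have a machine with Docker installed and network access to the registry, you can pull one of the listed images in advance to verify that the image address is reachable. This is only a convenience check; ACS pulls the image itself during deployment.

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04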
A NAS file system is prepared. LLM model files require a large amount of disk space, so we recommend that you create a NAS volume to store them. For more information, see Mount a statically provisioned NAS volume or Mount a dynamically provisioned NAS volume. In this example, a statically provisioned NAS volume is used and the suggested volume size is 20 GiB.
Create a PVC based on the following parameters. For more information, see Create a PVC. If you prefer to create the volume with kubectl, a YAML sketch that matches these parameters is provided after the table.
| Parameter | Example |
| --- | --- |
| Name | nas-test |
| Allocation Mode | Select Use Mount Target Domain Name. |
| Capacity | 20 GiB |
| Mount Target Domain Name | 08cxxxxxxec-wxxxxxxcn-hangzhou.nas.aliyuncs.com |
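The following is a minimal sketch of a statically provisioned NAS PV and PVC that matches the parameters above. It assumes that the cluster uses the Alibaba Cloud CSI NAS plugin (nasplugin.csi.alibabacloud.com); refer to Mount a statically provisioned NAS volume for the authoritative field list, and replace the example mount target domain name with your own.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-test
  labels:
    alicloud-pvname: nas-test
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    # volumeHandle is typically set to the PV name.
    volumeHandle: nas-test
    volumeAttributes:
      # The mount target domain name of the NAS file system (example value).
      server: "08cxxxxxxec-wxxxxxxcn-hangzhou.nas.aliyuncs.com"
      path: "/"
  mountOptions:
    - nolock,tcp,noresvport
    - vers=3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nas-test
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  # Bind the PVC to the PV above by label.
  selector:
    matchLabels:
      alicloud-pvname: nas-test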
Mount the NAS file system to an ECS instance. For more information, see Mount an NFS file system in the NAS console.
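For reference, mounting the NAS file system over NFSv3 from the ECS command line typically looks like the following sketch. The mount target domain name is the example value from the PVC table and must be replaced with your own; see the NAS documentation linked above for the recommended mount options.

sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport 08cxxxxxxec-wxxxxxxcn-hangzhou.nas.aliyuncs.com:/ /mnt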
Run the following command to download the model to the NAS file system.
cd /mnt
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Note: The Qwen2.5-7B model in this example is about 15 GB in size. It takes about 20 minutes to download over a 100 Mbit/s connection. You can instead choose the more lightweight Qwen2.5-3B or Qwen2.5-0.5B model.
modelscope requires Python 3.10 or later. We recommend that you purchase an ECS instance that runs Ubuntu 22.04 because it uses Python 3.10 by default. You can also specify a public or custom image that uses Python 3.10 or later.
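After the download completes, you can confirm that the model directory contains the expected weight, tokenizer, and configuration files, and check its total size:

ls -lh /mnt/Qwen2.5-7B-Instruct
du -sh /mnt/Qwen2.5-7B-Instruct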
Deploy the LLM inference service
Use kubectl to connect to the ACS cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
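Before you continue, you can confirm that kubectl points to the intended ACS cluster, for example:

kubectl cluster-info
kubectl get ns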
Create a file named your-deployment.yaml based on the following content.
Note: For more information about the GPU model used in this example, see GPU models.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: my-deployment
  name: my-deployment
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: my-deployment
  template:
    metadata:
      labels:
        # Set the GPU model to example-model. The value is for reference only.
        alibabacloud.com/gpu-model-series: example-model
        # Set the compute class to gpu.
        alibabacloud.com/compute-class: gpu
        # If you want to use BestEffort pods, set the following parameter to best-effort.
        alibabacloud.com/compute-qos: default
        app: my-deployment
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --model /mnt/Qwen2.5-7B-Instruct --trust-remote-code --tensor-parallel-size 1 --disable-custom-all-reduce
        image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04
        imagePullPolicy: IfNotPresent #Always
        name: my-deployment
        resources:
          limits:
            cpu: 16
            memory: 64Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: 16
            memory: 64Gi
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt
          name: nas-test
        - mountPath: /dev/shm
          name: cache-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: nas-test
        persistentVolumeClaim:
          # The name of the PVC that provides the NAS volume.
          claimName: nas-test
      - name: cache-volume
        emptyDir:
          medium: Memory
          sizeLimit: 64G
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "internet"
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-ip-version: ipv4
  labels:
    app: my-deployment
  name: svc-llm
  namespace: default
spec:
  externalTrafficPolicy: Local
  ports:
  - name: serving
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: my-deployment
  type: LoadBalancer

Run the following command to deploy the LLM inference task and Service.
kubectl apply -f your-deployment.yaml

Deploying the model takes about 20 minutes because of the size of the image. You can run the following command to query the deployment progress.
kubectl logs -f my-deployment-787b8xxxxx-xxxxx

If the following output is returned, the model is successfully deployed.
INFO: Started server process [2]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 12-13 12:39:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 12-13 12:39:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 12-13 12:39:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 12-13 12:39:41 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

Run the following command to query the Service.
kubectl get svc

Expected results:
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)    AGE
kubernetes   ClusterIP      10.0.0.1       <none>            443/TCP    3h38m
svc-llm      LoadBalancer   10.0.143.103   112.xxx.xxx.177   8000/TCP   58s

The EXTERNAL-IP column displays the IP address that is used to expose the Service. Record this IP address for testing.
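Instead of copying the IP address manually, you can also extract it with a jsonpath query and store it in an environment variable for the test step:

export EXTERNAL_IP=$(kubectl get svc svc-llm -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $EXTERNAL_IP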
Test the LLM inference service
Run the following command to send a prompt to the inference service.
export EXTERNAL_IP=112.xxx.xxx.177
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "/mnt/Qwen2.5-7B-Instruct", "messages": [ {"role": "system", "content": "You are a friendly AI assistant"}, {"role": "user", "content": "Introduce deep learning"} ]}'Expected results:
{"id":"chat-edab465b4b5547bda7xxxxxxxxxxxxxxx","object":"chat.completion","created":1734094178,"model":"/mnt/Qwen2.5-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning that focuses on utilizing neural networks to process and study large amounts of data to enable computers to learn and make decisions as humans. Deep learning consists of CNN, RNN, and LSTM. These models can recognize and classify images, extract features, perform natural language processing, and event run speech recognition tasks. Deep learning applies to a wide array of industries, including image recognition, speech recognition, natural language processing, and computer vision.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":24,"total_tokens":130,"completion_tokens":106}}% The result indicates that the LLM inference service is deployed based on ACS GPU compute power.
References
For more information about vLLM, see vllm-project.