Use ACS GPU compute power to create LLM inference services

Last Updated: Feb 21, 2025

Container Compute Service (ACS) provides efficient and flexible container management and orchestration capabilities to enable you to deploy large language models (LLMs) and run LLM inference services. This topic describes how to deploy an LLM inference service from a container image in ACS.

Preparations

  1. Obtain the information about the container image that is used to deploy the LLM inference service in ACS.

    1. In the left-side navigation pane, click Artifact Center.

    2. In the Repository Name search box, enter llm-inference and find the egslingjun/llm-inference or egslingjun/inference-nv-pytorch image.

      The egslingjun/llm-inference and egslingjun/inference-nv-pytorch images support the vLLM inference library and the DeepGPU-LLM inference engine. They help you quickly set up an inference environment for LLMs such as Llama, ChatGLM, Baichuan, and Qwen models. The images are updated every month. The following list describes the available container images.

      Image name: llm-inference

      Image tag: vllm0.4.2-deepgpu-llm24.5-pytorch2.3.0-cuda12.1-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.3.0
      • CUDA 12.1
      • vLLM 0.4.2
      • deepgpu-llm 24.5+pt2.3cu121
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.4.2-deepgpu-llm24.5-pytorch2.3.0-cuda12.1-ubuntu22.04

      Image tag: vllm0.4.3-deepgpu-llm24.6-pytorch2.4.0-cuda12.4-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.4.0
      • CUDA 12.4.1
      • vLLM 0.4.3
      • deepgpu-llm 24.6+pt2.4cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.4.3-deepgpu-llm24.6-pytorch2.4.0-cuda12.4-ubuntu22.04

      Image tag: vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.4.0
      • CUDA 12.4.1
      • vLLM 0.5.4
      • deepgpu-llm 24.7.2+pt2.4cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04

      Image tag: vllm0.6.3-post1-deepgpu-llm24.9-pytorch2.4.0-cuda12.4-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.4.0
      • CUDA 12.4.1
      • vLLM 0.6.3.post1
      • deepgpu-llm 24.9+pt2.4cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.6.3-post1-deepgpu-llm24.9-pytorch2.4.0-cuda12.4-ubuntu22.04

      Image tag: vllm0.6.4.post1-deepgpu-llm24.10-pytorch2.5.1-cuda12.4-ubuntu22.04-201412
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.5.1
      • CUDA 12.4.1
      • vLLM 0.6.4.post1
      • deepgpu-llm 24.10+pt2.5cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.6.4.post1-deepgpu-llm24.10-pytorch2.5.1-cuda12.4-ubuntu22.04-201412

      Image name: inference-nv-pytorch

      Image tag: 25.01-vllm0.6.5-deepgpu-llm24.10-pytorch2.5-cuda12.4-20250121
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.5.1
      • CUDA 12.4.1
      • vLLM 0.6.5
      • deepgpu-llm 24.10+pt2.5cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.01-vllm0.6.5-deepgpu-llm24.10-pytorch2.5-cuda12.4-20250121

  2. Prepare a NAS file system. LLM model files require a large amount of disk space, so we recommend that you create a NAS volume to store them. For more information, see Mount a statically provisioned NAS volume or Mount a dynamically provisioned NAS volume. In this example, a statically provisioned NAS volume is used and the suggested volume size is 20 GiB.

    Create a PVC based on the following parameters. For more information, see Create a PVC. After the PVC is created, you can verify it by using the commands shown after this list.

    Name: nas-test
    Allocation Mode: Select Use Mount Target Domain Name.
    Capacity: 20 GiB
    Mount Target Domain Name: 08cxxxxxxec-wxxxxxxcn-hangzhou.nas.aliyuncs.com

  3. Mount the NAS file system to an ECS instance. For more information, see Mount an NFS file system in the NAS console.

    Run the following commands to download the model to the NAS file system. After the download completes, you can check the model files by using the commands shown after this list.

    cd /mnt
    pip install modelscope
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
    Note
    • The Qwen2.5-7B model in this example is about 15 GB in size and takes about 20 minutes to download at 100 Mbit/s. You can choose a more lightweight model, such as Qwen2.5-3B or Qwen2.5-0.5B.

    • modelscope requires Python 3.10 or later. We recommend that you purchase an ECS instance that runs Ubuntu 22.04 because it uses Python 3.10 by default. You can also specify a public or custom image that uses Python 3.10 or later.
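(Optional) To confirm that the PVC from the NAS preparation is created and bound before you deploy the workload, you can query it with kubectl. The following commands are a minimal sketch and assume the PVC name nas-test and the default namespace used in this example.

kubectl get pvc nas-test -n default        # Check that the PVC exists and that its STATUS is Bound.
kubectl describe pvc nas-test -n default   # Show the events of the PVC if it is not bound, for example, when the mount target is incorrect.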
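(Optional) To confirm that the model download completed, check the files on the ECS instance that mounts the NAS file system. The following commands are a minimal sketch; the path /mnt/Qwen2.5-7B-Instruct matches the download command above, so adjust it if you downloaded a different model.

du -sh /mnt/Qwen2.5-7B-Instruct    # Check the total size of the model directory. Qwen2.5-7B-Instruct is about 15 GB.
ls -lh /mnt/Qwen2.5-7B-Instruct    # List the model files. A complete download contains config.json, tokenizer files, and *.safetensors shards.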

Deploy the LLM inference service

  1. Use kubectl to connect to the ACS cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Create a file named your-deployment.yaml based on the following content.

    Note

    For more information about the GPU model used in this example, see GPU models.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: my-deployment
      name: my-deployment
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: my-deployment
      template:
        metadata:
          labels:
            # Set the GPU model to example-model. The value is for reference only.
            alibabacloud.com/gpu-model-series: example-model
            # Set the compute class to gpu.
            alibabacloud.com/compute-class: gpu
            # If you want to use BestEffort pods, set the following parameter to best-effort.
            alibabacloud.com/compute-qos: default
            app: my-deployment
        spec:
          containers:
            - command:
              - sh
              - -c
              - python3 -m vllm.entrypoints.openai.api_server --model /mnt/Qwen2.5-7B-Instruct --trust-remote-code --tensor-parallel-size 1  --disable-custom-all-reduce
              image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04
              imagePullPolicy: IfNotPresent #Always
              name: my-deployment
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  nvidia.com/gpu: "1"
                requests:
                  cpu: 16
                  memory: 64Gi
                  nvidia.com/gpu: "1"
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /mnt
                  name: nas-test
                - mountPath: /dev/shm
                  name: cache-volume
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: nas-test
              persistentVolumeClaim:
                # The name of the PVC that you created for the NAS volume.
                claimName: nas-test
            - name: cache-volume
              emptyDir:
                medium: Memory
                sizeLimit: 64G
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "internet"
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-ip-version: ipv4
      labels:
        app: my-deployment
      name: svc-llm
      namespace: default
    spec:
      externalTrafficPolicy: Local
      ports:
      - name: serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: my-deployment
      type: LoadBalancer
  3. Run the following command to deploy the LLM inference task and Service.

    kubectl apply -f your-deployment.yaml
  4. It takes about 20 minutes to deploy the model because the image is large. You can run the following command to query the deployment progress. Replace my-deployment-787b8xxxxx-xxxxx with the name of the pod in your cluster. (See the helper commands after this list for how to find the pod name.)

    kubectl logs -f my-deployment-787b8xxxxx-xxxxx

    If the following output is returned, the model is successfully deployed.

    INFO:     Started server process [2]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
    INFO 12-13 12:39:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-13 12:39:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-13 12:39:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-13 12:39:41 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
  5. Run the following command to query the Service.

    kubectl get svc

    Expected results:

    NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)    AGE
    kubernetes   ClusterIP      10.0.0.1       <none>           443/TCP    3h38m
    svc-llm      LoadBalancer   10.0.143.103   112.xxx.xxx.177  8000/TCP   58s

    EXTERNAL-IP displays the public IP address that exposes the Service. Record this IP address for testing. You can also read the IP address directly by using the helper commands after this list.
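The following optional helper commands relate to steps 4 and 5. They are a minimal sketch based on the Deployment and Service names used in this example (my-deployment and svc-llm in the default namespace).

# Find the name of the pod created by the Deployment. Use this name with the kubectl logs command in step 4.
kubectl get pods -n default -l app=my-deployment

# Alternatively, follow the logs of the Deployment without looking up the pod name.
kubectl logs -f deployment/my-deployment -n default

# Read the EXTERNAL-IP of the Service directly. The jsonpath expression assumes that the load balancer exposes an IPv4 address.
kubectl get svc svc-llm -n default -o jsonpath='{.status.loadBalancer.ingress[0].ip}'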

Test the LLM inference service

Run the following commands to send a prompt to the inference service. Replace 112.xxx.xxx.177 with the EXTERNAL-IP value that you recorded.

export EXTERNAL_IP=112.xxx.xxx.177
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
     -H "Content-Type: application/json"    \
     -d '{    "model": "/mnt/Qwen2.5-7B-Instruct",      "messages": [   {"role": "system", "content": "You are a friendly AI assistant"},   {"role": "user", "content": "Introduce deep learning"}    ]}'

Expected results:

{"id":"chat-edab465b4b5547bda7xxxxxxxxxxxxxxx","object":"chat.completion","created":1734094178,"model":"/mnt/Qwen2.5-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning that focuses on utilizing neural networks to process and study large amounts of data to enable computers to learn and make decisions as humans. Deep learning consists of CNN, RNN, and LSTM. These models can recognize and classify images, extract features, perform natural language processing, and event run speech recognition tasks. Deep learning applies to a wide array of industries, including image recognition, speech recognition, natural language processing, and computer vision.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":24,"total_tokens":130,"completion_tokens":106}}%  

The response indicates that the LLM inference service is running on ACS GPU compute power.
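You can also send other requests to test the service. The image starts the vLLM OpenAI-compatible API server, so the following sketch uses two additional endpoints that vLLM provides; it assumes the same EXTERNAL_IP variable that you exported above.

# List the models that the server serves. The model name equals the path that is passed to --model.
curl http://$EXTERNAL_IP:8000/v1/models

# Send the chat request with streaming enabled. Tokens are returned incrementally as server-sent events.
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "/mnt/Qwen2.5-7B-Instruct", "stream": true, "messages": [{"role": "user", "content": "Introduce deep learning"}]}'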

References

For more information about vLLM, see vllm-project.