Use ACS GPU compute power to create LLM inference services

Last Updated: Feb 21, 2025

Container Compute Service (ACS) provides efficient and flexible container management and orchestration capabilities to enable you to deploy large language models (LLMs) and run LLM inference services. This topic describes how to deploy an LLM inference service from a container image in ACS.

Preparations

  1. Obtain the information about the container image that is used to deploy the LLM inference service in ACS.

    1. In the left-side navigation pane, click Artifact Center.

    2. In the Repository Name search box, enter llm-inference and find the egslingjun/llm-inference or egslingjun/inference-nv-pytorch image.

      The egslingjun/llm-inference and egslingjun/inference-nv-pytorch images support the vLLM inference library and the DeepGPU-LLM inference engine. They help you quickly set up an inference environment for LLMs such as Llama, ChatGLM, Baichuan, and Qwen models. The images are updated every month. The following list describes the available container images.

      Image name: llm-inference

      Image tag: vllm0.4.2-deepgpu-llm24.5-pytorch2.3.0-cuda12.1-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.3.0
      • CUDA 12.1
      • vLLM 0.4.2
      • deepgpu-llm 24.5+pt2.3cu121
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.4.2-deepgpu-llm24.5-pytorch2.3.0-cuda12.1-ubuntu22.04

      Image tag: vllm0.4.3-deepgpu-llm24.6-pytorch2.4.0-cuda12.4-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.4.0
      • CUDA 12.4.1
      • vLLM 0.4.3
      • deepgpu-llm 24.6+pt2.4cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.4.3-deepgpu-llm24.6-pytorch2.4.0-cuda12.4-ubuntu22.04

      Image tag: vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.4.0
      • CUDA 12.4.1
      • vLLM 0.5.4
      • deepgpu-llm 24.7.2+pt2.4cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04

      Image tag: vllm0.6.3-post1-deepgpu-llm24.9-pytorch2.4.0-cuda12.4-ubuntu22.04
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.4.0
      • CUDA 12.4.1
      • vLLM 0.6.3.post1
      • deepgpu-llm 24.9+pt2.4cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.6.3-post1-deepgpu-llm24.9-pytorch2.4.0-cuda12.4-ubuntu22.04

      Image tag: vllm0.6.4.post1-deepgpu-llm24.10-pytorch2.5.1-cuda12.4-ubuntu22.04-201412
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.5.1
      • CUDA 12.4.1
      • vLLM 0.6.4.post1
      • deepgpu-llm 24.10+pt2.5cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.6.4.post1-deepgpu-llm24.10-pytorch2.5.1-cuda12.4-ubuntu22.04-201412

      Image name: inference-nv-pytorch

      Image tag: 25.01-vllm0.6.5-deepgpu-llm24.10-pytorch2.5-cuda12.4-20250121
      Component information:
      • Base image: Ubuntu 22.04
      • Python 3.10
      • Torch 2.5.1
      • CUDA 12.4.1
      • vLLM 0.6.5
      • deepgpu-llm 24.10+pt2.5cu124
      Image address: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.01-vllm0.6.5-deepgpu-llm24.10-pytorch2.5-cuda12.4-20250121

  2. Prepare a NAS file system. LLM model files require a large amount of disk space, so we recommend that you create a NAS volume to store them. For more information, see Mount a statically provisioned NAS volume or Mount a dynamically provisioned NAS volume. In this example, a statically provisioned NAS volume is used and the suggested volume size is 20 GiB.

    Create a PVC based on the following parameters. For more information, see Create a PVC. After the PVC is created, you can verify it by using the commands shown after this list.

    Name: nas-test
    Allocation Mode: Select Use Mount Target Domain Name.
    Capacity: 20 GiB
    Mount Target Domain Name: 08cxxxxxxec-wxxxxxxcn-hangzhou.nas.aliyuncs.com

  3. Mount the NAS file system to an ECS instance. For more information, see Mount an NFS file system in the NAS console.

    Run the following commands to download the model to the NAS file system. After the download completes, you can check the model files by using the commands shown after this list.

    cd /mnt
    pip install modelscope
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
    Note
    • The Qwen2.5-7B model in this example is about 15 GB in size and takes about 20 minutes to download at 100 Mbit/s. You can choose a more lightweight model, such as Qwen2.5-3B or Qwen2.5-0.5B.

    • modelscope requires Python 3.10 or later. We recommend that you purchase an ECS instance that runs Ubuntu 22.04 because it uses Python 3.10 by default. You can also specify a public or custom image that uses Python 3.10 or later.
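(Optional) To confirm that the PVC from the NAS preparation is created and bound before you deploy the workload, you can query it with kubectl. The following commands are a minimal sketch and assume the PVC name nas-test and the default namespace used in this example.

kubectl get pvc nas-test -n default        # Check that the PVC exists and that its STATUS is Bound.
kubectl describe pvc nas-test -n default   # Show the events of the PVC if it is not bound, for example, when the mount target is incorrect.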
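(Optional) To confirm that the model download completed, check the files on the ECS instance that mounts the NAS file system. The following commands are a minimal sketch; the path /mnt/Qwen2.5-7B-Instruct matches the download command above, so adjust it if you downloaded a different model.

du -sh /mnt/Qwen2.5-7B-Instruct    # Check the total size of the model directory. Qwen2.5-7B-Instruct is about 15 GB.
ls -lh /mnt/Qwen2.5-7B-Instruct    # List the model files. A complete download contains config.json, tokenizer files, and *.safetensors shards.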

Deploy the LLM inference service

  1. Use kubectl to connect to the ACS cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Create a file named your-deployment.yaml based on the following content.

    Note

    For more information about the GPU model used in this example, see GPU models.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: my-deployment
      name: my-deployment
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: my-deployment
      template:
        metadata:
          labels:
            # Set the GPU model to example-model. The value is for reference only.
            alibabacloud.com/gpu-model-series: example-model
            # Set the compute class to gpu.
            alibabacloud.com/compute-class: gpu
            # If you want to use BestEffort pods, set the following parameter to best-effort.
            alibabacloud.com/compute-qos: default
            app: my-deployment
        spec:
          containers:
            - command:
              - sh
              - -c
              - python3 -m vllm.entrypoints.openai.api_server --model /mnt/Qwen2.5-7B-Instruct --trust-remote-code --tensor-parallel-size 1  --disable-custom-all-reduce
              image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04
              imagePullPolicy: IfNotPresent #Always
              name: my-deployment
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  nvidia.com/gpu: "1"
                requests:
                  cpu: 16
                  memory: 64Gi
                  nvidia.com/gpu: "1"
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /mnt
                  name: nas-test
                - mountPath: /dev/shm
                  name: cache-volume
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: nas-test
              persistentVolumeClaim:
                # The name of the PVC that you created for the NAS volume.
                claimName: nas-test
            - name: cache-volume
              emptyDir:
                medium: Memory
                sizeLimit: 64G
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "internet"
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-ip-version: ipv4
      labels:
        app: my-deployment
      name: svc-llm
      namespace: default
    spec:
      externalTrafficPolicy: Local
      ports:
      - name: serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: my-deployment
      type: LoadBalancer
  3. Run the following command to deploy the LLM inference task and Service.

    kubectl apply -f your-deployment.yaml
  4. It takes about 20 minutes to deploy the model because the image is large. You can run the following command to query the deployment progress. Replace my-deployment-787b8xxxxx-xxxxx with the name of the pod in your cluster. (See the helper commands after this list for how to find the pod name.)

    kubectl logs -f my-deployment-787b8xxxxx-xxxxx

    If the following output is returned, the model is successfully deployed.

    INFO:     Started server process [2]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
    INFO 12-13 12:39:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-13 12:39:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-13 12:39:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-13 12:39:41 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
  5. Run the following command to query the Service.

    kubectl get svc

    Expected results:

    NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)    AGE
    kubernetes   ClusterIP      10.0.0.1       <none>           443/TCP    3h38m
    svc-llm      LoadBalancer   10.0.143.103   112.xxx.xxx.177  8000/TCP   58s

    EXTERNAL-IP displays the public IP address that exposes the Service. Record this IP address for testing. You can also read the IP address directly by using the helper commands after this list.
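The following optional helper commands relate to steps 4 and 5. They are a minimal sketch based on the Deployment and Service names used in this example (my-deployment and svc-llm in the default namespace).

# Find the name of the pod created by the Deployment. Use this name with the kubectl logs command in step 4.
kubectl get pods -n default -l app=my-deployment

# Alternatively, follow the logs of the Deployment without looking up the pod name.
kubectl logs -f deployment/my-deployment -n default

# Read the EXTERNAL-IP of the Service directly. The jsonpath expression assumes that the load balancer exposes an IPv4 address.
kubectl get svc svc-llm -n default -o jsonpath='{.status.loadBalancer.ingress[0].ip}'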

Test the LLM inference service

Run the following commands to send a prompt to the inference service. Replace 112.xxx.xxx.177 with the EXTERNAL-IP value that you recorded.

export EXTERNAL_IP=112.xxx.xxx.177
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
     -H "Content-Type: application/json"    \
     -d '{    "model": "/mnt/Qwen2.5-7B-Instruct",      "messages": [   {"role": "system", "content": "You are a friendly AI assistant"},   {"role": "user", "content": "Introduce deep learning"}    ]}'

Expected results:

{"id":"chat-edab465b4b5547bda7xxxxxxxxxxxxxxx","object":"chat.completion","created":1734094178,"model":"/mnt/Qwen2.5-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning that focuses on utilizing neural networks to process and study large amounts of data to enable computers to learn and make decisions as humans. Deep learning consists of CNN, RNN, and LSTM. These models can recognize and classify images, extract features, perform natural language processing, and event run speech recognition tasks. Deep learning applies to a wide array of industries, including image recognition, speech recognition, natural language processing, and computer vision.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":24,"total_tokens":130,"completion_tokens":106}}%  

The response indicates that the LLM inference service is running on ACS GPU compute power.
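You can also send other requests to test the service. The image starts the vLLM OpenAI-compatible API server, so the following sketch uses two additional endpoints that vLLM provides; it assumes the same EXTERNAL_IP variable that you exported above.

# List the models that the server serves. The model name equals the path that is passed to --model.
curl http://$EXTERNAL_IP:8000/v1/models

# Send the chat request with streaming enabled. Tokens are returned incrementally as server-sent events.
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "/mnt/Qwen2.5-7B-Instruct", "stream": true, "messages": [{"role": "user", "content": "Introduce deep learning"}]}'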

References

For more information about vLLM, see vllm-project.