Container Compute Service: Use ACS GPU compute power to deploy a model inference service based on the DeepSeek full version

Last Updated: Mar 07, 2025

Container Compute Service (ACS) does not require deep knowledge of the underlying hardware or the management of GPU-accelerated nodes. All configurations work out of the box, deployment is simple, and billing is pay-as-you-go, which makes ACS well suited to LLM inference services and can efficiently reduce inference costs. This topic describes how to deploy a model inference service based on the DeepSeek full version in ACS.

Background information

DeepSeek-R1

DeepSeek-R1 is the first-generation reasoning model provided by DeepSeek. It is designed to improve the inference performance of LLMs through large-scale reinforcement learning. Benchmarks show that DeepSeek-R1 outperforms other closed-source models in mathematical reasoning and programming competitions, and even reaches or surpasses the OpenAI-o1 series in certain areas. DeepSeek-R1 also performs impressively in knowledge-related tasks such as creative writing and Q&A. For more information about DeepSeek, see the DeepSeek AI GitHub repository.

vLLM

vLLM is a high-performance and easy-to-use LLM inference serving framework. vLLM supports most commonly used LLMs, including the Qwen models. vLLM uses technologies such as PagedAttention, continuous batching, and model quantization to greatly improve the inference efficiency of LLMs. For more information about the vLLM framework, see the vLLM GitHub repository.

ACS

ACS was released in 2023 and focuses on consistently delivering inclusive, easy-to-use, elastic, and flexible next-generation container compute power. ACS provides general-purpose and heterogeneous compute power that complies with Kubernetes specifications, and offers serverless container compute resources that eliminate the need to manage nodes and clusters. You can integrate scheduling, container runtime, storage, and networking capabilities with ACS to reduce the O&M complexity of Kubernetes and improve the elasticity and flexibility of container compute power. With pay-as-you-go billing, elastic instances, and flexible capabilities, ACS can greatly reduce resource costs. In LLM inference scenarios, ACS can also accelerate data and image loading to further reduce model launch time and resource costs.

Prerequisites

GPU-accelerated instance specification and estimated cost

Deploying the DeepSeek-R1 full version in ACS requires a large amount of GPU resources: the inference service uses 16 GPUs, which consumes 16 GPU hours for each hour the service runs. Suggested ACS GPU-accelerated instance specification: 16 GPUs (96 GiB of memory per GPU), 64 vCPUs, and 512 GiB of memory. You can also refer to the Table of suggested specifications and the GPU models and specifications. For more information about the billing of ACS GPU-accelerated instances, see Billing overview.

Note
  • Make sure that the specification of the ACS GPU-accelerated instance complies with ACS pod specification adjustment logic.

  • By default, an ACS pod provides 30 GiB of free EphemeralStorage. The inference image used in this example is large. If you need more storage space, customize the size of the EphemeralStorage. For more information, see Add the EphemeralStorage.
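
For the EphemeralStorage adjustment mentioned in the note above, the following is a minimal sketch. It assumes that ACS pods accept the standard Kubernetes ephemeral-storage resource field described in Add the EphemeralStorage; the 100Gi size is only an illustrative value.

    # Minimal sketch: request a larger ephemeral storage size for a container.
    # Assumption: ACS honors the standard Kubernetes ephemeral-storage resource
    # field (see Add the EphemeralStorage). 100Gi is an illustrative value.
    resources:
      requests:
        ephemeral-storage: 100Gi
      limits:
        ephemeral-storage: 100Gi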

Procedure

Step 1: Prepare the DeepSeek-R1-GPTQ-INT8 model files

The LLM requires large amounts of disk space to store model files. We recommend that you use a NAS or OSS volume to persist the model files. In the following example, an OSS volume is used to persist the DeepSeek-R1-GPTQ-INT8 model files.

Note

Submit a ticket to obtain the model files and YAML content.

  • Model file: DeepSeek-R1-GPTQ-INT8.

  • GPU model: Replace the variable in the alibabacloud.com/gpu-model-series: <example-model> label with the actual GPU model supported by ACS. For more information, see Specify GPU models and driver versions for ACS GPU-accelerated pods.

  • Base image: Replace the variable in containers[].image: <base image obtained from the PDSA> with the actual image address.

  • Secret for pulling images: Obtain and create a Secret, and replace the variable in imagePullSecrets[].name: <Secret obtained from the PDSA> with the actual name of the Secret.

  1. (Optional) If you choose to download the model files to your local environment, create a directory in your OSS bucket and upload the model files to the directory.

    Note

    To install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8
    ossutil cp -r /mnt/models/DeepSeek-R1-GPTQ-INT8 oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8
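
    (Optional) To confirm that the upload succeeded, list the directory:

    ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8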
  2. Create a PV named llm-model and a PVC for the cluster. For more information, see Mount a statically provisioned OSS volume.

    Use the console

    The following parameters are used to create the PV.

      • PV Type: OSS

      • Volume Name: llm-model

      • Access Certificate: Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.

      • Bucket ID: Select the OSS bucket that you created in the previous step.

      • OSS Path: Select the path of the model, such as /models/DeepSeek-R1-GPTQ-INT8.

    The following parameters are used to create the PVC.

      • PVC Type: OSS

      • Name: llm-model

      • Allocation Mode: In this example, Existing Volumes is selected.

      • Existing Volumes: Click Existing Volumes and select the PV that you created.

    Use kubectl

    The following code block shows the YAML template:

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
      akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi 
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <your-bucket-name> # The name of the OSS bucket.
          url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <your-model-path> # The model path, such as /models/DeepSeek-R1-GPTQ-INT8/ in this example.
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
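
    If you use the kubectl method, save the template (for example, as llm-model.yaml; the file name is illustrative) and apply it, then confirm that the PV and PVC are bound:

    kubectl apply -f llm-model.yaml
    kubectl get pv llm-model
    kubectl get pvc llm-model

    Both resources should show a STATUS of Bound.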

Step 2: Deploy the model based on ACS GPU compute power

  1. Use the following YAML template to deploy the DeepSeek-R1 model inference service based on the vLLM framework.

    The inference service exposes an OpenAI-compatible HTTP API. The model parameter files are treated as a special dataset and mounted to a specified path (/data/DeepSeek-R1-GPTQ-INT8) in the container that runs the inference service. --max-model-len specifies the maximum context length, in tokens, that the model can process. A larger value allows the model to handle longer prompts and outputs, but it also increases GPU memory usage. We recommend that you set it to a value around 128000 for the DeepSeek-R1-GPTQ-INT8 model and adjust --gpu-memory-utilization accordingly.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: deepseek-r1
      namespace: default
      labels:
        app: deepseek-r1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: deepseek-r1
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
      template:
        metadata:
          labels:
            app: deepseek-r1
            alibabacloud.com/compute-class: gpu
            # example-model indicates the GPU model. Replace it with the actual GPU model, such as T4.
            alibabacloud.com/gpu-model-series: <example-model>
        spec:
          imagePullSecrets:
          - name: <Secret obtained from the PDSA>
          containers:
          - name: llm-ds-r1
            image: <base image obtained from the PDSA>
            imagePullPolicy: IfNotPresent
            command:
            - sh
            - -c
            - "vllm serve /data/DeepSeek-R1-GPTQ-INT8 --port 8000 --trust-remote-code --served-model-name ds --max-model-len 128000 --quantization moe_wna16 --gpu-memory-utilization 0.98 --tensor-parallel-size 16"
            resources:
              limits:
                alibabacloud.com/gpu: "16"
                cpu: "64"
                memory: 512Gi
              requests:
                alibabacloud.com/gpu: "16"
                cpu: "64"
                memory: 512Gi
            volumeMounts:
            - name: llm-model
              mountPath: /data/DeepSeek-R1-GPTQ-INT8
            - name: shm
              mountPath: /dev/shm
          restartPolicy: Always
          terminationGracePeriodSeconds: 30
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: deepseek-r1
    spec:
      type: ClusterIP
      selector:
        app: deepseek-r1
      ports:
        - protocol: TCP
          port: 8000
          targetPort: 8000 
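
    Save the template (for example, as deepseek-r1.yaml; the file name is illustrative) and apply it, then wait for the pod to become ready:

    kubectl apply -f deepseek-r1.yaml
    kubectl get pods -l app=deepseek-r1 -w

    Loading a model of this size can take a while. You can run kubectl logs -f deployment/deepseek-r1 and wait until the vLLM server reports that it is serving on port 8000 before you verify the service.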

Step 3: Verify the inference service

  1. Run kubectl port-forward to configure port forwarding between the local environment and the inference service.

    Note

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or extensible in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.

    kubectl port-forward svc/deepseek-r1 8000:8000

    Expected results:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
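
    Before sending inference requests, you can optionally confirm that the API is reachable. The /v1/models endpoint is part of the OpenAI-compatible API that vLLM exposes and should list the served model name ds:

    curl http://localhost:8000/v1/models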
  2. Send requests to the inference service.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ds",
        "messages": [
          {
            "role": "user",
            "content": "Write a letter to my daughter from the future 2035 and tell her to study science and technology well, be the master of science and technology, and promote the development of science and technology and economy. She is now in grade 3."
          }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10
      }'

    Expected results:

    {"id":"chatcmpl-53613fd815da46df92cc9b92cd156146","object":"chat.com pletion","created":1739261570,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOK. The user needs to write a letter to his third-grade daughter from 2035 in the future, and convey three key messages at the same time: learn technology well, be the master of technology, and promote technology and economic development. First, I have to consider that the tone of the letter should be kind and hopeful, while reflecting a sense of future technology. \n\nConsidering that the daughter is now in the third grade, the content should not be too complicated and the language should be simple and easy to understand. At the same time, let the daughter feel the importance of science and technology and spike her interests in science and technology. It may be necessary to start from her daily life and give some examples that she may have come into contact with, such as electronic products, the Internet, etc., so that she can resonate more easily. \n\nNext, I have to think about how to structure this letter. It may start with welcoming her to receive this letter, and then introduce the development of future technology, such as smart robots and smart homes. Then it emphasizes the importance of learning science and technology, and encourages her to become the master of science and technology and participate in the development of science and technology. Finally, express the expectations and blessings. \n\nIn terms of content, it is necessary to highlight the impact of technology on life, such as smart assistants, smart homes, new energy vehicles, etc. These are all children may have heard of, but the specific details may need to be simplified to avoid being too technical and keep them interesting. \n\nAt the same time, the letter should mention the impact of science and technology on the economy, such as economic growth, job creation, etc., but it should be presented in a positive and encouraging way, so that the daughter can feel the benefits of science and technology, rather than a simple digital game. \n\nFinally, the ending part should be warm, express her pride and expectation, and encourage her to pursue the future bravely and become a leader in science and technology. \n\nIn general, this letter needs to be educational, interesting and encouraging, using simple and clear language, combined with specific examples of future technology, so that my daughter can feel the charm of technology and the importance of learning in a relaxed reading. \n</think>\n\nDear Future 2035: \n\nHello!  \n\nFirst, I want to tell you a good news: the earth has entered a new era!  By 2035, technology will no longer be the story of science fiction, but part of our every day life. Today, I am writing this letter to tell you some secrets about the future and how you should live and learn in this world of rapid development of science and technology. \n\n### 1. **Technology is around you**\n In 2035, technology is everywhere. Each of us can have an intelligent assistant, like an always-available teacher, ready to answer your questions. With a simple app, you can control the smart home devices in your home: turn on and off the lights, adjust the temperature, and even cook, all on your instruction!  \n   \n   Also, you may have heard about it: intelligent robots. These robots can not only help us to complete the tedious work, but also play a great part in learning and entertainment. 
They can chat with you, study with you, and even help you solve math problems!  Imagine that when you encounter a difficult problem, the robot will patiently teach you how to solve the problem step by step, isn't it great?  \n\n### 2. ** the importance of learning science and technology **\n in the future 2035, science and technology has become the main driving force to promote social development. Every industry is being transformed by technology: doctors can use advanced medical equipment early to detect illnesses; teachers can use online classrooms to enable students to learn global knowledge without leaving home; farmers can use smart devices to accurately manage their fields and ensure that every tree receives the best care. So, I want to tell you that learning technology is the most important task for every child. Science and technology can not only make you master more knowledge, but also make you become the future master. You will have the opportunity to create new technologies and change our lives!  \n\n### 3. **Be the master of science and technology**\n In 2035, the world of science and technology needs everyone's strength. You don't need to be a company executive, just be yourself. You can use your wisdom and hands to promote the development of science and technology. For example, you can participate in technological innovation competitions in schools and design smarter robots; you can invent some small inventions at home to make life more convenient. \n\n   It is important that you have the courage to try new things and explore the unknown. The world of science and technology is infinitely vast, and everyone can find their place here. \n\n### 4. ** About Economy **\n In 2035, the economy will become more prosperous due to the development of science and technology. Smart cities will make our lives more efficient, new energy vehicles will make our travel more environmentally friendly, and medical technology will better protect our health. \n\n   So, when you stand at the beginning of this era, you should know that technology is not only changing the way we live, but also creating opportunities for the future. \n\n### 5. **My expectations**\n    I hope that in the future you can love science and technology, understand science and technology, master science and technology. Not only do you have to learn how to use technology, but you have to understand the principles and the stories behind it. When you grow up, you may become a leader in the field of science and technology, leading us to a brighter future. \n\n   The future world needs you!  Are you ready for the challenge?  \n\nFinally, I want to tell you that you are smarter, braver and more potential than anyone else today. Although the road ahead is very long, as long as you are willing to work hard, you will certainly be able to realize your dream. \n\nDear daughter in 2035, fight!  \n \nYour grandpa ","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":40,"total_tokens":1034,"completion_tokens":994,"prompt_tokens_details":null}"
    ,
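
    The OpenAI-compatible API also supports streaming responses. The following optional request is a sketch that sets "stream": true so that tokens are returned incrementally as server-sent events; the prompt is shortened for brevity:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ds",
        "messages": [
          {
            "role": "user",
            "content": "Hello!"
          }
        ],
        "max_tokens": 64,
        "stream": true
      }'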

(Optional) Step 4: Delete the environment

If you no longer need the inference service, delete the environment promptly.

  1. Delete the inference workload and service.

    kubectl delete deployment deepseek-r1
    kubectl delete service deepseek-r1
  2. Delete the PV and PVC.

    kubectl delete pvc llm-model
    kubectl delete pv llm-model

    Expected results:

    persistentvolumeclaim "llm-model" deleted
    persistentvolume "llm-model" deleted
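
  3. (Optional) If you no longer need the oss-secret Secret that was created for the OSS volume, delete it as well.

    kubectl delete secret oss-secret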

References