
Container Compute Service: Use ACS GPU compute power to deploy a model inference service based on a DeepSeek distilled model

Last Updated: Sep 09, 2025

Container Compute Service (ACS) provides out-of-the-box GPU compute power, so you do not need deep knowledge of the underlying hardware or to manage GPU-accelerated nodes. ACS is easy to deploy and is billed on a pay-as-you-go basis, which makes it suitable for LLM inference services and can effectively reduce inference costs. This topic describes how to deploy a model inference service based on a DeepSeek distilled model in ACS.

Background information

DeepSeek-R1

DeepSeek-R1 is the first-generation reasoning model released by DeepSeek. It is designed to improve the reasoning performance of LLMs through large-scale reinforcement learning. Benchmark results show that DeepSeek-R1 outperforms other closed-source models in mathematical reasoning and programming competitions, and its performance even reaches or surpasses the OpenAI o1 series in certain areas. DeepSeek-R1 also performs remarkably well in knowledge-related tasks, such as creative writing and Q&A. In addition, DeepSeek distills its reasoning capabilities into smaller models, such as Qwen and Llama, to improve their reasoning performance. The 14B model distilled from DeepSeek surpasses the open source QwQ-32B model, and the 32B and 70B distilled models also set new records. For more information about DeepSeek, see the DeepSeek AI GitHub repository.

vLLM

vLLM is a high-performance, easy-to-use LLM inference serving framework. vLLM supports most commonly used LLMs, including the Qwen series of models. Powered by techniques such as PagedAttention, continuous batching, and model quantization, vLLM greatly improves the inference efficiency of LLMs. For more information, see the vLLM GitHub repository.

Arena

Arena is a lightweight command-line client for managing Kubernetes-based machine learning jobs. Arena streamlines data preparation, model development, model training, and model prediction throughout the machine learning lifecycle, which improves the efficiency of data scientists. Arena is also deeply integrated with the basic services of Alibaba Cloud: it supports GPU sharing and Cloud Parallel File System (CPFS), and it can run in deep learning frameworks optimized by Alibaba Cloud. This maximizes the performance and utilization of the heterogeneous computing resources provided by Alibaba Cloud. For more information about Arena, see the Arena GitHub repository.

Prerequisites

GPU-accelerated instance specification and estimated cost

GPU memory is occupied by model parameters during the inference phase. The usage is calculated based on the following formula.

GPU memory = Number of model parameters x Bytes of precision data

Take a model with FP16 default precision and 7B (7 billion) parameters as an example. Each parameter occupies 2 bytes (a 16-bit floating-point number divided by 8 bits per byte), so:

GPU memory = 7 × 10⁹ × 2 bytes ≈ 13.04 GiB

In addition to the memory used to load the model, you also need to account for the size of the KV cache and the GPU memory utilization; a portion of GPU memory is typically reserved as a buffer. Therefore, the suggested specification for the 7B model is 1 GPU with 24 GiB of GPU memory, 8 vCPUs, and 32 GiB of memory. You can also refer to the following table of suggested specifications and to GPU models and specifications. For more information about the billing of ACS GPU-accelerated instances, see Billing overview.
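You can reproduce this estimate in a shell. The following is a minimal sketch that simply restates the formula above (any POSIX awk works); it is not an official sizing tool:

    # GPU memory ≈ number of parameters x bytes per parameter.
    # 7B parameters at FP16 (2 bytes each), converted to GiB (1 GiB = 1024^3 bytes).
    awk 'BEGIN { printf "%.2f GiB\n", 7e9 * 2 / (1024 ^ 3) }'
    # Prints: 13.04 GiB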

| Model name | Model version | Model size | Suggested vCPUs | Suggested memory | Suggested GPU memory |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B (1.5 billion parameters) | 3.55 GB | 4 or 6 | 30 GiB | 24 GiB |
| DeepSeek-R1-Distill-Qwen-7B | 7B (7 billion parameters) | 15.23 GB | 6 or 8 | 32 GiB | 24 GiB |
| DeepSeek-R1-Distill-Llama-8B | 8B (8 billion parameters) | 16.06 GB | 6 or 8 | 32 GiB | 24 GiB |
| DeepSeek-R1-Distill-Qwen-14B | 14B (14 billion parameters) | 29.54 GB | More than 8 | 64 GiB | 48 GiB |
| DeepSeek-R1-Distill-Qwen-32B | 32B (32 billion parameters) | 74.32 GB | More than 8 | 128 GiB | 96 GiB |
| DeepSeek-R1-Distill-Llama-70B | 70B (70 billion parameters) | 140.56 GB | More than 12 | 128 GiB | 192 GiB |


Procedure

Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files

Note

It usually takes 1 to 2 hours to download and upload the model. You can submit a ticket to copy the model files to your OSS bucket.

  1. Run the following command to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.

    Note

    Check whether the git-lfs plug-in is installed. If not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
    cd DeepSeek-R1-Distill-Qwen-7B/
    git lfs pull
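    Optionally, verify that the weights were fully pulled before you upload them. Assuming the ModelScope repository layout, which stores the weights as .safetensors files, the total size should be roughly 15 GB for this model (see the table above):

    du -sh .
    ls -lh *.safetensors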
  2. Create an OSS directory and upload the model files to the directory.

    Note

    To install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
    ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
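    Optionally, list the uploaded objects to confirm that the copy is complete:

    ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B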
  3. Create a PV named llm-model and a corresponding PVC for the cluster. For more information, see Mount a statically provisioned OSS volume.

    The following table describes the basic parameters that are used to create the PV.

    | Parameter | Description |
    | --- | --- |
    | PV Type | OSS |
    | Volume Name | llm-model |
    | Access Certificate | Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket. |
    | Bucket ID | Select the OSS bucket that you created in the previous step. |
    | OSS Path | Select the path of the model, such as /models/DeepSeek-R1-Distill-Qwen-7B. |

    The following table describes the basic parameters that are used to create the PVC.

    | Parameter | Description |
    | --- | --- |
    | PVC Type | OSS |
    | Name | llm-model |
    | Allocation Mode | In this example, Existing Volumes is selected. |
    | Existing Volumes | Click Existing Volumes and select the PV that you created. |

    The following code block shows the YAML template:

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
      akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi 
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <your-bucket-name> # The name of the OSS bucket.
          url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <your-model-path> # The model path, such as /models/DeepSeek-R1-Distill-Qwen-7B/ in this example.
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
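    If you use the YAML template instead of the console, save it to a file and apply it with kubectl. The file name llm-model.yaml below is illustrative:

    kubectl apply -f llm-model.yaml
    # Verify that the PV and PVC are created and that the PVC is Bound.
    kubectl get pv llm-model
    kubectl get pvc llm-model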

Step 2: Deploy the model

  1. Run the following command to deploy the DeepSeek-R1-Distill-Qwen-7B model inference service that uses the vLLM framework.

    The inference service exposes an OpenAI-compatible HTTP API. In the following code block, the --data parameter provided by the Arena client treats the model parameter files as a special dataset and mounts them to the specified path (/models/DeepSeek-R1-Distill-Qwen-7B) of the container that runs the inference service. --max-model-len specifies the maximum number of tokens (the context length) that the model can process per request. A larger value lets the service handle longer prompts and outputs, but it also increases GPU memory usage.

    Note
    • Replace <example-model> in the --label=alibabacloud.com/gpu-model-series=<example-model> flag with an actual GPU model supported by ACS. Submit a ticket to obtain the list of GPU models supported by ACS.

    • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} is the address of the public image. We recommend that you pull AI container images over a VPC to accelerate image pulling.

    arena serve custom \
    --name=deepseek-r1 \
    --version=v1 \
    --gpus=1 \
    --cpu=8 \
    --memory=32Gi \
    --replicas=1 \
    --label=alibabacloud.com/compute-class=gpu \
    --label=alibabacloud.com/gpu-model-series=<example-model> \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.02-vllm0.7.2-sglang0.4.3.post2-pytorch2.5-cuda12.4-20250305-serverless \
    --data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

    Expected results:

    service/deepseek-r1-v1 created
    deployment.apps/deepseek-r1-v1-custom-serving created
    INFO[0004] The Job deepseek-r1 has been submitted successfully
    INFO[0004] You can run `arena serve get deepseek-r1 --type custom-serving -n default` to check the job status

    The following table describes the parameters.

    | Parameter | Description |
    | --- | --- |
    | --name | The name of the inference service. |
    | --version | The version of the inference service. |
    | --gpus | The number of GPUs used by each inference service replica. |
    | --cpu | The number of vCPUs used by each inference service replica. |
    | --memory | The amount of memory used by each inference service replica. |
    | --replicas | The number of inference service replicas. |
    | --label | Add the following labels to specify ACS GPU compute power: --label=alibabacloud.com/compute-class=gpu and --label=alibabacloud.com/gpu-model-series=<example-model>. |
    | --restful-port | The port on which the inference service is exposed. |
    | --readiness-probe-action | The action type of the readiness probe. Valid values: httpGet, exec, grpc, and tcpSocket. |
    | --readiness-probe-action-option | The action options of the readiness probe. |
    | --readiness-probe-option | The readiness probe configuration. |
    | --image | The address of the inference service image. |
    | --data | Mounts a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:): specify the name of the PVC on the left side of the colon (you can run the arena data list command to view the PVCs in the current cluster) and the container path to which the PVC is mounted on the right side. The inference service reads the model from the specified path, which lets it access the data stored in the PV claimed by the PVC. |
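    For example, to confirm which PVCs Arena can mount (and that llm-model is among them), run:

    arena data list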

  2. Run the following command to query the details of the inference service:

    arena serve get deepseek-r1

    Expected results:

    Name:       deepseek-r1
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        6h
    Address:    10.0.78.27
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                            STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                            ------   ---  -----  --------  ---  ----
      deepseek-r1-v1-custom-serving-54d579d994-dqwxz  Running  1h   1/1    0         1    virtual-kubelet-cn-hangzhou-b
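    To follow the vLLM startup logs while the replica initializes, you can also query the deployment created by Arena (the deployment name comes from the arena output in the previous substep):

    kubectl logs deployment/deepseek-r1-v1-custom-serving -f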

Step 3: Verify the inference services

  1. Run kubectl port-forward to configure port forwarding between the local environment and inference service.

    Note

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.

    kubectl port-forward svc/deepseek-r1-v1 8000:8000

    Expected results:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
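    With port forwarding in place, you can optionally confirm that the service is up before sending a chat request. The vLLM OpenAI-compatible server exposes a model list endpoint, which should return the served model name deepseek-r1:

    curl http://localhost:8000/v1/models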
  2. Send requests to the inference service.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-r1",
        "messages": [
          {
            "role": "user",
            "content": "Write a letter to my daughter from the future 2035 and tell her to study science and technology well, be the master of science and technology, and promote the development of science and technology and economy. She is now in grade 3."
          }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10
      }'

    Expected results:

    {"id":"chatcmpl-53613fd815da46df92cc9b92cd156146","object":"chat.com pletion","created":1739261570,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOK. The user needs to write a letter to his third-grade daughter from 2035 in the future, and convey three key messages at the same time: learn technology well, be the master of technology, and promote technology and economic development. First, I have to consider that the tone of the letter should be kind and hopeful, while reflecting a sense of future technology. \n\nConsidering that the daughter is now in the third grade, the content should not be too complicated and the language should be simple and easy to understand. At the same time, let the daughter feel the importance of science and technology and spike her interests in science and technology. It may be necessary to start from her daily life and give some examples that she may have come into contact with, such as electronic products, the Internet, etc., so that she can resonate more easily. \n\nNext, I have to think about how to structure this letter. It may start with welcoming her to receive this letter, and then introduce the development of future technology, such as smart robots and smart homes. Then it emphasizes the importance of learning science and technology, and encourages her to become the master of science and technology and participate in the development of science and technology. Finally, express the expectations and blessings. \n\nIn terms of content, it is necessary to highlight the impact of technology on life, such as smart assistants, smart homes, new energy vehicles, etc. These are all children may have heard of, but the specific details may need to be simplified to avoid being too technical and keep them interesting. \n\nAt the same time, the letter should mention the impact of science and technology on the economy, such as economic growth, job creation, etc., but it should be presented in a positive and encouraging way, so that the daughter can feel the benefits of science and technology, rather than a simple digital game. \n\nFinally, the ending part should be warm, express her pride and expectation, and encourage her to pursue the future bravely and become a leader in science and technology. \n\nIn general, this letter needs to be educational, interesting and encouraging, using simple and clear language, combined with specific examples of future technology, so that my daughter can feel the charm of technology and the importance of learning in a relaxed reading. \n</think>\n\nDear Future 2035: \n\nHello! \n\nFirst, I want to tell you a good news: the earth has entered a new era! By 2035, technology will no longer be the story of science fiction, but part of our every day life. Today, I am writing this letter to tell you some secrets about the future and how you should live and learn in this world of rapid development of science and technology. \n\n### 1. **Technology is around you**\n In 2035, technology is everywhere. Each of us can have an intelligent assistant, like an always-available teacher, ready to answer your questions. With a simple app, you can control the smart home devices in your home: turn on and off the lights, adjust the temperature, and even cook, all on your instruction! \n   \n   Also, you may have heard about it: intelligent robots. These robots can not only help us to complete the tedious work, but also play a great part in learning and entertainment. 
They can chat with you, study with you, and even help you solve math problems! Imagine that when you encounter a difficult problem, the robot will patiently teach you how to solve the problem step by step, isn't it great? \n\n### 2. ** the importance of learning science and technology **\n in the future 2035, science and technology has become the main driving force to promote social development. Every industry is being transformed by technology: doctors can use advanced medical equipment early to detect illnesses; teachers can use online classrooms to enable students to learn global knowledge without leaving home; farmers can use smart devices to accurately manage their fields and ensure that every tree receives the best care. \n\n   So, I want to tell you that learning technology is the most important task for every child. Science and technology can not only make you master more knowledge, but also make you become the future master. You will have the opportunity to create new technologies and change our lives! \n\n### 3. **Be the master of science and technology**\n In 2035, the world of science and technology needs everyone's strength. You don't need to be a company executive, just be yourself. You can use your wisdom and hands to promote the development of science and technology. For example, you can participate in technological innovation competitions in schools and design smarter robots; you can invent some small inventions at home to make life more convenient. \n\n   It is important that you have the courage to try new things and explore the unknown. The world of science and technology is infinitely vast, and everyone can find their place here. \n\n### 4. ** About Economy **\n In 2035, the economy will become more prosperous due to the development of science and technology. Smart cities will make our lives more efficient, new energy vehicles will make our travel more environmentally friendly, and medical technology will better protect our health. \n\n   So, when you stand at the beginning of this era, you should know that technology is not only changing the way we live, but also creating opportunities for the future. \n\n### 5. **My expectations**\n    I hope that in the future you can love science and technology, understand science and technology, master science and technology. Not only do you have to learn how to use technology, but you have to understand the principles and the stories behind it. When you grow up, you may become a leader in the field of science and technology, leading us to a brighter future. \n\n   The future world needs you! Are you ready for the challenge? \n\nFinally, I want to tell you that you are smarter, braver and more potential than anyone else today. Although the road ahead is very long, as long as you are willing to work hard, you will certainly be able to realize your dream. \n\nDear daughter in 2035, fight! \n \nYour grandpa ","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":40,"total_tokens":1034,"completion_tokens":994,"prompt_tokens_details":null}"

    ,
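    The server also supports OpenAI-style streaming responses. The following is a minimal variant of the request above; only the stream field is added and a shorter prompt is used:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
        "stream": true
      }'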

(Optional) Step 4: Clean up the environment

If you no longer need the inference service, promptly delete it and release the associated resources.

  1. Delete the inference service.

    arena serve delete deepseek-r1

    Expected results:

    INFO[0007] The serving job deepseek-r1 with version v1 has been deleted successfully
  2. Delete the PV and PVC.

    kubectl delete pvc llm-model
    kubectl delete pv llm-model

    Expected results:

    persistentvolumeclaim "llm-model" deleted
    persistentvolume "llm-model" deleted
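    The YAML template in Step 1 also created a Secret named oss-secret. If you no longer need it, delete it as well:

    kubectl delete secret oss-secret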
