
Container Compute Service: Use ACS GPU compute power to deploy a model inference service based on a DeepSeek distilled model

Last Updated: Sep 09, 2025

Container Compute Service (ACS) provides out-of-the-box GPU compute power, so you do not need deep knowledge of the underlying hardware or to manage GPU-accelerated nodes. ACS is easy to deploy and is billed on a pay-as-you-go basis, which makes it suitable for LLM inference services and can effectively reduce inference costs. This topic describes how to deploy a model inference service based on a DeepSeek distilled model in ACS.

Background information

DeepSeek-R1

DeepSeek-R1 is the first-generation reasoning model released by DeepSeek. It is designed to improve the reasoning performance of LLMs through large-scale reinforcement learning. Benchmark results show that DeepSeek-R1 outperforms other closed-source models in mathematical reasoning and programming competitions, and its performance even reaches or surpasses the OpenAI o1 series in certain areas. DeepSeek-R1 also performs remarkably well in knowledge-related tasks, such as creative writing and Q&A. In addition, DeepSeek distills its reasoning capabilities into smaller models, such as Qwen and Llama, to improve their reasoning performance. The 14B model distilled from DeepSeek surpasses the open source QwQ-32B model, and the 32B and 70B distilled models also set new records. For more information about DeepSeek, see the DeepSeek AI GitHub repository.

vLLM

vLLM is a high-performance, easy-to-use LLM inference serving framework. vLLM supports most commonly used LLMs, including the Qwen series of models. Powered by techniques such as PagedAttention, continuous batching, and model quantization, vLLM greatly improves the inference efficiency of LLMs. For more information, see the vLLM GitHub repository.

Arena

Arena is a lightweight command-line client for managing Kubernetes-based machine learning jobs. Arena streamlines data preparation, model development, model training, and model prediction throughout the machine learning lifecycle, which improves the efficiency of data scientists. Arena is also deeply integrated with the basic services of Alibaba Cloud: it supports GPU sharing and Cloud Parallel File System (CPFS), and it can run in deep learning frameworks optimized by Alibaba Cloud. This maximizes the performance and utilization of the heterogeneous computing resources provided by Alibaba Cloud. For more information about Arena, see the Arena GitHub repository.

Prerequisites

GPU-accelerated instance specification and estimated cost

GPU memory is occupied by model parameters during the inference phase. The usage is calculated based on the following formula.

GPU memory = Number of model parameters x Bytes of precision data

Take a model with FP16 default precision and 7B (7 billion) parameters as an example. Each parameter occupies 2 bytes (a 16-bit floating-point number divided by 8 bits per byte), so:

GPU memory = 7 × 10⁹ × 2 bytes ≈ 13.04 GiB

In addition to the memory used to load the model, you also need to account for the size of the KV cache and the GPU memory utilization; a portion of GPU memory is typically reserved as a buffer. Therefore, the suggested specification for the 7B model is 1 GPU with 24 GiB of GPU memory, 8 vCPUs, and 32 GiB of memory. You can also refer to the following table of suggested specifications and to GPU models and specifications. For more information about the billing of ACS GPU-accelerated instances, see Billing overview.
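You can reproduce this estimate in a shell. The following is a minimal sketch that simply restates the formula above (any POSIX awk works); it is not an official sizing tool:

    # GPU memory ≈ number of parameters x bytes per parameter.
    # 7B parameters at FP16 (2 bytes each), converted to GiB (1 GiB = 1024^3 bytes).
    awk 'BEGIN { printf "%.2f GiB\n", 7e9 * 2 / (1024 ^ 3) }'
    # Prints: 13.04 GiB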

| Model name | Model version | Model size | Suggested vCPUs | Suggested memory | Suggested GPU memory |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B (1.5 billion parameters) | 3.55 GB | 4 or 6 | 30 GiB | 24 GiB |
| DeepSeek-R1-Distill-Qwen-7B | 7B (7 billion parameters) | 15.23 GB | 6 or 8 | 32 GiB | 24 GiB |
| DeepSeek-R1-Distill-Llama-8B | 8B (8 billion parameters) | 16.06 GB | 6 or 8 | 32 GiB | 24 GiB |
| DeepSeek-R1-Distill-Qwen-14B | 14B (14 billion parameters) | 29.54 GB | More than 8 | 64 GiB | 48 GiB |
| DeepSeek-R1-Distill-Qwen-32B | 32B (32 billion parameters) | 74.32 GB | More than 8 | 128 GiB | 96 GiB |
| DeepSeek-R1-Distill-Llama-70B | 70B (70 billion parameters) | 140.56 GB | More than 12 | 128 GiB | 192 GiB |


Procedure

Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files

Note

It usually takes 1 to 2 hours to download and upload the model. You can submit a ticket to copy the model files to your OSS bucket.

  1. Run the following command to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.

    Note

    Check whether the git-lfs plug-in is installed. If not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
    cd DeepSeek-R1-Distill-Qwen-7B/
    git lfs pull
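    Optionally, verify that the weights were fully pulled before you upload them. Assuming the ModelScope repository layout, which stores the weights as .safetensors files, the total size should be roughly 15 GB for this model (see the table above):

    du -sh .
    ls -lh *.safetensors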
  2. Create an OSS directory and upload the model files to the directory.

    Note

    To install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
    ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
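    Optionally, list the uploaded objects to confirm that the copy is complete:

    ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B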
  3. Create a PV named llm-model and a corresponding PVC for the cluster. For more information, see Mount a statically provisioned OSS volume.

    The following table describes the basic parameters that are used to create the PV.

    | Parameter | Description |
    | --- | --- |
    | PV Type | OSS |
    | Volume Name | llm-model |
    | Access Certificate | Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket. |
    | Bucket ID | Select the OSS bucket that you created in the previous step. |
    | OSS Path | Select the path of the model, such as /models/DeepSeek-R1-Distill-Qwen-7B. |

    The following table describes the basic parameters that are used to create the PVC.

    | Parameter | Description |
    | --- | --- |
    | PVC Type | OSS |
    | Name | llm-model |
    | Allocation Mode | In this example, Existing Volumes is selected. |
    | Existing Volumes | Click Existing Volumes and select the PV that you created. |

    The following code block shows the YAML template:

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
      akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi 
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <your-bucket-name> # The name of the OSS bucket.
          url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <your-model-path> # The model path, such as /models/DeepSeek-R1-Distill-Qwen-7B/ in this example.
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
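    If you use the YAML template instead of the console, save it to a file and apply it with kubectl. The file name llm-model.yaml below is illustrative:

    kubectl apply -f llm-model.yaml
    # Verify that the PV and PVC are created and that the PVC is Bound.
    kubectl get pv llm-model
    kubectl get pvc llm-model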

Step 2: Deploy the model

  1. Run the following command to deploy the DeepSeek-R1-Distill-Qwen-7B model inference service that uses the vLLM framework.

    The inference service exposes an OpenAI-compatible HTTP API. In the following code block, the --data parameter provided by the Arena client treats the model parameter files as a special dataset and mounts them to the specified path (/models/DeepSeek-R1-Distill-Qwen-7B) of the container that runs the inference service. --max-model-len specifies the maximum number of tokens (the context length) that the model can process per request. A larger value lets the service handle longer prompts and outputs, but it also increases GPU memory usage.

    Note
    • Replace <example-model> in the --label=alibabacloud.com/gpu-model-series=<example-model> flag with an actual GPU model supported by ACS. Submit a ticket to obtain the list of GPU models supported by ACS.

    • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} is the address of the public image. We recommend that you pull AI container images over a VPC to accelerate image pulling.

    arena serve custom \
    --name=deepseek-r1 \
    --version=v1 \
    --gpus=1 \
    --cpu=8 \
    --memory=32Gi \
    --replicas=1 \
    --label=alibabacloud.com/compute-class=gpu \
    --label=alibabacloud.com/gpu-model-series=<example-model> \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.02-vllm0.7.2-sglang0.4.3.post2-pytorch2.5-cuda12.4-20250305-serverless \
    --data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

    Expected results:

    service/deepseek-r1-v1 created
    deployment.apps/deepseek-r1-v1-custom-serving created
    INFO[0004] The Job deepseek-r1 has been submitted successfully
    INFO[0004] You can run `arena serve get deepseek-r1 --type custom-serving -n default` to check the job status

    The following table describes the parameters.

    | Parameter | Description |
    | --- | --- |
    | --name | The name of the inference service. |
    | --version | The version of the inference service. |
    | --gpus | The number of GPUs used by each inference service replica. |
    | --cpu | The number of vCPUs used by each inference service replica. |
    | --memory | The amount of memory used by each inference service replica. |
    | --replicas | The number of inference service replicas. |
    | --label | Add the following labels to specify ACS GPU compute power: --label=alibabacloud.com/compute-class=gpu and --label=alibabacloud.com/gpu-model-series=<example-model>. |
    | --restful-port | The port on which the inference service is exposed. |
    | --readiness-probe-action | The action type of the readiness probe. Valid values: httpGet, exec, grpc, and tcpSocket. |
    | --readiness-probe-action-option | The action options of the readiness probe. |
    | --readiness-probe-option | The readiness probe configuration. |
    | --image | The address of the inference service image. |
    | --data | Mounts a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:): specify the name of the PVC on the left side of the colon (you can run the arena data list command to view the PVCs in the current cluster) and the container path to which the PVC is mounted on the right side. The inference service reads the model from the specified path, which lets it access the data stored in the PV claimed by the PVC. |
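    For example, to confirm which PVCs Arena can mount (and that llm-model is among them), run:

    arena data list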

  2. Run the following command to query the details of the inference service:

    arena serve get deepseek-r1

    Expected results:

    Name:       deepseek-r1
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        6h
    Address:    10.0.78.27
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                            STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                            ------   ---  -----  --------  ---  ----
      deepseek-r1-v1-custom-serving-54d579d994-dqwxz  Running  1h   1/1    0         1    virtual-kubelet-cn-hangzhou-b
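    To follow the vLLM startup logs while the replica initializes, you can also query the deployment created by Arena (the deployment name comes from the arena output in the previous substep):

    kubectl logs deployment/deepseek-r1-v1-custom-serving -f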

Step 3: Verify the inference services

  1. Run kubectl port-forward to configure port forwarding between the local environment and inference service.

    Note

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.

    kubectl port-forward svc/deepseek-r1-v1 8000:8000

    Expected results:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
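    With port forwarding in place, you can optionally confirm that the service is up before sending a chat request. The vLLM OpenAI-compatible server exposes a model list endpoint, which should return the served model name deepseek-r1:

    curl http://localhost:8000/v1/models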
  2. Send requests to the inference service.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-r1",
        "messages": [
          {
            "role": "user",
            "content": "Write a letter to my daughter from the future 2035 and tell her to study science and technology well, be the master of science and technology, and promote the development of science and technology and economy. She is now in grade 3."
          }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10
      }'

    Expected results:

    {"id":"chatcmpl-53613fd815da46df92cc9b92cd156146","object":"chat.com pletion","created":1739261570,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOK. The user needs to write a letter to his third-grade daughter from 2035 in the future, and convey three key messages at the same time: learn technology well, be the master of technology, and promote technology and economic development. First, I have to consider that the tone of the letter should be kind and hopeful, while reflecting a sense of future technology. \n\nConsidering that the daughter is now in the third grade, the content should not be too complicated and the language should be simple and easy to understand. At the same time, let the daughter feel the importance of science and technology and spike her interests in science and technology. It may be necessary to start from her daily life and give some examples that she may have come into contact with, such as electronic products, the Internet, etc., so that she can resonate more easily. \n\nNext, I have to think about how to structure this letter. It may start with welcoming her to receive this letter, and then introduce the development of future technology, such as smart robots and smart homes. Then it emphasizes the importance of learning science and technology, and encourages her to become the master of science and technology and participate in the development of science and technology. Finally, express the expectations and blessings. \n\nIn terms of content, it is necessary to highlight the impact of technology on life, such as smart assistants, smart homes, new energy vehicles, etc. These are all children may have heard of, but the specific details may need to be simplified to avoid being too technical and keep them interesting. \n\nAt the same time, the letter should mention the impact of science and technology on the economy, such as economic growth, job creation, etc., but it should be presented in a positive and encouraging way, so that the daughter can feel the benefits of science and technology, rather than a simple digital game. \n\nFinally, the ending part should be warm, express her pride and expectation, and encourage her to pursue the future bravely and become a leader in science and technology. \n\nIn general, this letter needs to be educational, interesting and encouraging, using simple and clear language, combined with specific examples of future technology, so that my daughter can feel the charm of technology and the importance of learning in a relaxed reading. \n</think>\n\nDear Future 2035: \n\nHello! \n\nFirst, I want to tell you a good news: the earth has entered a new era! By 2035, technology will no longer be the story of science fiction, but part of our every day life. Today, I am writing this letter to tell you some secrets about the future and how you should live and learn in this world of rapid development of science and technology. \n\n### 1. **Technology is around you**\n In 2035, technology is everywhere. Each of us can have an intelligent assistant, like an always-available teacher, ready to answer your questions. With a simple app, you can control the smart home devices in your home: turn on and off the lights, adjust the temperature, and even cook, all on your instruction! \n   \n   Also, you may have heard about it: intelligent robots. These robots can not only help us to complete the tedious work, but also play a great part in learning and entertainment. 
They can chat with you, study with you, and even help you solve math problems! Imagine that when you encounter a difficult problem, the robot will patiently teach you how to solve the problem step by step, isn't it great? \n\n### 2. ** the importance of learning science and technology **\n in the future 2035, science and technology has become the main driving force to promote social development. Every industry is being transformed by technology: doctors can use advanced medical equipment early to detect illnesses; teachers can use online classrooms to enable students to learn global knowledge without leaving home; farmers can use smart devices to accurately manage their fields and ensure that every tree receives the best care. \n\n   So, I want to tell you that learning technology is the most important task for every child. Science and technology can not only make you master more knowledge, but also make you become the future master. You will have the opportunity to create new technologies and change our lives! \n\n### 3. **Be the master of science and technology**\n In 2035, the world of science and technology needs everyone's strength. You don't need to be a company executive, just be yourself. You can use your wisdom and hands to promote the development of science and technology. For example, you can participate in technological innovation competitions in schools and design smarter robots; you can invent some small inventions at home to make life more convenient. \n\n   It is important that you have the courage to try new things and explore the unknown. The world of science and technology is infinitely vast, and everyone can find their place here. \n\n### 4. ** About Economy **\n In 2035, the economy will become more prosperous due to the development of science and technology. Smart cities will make our lives more efficient, new energy vehicles will make our travel more environmentally friendly, and medical technology will better protect our health. \n\n   So, when you stand at the beginning of this era, you should know that technology is not only changing the way we live, but also creating opportunities for the future. \n\n### 5. **My expectations**\n    I hope that in the future you can love science and technology, understand science and technology, master science and technology. Not only do you have to learn how to use technology, but you have to understand the principles and the stories behind it. When you grow up, you may become a leader in the field of science and technology, leading us to a brighter future. \n\n   The future world needs you! Are you ready for the challenge? \n\nFinally, I want to tell you that you are smarter, braver and more potential than anyone else today. Although the road ahead is very long, as long as you are willing to work hard, you will certainly be able to realize your dream. \n\nDear daughter in 2035, fight! \n \nYour grandpa ","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":40,"total_tokens":1034,"completion_tokens":994,"prompt_tokens_details":null}"

    ,
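    The server also supports OpenAI-style streaming responses. The following is a minimal variant of the request above; only the stream field is added and a shorter prompt is used:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
        "stream": true
      }'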

(Optional) Step 4: Clean up the environment

If you no longer need the inference service, promptly delete it and release the associated resources.

  1. Delete the inference service.

    arena serve delete deepseek-r1

    Expected results:

    INFO[0007] The serving job deepseek-r1 with version v1 has been deleted successfully
  2. Delete the PV and PVC.

    kubectl delete pvc llm-model
    kubectl delete pv llm-model

    Expected results:

    persistentvolumeclaim "llm-model" deleted
    persistentvolume "llm-model" deleted
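    The YAML template in Step 1 also created a Secret named oss-secret. If you no longer need it, delete it as well:

    kubectl delete secret oss-secret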
