Container Compute Service (ACS) does not require deep knowledge of the underlying hardware, and you do not need to manage GPU-accelerated nodes: all configurations work out of the box. ACS is easy to deploy and is billed on a pay-as-you-go basis, which makes it suitable for LLM inference services and can effectively reduce inference costs. This topic describes how to deploy a model inference service for the DeepSeek full version in ACS.
Background information
DeepSeek-R1
vLLM
ACS
Prerequisites
An ACS cluster is created. The region and zone of the cluster can provide GPU resources. For more information, see Create an ACS cluster.
A kubectl client is connected to the cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
GPU-accelerated instance specification and estimated cost
The DeepSeek-R1 full version in ACS is accelerated by GPU resources, and deploying the inference service requires 16 GPUs. Suggested ACS GPU-accelerated instance specification: 16 GPUs (96 GiB of memory per GPU), 64 vCPUs, and 512 GiB of memory. You can also refer to the Table of suggested specifications and GPU models and specifications. For more information about the billing of ACS GPU-accelerated instances, see Billing overview.
Make sure that the specification of the ACS GPU-accelerated instance complies with ACS pod specification adjustment logic.
By default, an ACS pod provides 30 GiB of free EphemeralStorage. The inference image used in this example is large. If you need more storage space, customize the size of the EphemeralStorage. For more information, see Add the EphemeralStorage.
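As a rough, back-of-the-envelope check (not from this topic), you can compare the suggested specification against the approximate size of the INT8 model weights. The sketch below assumes roughly 671 billion parameters for the DeepSeek-R1 full version, stored at 1 byte per parameter under INT8 quantization; actual usage also includes the KV cache and runtime overhead.

```python
# Back-of-the-envelope memory check for the suggested 16-GPU specification.
# Assumption (not stated in this topic): the DeepSeek-R1 full version has
# roughly 671 billion parameters, stored at 1 byte per parameter with INT8.
params = 671e9
bytes_per_param = 1  # INT8 quantization

weights_gib = params * bytes_per_param / 1024**3

gpus = 16
mem_per_gpu_gib = 96
total_gpu_mem_gib = gpus * mem_per_gpu_gib

print(f"INT8 weights: ~{weights_gib:.0f} GiB")          # roughly 625 GiB
print(f"Total GPU memory: {total_gpu_mem_gib} GiB")     # 1536 GiB
```

Under these assumptions, the weights alone occupy around 625 GiB of the 1536 GiB of total GPU memory, which leaves headroom for the KV cache and runtime.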
Procedure
Step 1: Prepare the DeepSeek-R1-GPTQ-INT8 model files
The LLM requires large amounts of disk space to store model files. We recommend that you use a NAS or OSS volume to persist the model files. In the following example, an OSS volume is used to persist the DeepSeek-R1-GPTQ-INT8 model files.
Submit a ticket to obtain the model files and YAML content.
Model file: DeepSeek-R1-GPTQ-INT8.
GPU model: Replace the variable in the alibabacloud.com/gpu-model-series: <example-model> label with the actual GPU model supported by ACS. For more information, see Specify GPU models and driver versions for ACS GPU-accelerated pods.
Base image: Replace the variable in containers[].image: <base image obtained from the PDSA> with the actual image address.
Secret for pulling images: Obtain and create a Secret, and replace the variable in imagePullSecrets[].name: <Secret obtained from the PDSA> with the actual name of the Secret.
(Optional) If you choose to download the model files to your local environment, create a directory in your OSS bucket and upload the model files to the directory.
Note: To install and use ossutil, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8
ossutil cp -r /mnt/models/DeepSeek-R1-GPTQ-INT8 oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8
Create a PV named llm-model and a PVC for the cluster. For more information, see Mount a statically provisioned OSS volume.
Use the console
Configure the following basic parameters to create the PV.
PV Type: OSS
Volume Name: llm-model
Access Certificate: Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.
Bucket ID: Select the OSS bucket that you created in the previous step.
OSS Path: Select the path of the model, such as /models/DeepSeek-R1-GPTQ-INT8.
Configure the following basic parameters to create the PVC.
PVC Type: OSS
Name: llm-model
Allocation Mode: In this example, Existing Volumes is selected.
Existing Volumes: Click Existing Volumes and select the PV that you created.
Use kubectl
The following code block shows the YAML template:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The name of the OSS bucket.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # The model path, such as /models/DeepSeek-R1-GPTQ-INT8/ in this example.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
Step 2: Deploy the model based on ACS GPU compute power
Run the following command to deploy the DeepSeek-R1 model inference service that uses the vLLM framework.
The inference service exposes an OpenAI-compatible HTTP API. The following configuration treats the model parameter files as a special dataset and mounts them to the specified path (/data/DeepSeek-R1-GPTQ-INT8) of the container that runs the inference service.
The --max-model-len parameter specifies the maximum number of tokens (the context length) that the model can process in a single sequence. Increasing it allows longer prompts and responses but also increases GPU memory usage. We recommend that you set it to a value around 128000 for the DeepSeek-R1-GPTQ-INT8 model and adjust --gpu-memory-utilization accordingly.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: default
  labels:
    app: deepseek-r1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    metadata:
      labels:
        app: deepseek-r1
        alibabacloud.com/compute-class: gpu
        # example-model indicates the GPU model. Replace it with the actual GPU model, such as T4.
        alibabacloud.com/gpu-model-series: <example-model>
    spec:
      imagePullSecrets:
        - name: <Secret obtained from the PDSA>
      containers:
        - name: llm-ds-r1
          image: <base image obtained from the PDSA>
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - "vllm serve /data/DeepSeek-R1-GPTQ-INT8 --port 8000 --trust-remote-code --served-model-name ds --max-model-len 128000 --quantization moe_wna16 --gpu-memory-utilization 0.98 --tensor-parallel-size 16"
          resources:
            limits:
              alibabacloud.com/gpu: "16"
              cpu: "64"
              memory: 512Gi
            requests:
              alibabacloud.com/gpu: "16"
              cpu: "64"
              memory: 512Gi
          volumeMounts:
            - name: llm-model
              mountPath: /data/DeepSeek-R1-GPTQ-INT8
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      volumes:
        - name: llm-model
          persistentVolumeClaim:
            claimName: llm-model
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1
spec:
  type: ClusterIP
  selector:
    app: deepseek-r1
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
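To see how --gpu-memory-utilization and --max-model-len interact, the following sketch estimates the memory left for the vLLM KV cache under the suggested 16 × 96 GiB specification. The weight footprint and per-token KV-cache cost are illustrative assumptions, not values measured from this deployment.

```python
# Illustrative KV-cache budget for the suggested 16 x 96 GiB specification.
# The weight footprint and per-token KV-cache cost below are assumptions
# for illustration only, not measured values from this deployment.
total_gpu_mem_gib = 16 * 96          # suggested specification: 16 GPUs x 96 GiB
gpu_memory_utilization = 0.98        # matches --gpu-memory-utilization above
weights_gib = 625                    # assumed INT8 weight footprint

# Memory vLLM can devote to the KV cache after loading the weights.
kv_budget_gib = total_gpu_mem_gib * gpu_memory_utilization - weights_gib

# Hypothetical per-token KV-cache cost; the real value depends on the
# model architecture and the vLLM version.
kv_bytes_per_token = 70 * 1024
max_model_len = 128_000              # matches --max-model-len above
per_seq_gib = max_model_len * kv_bytes_per_token / 1024**3

print(f"KV cache budget: ~{kv_budget_gib:.0f} GiB")
print(f"KV cache per full-length sequence: ~{per_seq_gib:.1f} GiB")
```

The takeaway is the trade-off itself: raising --max-model-len increases the memory each sequence can consume, so a larger context length leaves room for fewer concurrent sequences within the same KV-cache budget.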
Step 3: Verify the inference service
Run kubectl port-forward to configure port forwarding between the local environment and the inference service.
Note: Port forwarding set up by using kubectl port-forward is not reliable, secure, or extensible in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.
kubectl port-forward svc/deepseek-r1 8000:8000
Expected results:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Send requests to the inference service.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ds",
    "messages": [
      {
        "role": "user",
        "content": "Write a letter to my daughter from the future 2035 and tell her to study science and technology well, be the master of science and technology, and promote the development of science and technology and economy. She is now in grade 3."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": 10
  }'
Expected results:
{"id":"chatcmpl-53613fd815da46df92cc9b92cd156146","object":"chat.completion","created":1739261570,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOK. The user needs to write a letter to his third-grade daughter from 2035 in the future, and convey three key messages at the same time: learn technology well, be the master of technology, and promote technology and economic development. First, I have to consider that the tone of the letter should be kind and hopeful, while reflecting a sense of future technology. \n\nConsidering that the daughter is now in the third grade, the content should not be too complicated and the language should be simple and easy to understand. At the same time, let the daughter feel the importance of science and technology and spark her interest in science and technology. It may be necessary to start from her daily life and give some examples that she may have come into contact with, such as electronic products, the Internet, etc., so that she can resonate more easily. \n\nNext, I have to think about how to structure this letter. It may start with welcoming her to receive this letter, and then introduce the development of future technology, such as smart robots and smart homes. Then it emphasizes the importance of learning science and technology, and encourages her to become the master of science and technology and participate in the development of science and technology. Finally, express the expectations and blessings. \n\nIn terms of content, it is necessary to highlight the impact of technology on life, such as smart assistants, smart homes, new energy vehicles, etc. These are all children may have heard of, but the specific details may need to be simplified to avoid being too technical and keep them interesting. 
\n\nAt the same time, the letter should mention the impact of science and technology on the economy, such as economic growth, job creation, etc., but it should be presented in a positive and encouraging way, so that the daughter can feel the benefits of science and technology, rather than a simple digital game. \n\nFinally, the ending part should be warm, express her pride and expectation, and encourage her to pursue the future bravely and become a leader in science and technology. \n\nIn general, this letter needs to be educational, interesting and encouraging, using simple and clear language, combined with specific examples of future technology, so that my daughter can feel the charm of technology and the importance of learning in a relaxed reading. \n</think>\n\nDear Future 2035: \n\nHello! \n\nFirst, I want to tell you a good news: the earth has entered a new era! By 2035, technology will no longer be the story of science fiction, but part of our every day life. Today, I am writing this letter to tell you some secrets about the future and how you should live and learn in this world of rapid development of science and technology. \n\n### 1. **Technology is around you**\n In 2035, technology is everywhere. Each of us can have an intelligent assistant, like an always-available teacher, ready to answer your questions. With a simple app, you can control the smart home devices in your home: turn on and off the lights, adjust the temperature, and even cook, all on your instruction! \n \n Also, you may have heard about it: intelligent robots. These robots can not only help us to complete the tedious work, but also play a great part in learning and entertainment. They can chat with you, study with you, and even help you solve math problems! Imagine that when you encounter a difficult problem, the robot will patiently teach you how to solve the problem step by step, isn't it great? \n\n### 2. 
** the importance of learning science and technology **\n in the future 2035, science and technology has become the main driving force to promote social development. Every industry is being transformed by technology: doctors can use advanced medical equipment early to detect illnesses; teachers can use online classrooms to enable students to learn global knowledge without leaving home; farmers can use smart devices to accurately manage their fields and ensure that every tree receives the best care. So, I want to tell you that learning technology is the most important task for every child. Science and technology can not only make you master more knowledge, but also make you become the future master. You will have the opportunity to create new technologies and change our lives! \n\n### 3. **Be the master of science and technology**\n In 2035, the world of science and technology needs everyone's strength. You don't need to be a company executive, just be yourself. You can use your wisdom and hands to promote the development of science and technology. For example, you can participate in technological innovation competitions in schools and design smarter robots; you can invent some small inventions at home to make life more convenient. \n\n It is important that you have the courage to try new things and explore the unknown. The world of science and technology is infinitely vast, and everyone can find their place here. \n\n### 4. ** About Economy **\n In 2035, the economy will become more prosperous due to the development of science and technology. Smart cities will make our lives more efficient, new energy vehicles will make our travel more environmentally friendly, and medical technology will better protect our health. \n\n So, when you stand at the beginning of this era, you should know that technology is not only changing the way we live, but also creating opportunities for the future. \n\n### 5. 
**My expectations**\n I hope that in the future you can love science and technology, understand science and technology, master science and technology. Not only do you have to learn how to use technology, but you have to understand the principles and the stories behind it. When you grow up, you may become a leader in the field of science and technology, leading us to a brighter future. \n\n The future world needs you! Are you ready for the challenge? \n\nFinally, I want to tell you that you are smarter, braver and more potential than anyone else today. Although the road ahead is very long, as long as you are willing to work hard, you will certainly be able to realize your dream. \n\nDear daughter in 2035, fight! \n \nYour grandpa ","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":40,"total_tokens":1034,"completion_tokens":994,"prompt_tokens_details":null}}
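For application code, the same request can be sent from Python instead of curl. The snippet below is a minimal sketch that builds the request with only the standard library; it assumes the kubectl port-forward from the previous step is still running, so the actual send is left commented out.

```python
import json
from urllib import request

# Builds the same request as the curl command above.
# "ds" must match the --served-model-name passed to vllm serve.
payload = {
    "model": "ds",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": 10,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Sending the request requires the kubectl port-forward from the previous
# step to be running, so the actual call is left commented out here:
# with request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Because the service exposes an OpenAI-compatible API, any OpenAI-compatible client library can also be pointed at the same endpoint.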
(Optional) Step 4: Delete the environment
If you no longer need the inference service, delete the environment promptly.
Delete the inference workload and service.
kubectl delete deployment deepseek-r1
kubectl delete service deepseek-r1
Delete the PV and PVC.
kubectl delete pvc llm-model
kubectl delete pv llm-model
Expected results:
persistentvolumeclaim "llm-model" deleted
persistentvolume "llm-model" deleted
References
Container Compute Service (ACS) is integrated into Container Service for Kubernetes. This allows you to use the computing power of ACS in ACK Pro clusters. For more information about using ACS GPU compute power in ACK, see Use the computing power of ACS in ACK Pro clusters.
For more information about deploying DeepSeek in ACK, see the following topics:
For more information about DeepSeek R1 and V3, see the following topics: