Container Compute Service (ACS) provides GPU-accelerated compute power without requiring deep knowledge of underlying hardware or GPU node management. ACS offers simple deployment and pay-as-you-go billing, making it ideal for large language model (LLM) inference tasks while effectively reducing costs. This topic describes how to deploy a production-ready DeepSeek-R1 full model inference service using ACS.
Background
DeepSeek-R1 model
DeepSeek-R1 is DeepSeek's first-generation reasoning model, designed to enhance reasoning capabilities in LLMs through large-scale reinforcement learning. Experimental results show that DeepSeek-R1 excels in mathematical reasoning, programming competitions, and other tasks, not only outperforming other closed-source models but also matching or exceeding the OpenAI o1 series in certain tasks. DeepSeek-R1 also performs well in knowledge-based tasks and other task types, including creative writing and general question answering. For more information about DeepSeek models, see the DeepSeek AI GitHub repository.
vLLM
vLLM is an efficient, easy-to-use inference framework for LLMs that supports various popular models, including Qwen. vLLM achieves high inference efficiency through optimizations such as PagedAttention, continuous batching, and model quantization. For more information about vLLM, see the vLLM GitHub repository.
Container Compute Service (ACS)
ACS provides accessible, flexible, and elastic next-generation container compute power. As a Kubernetes-based container service, ACS offers general-purpose and heterogeneous compute power that conforms to container specifications. ACS delivers compute power in a serverless model, eliminating the need to manage underlying nodes or clusters. Through integrated scheduling, container runtime, storage, and networking capabilities, ACS reduces operational complexity while optimizing elasticity and flexibility. Pay-as-you-go billing and elastic scaling capabilities significantly reduce resource costs. For LLM inference scenarios, ACS data and image acceleration capabilities further optimize model startup time and resource costs.
Prerequisites
Before you begin:
Create an ACS cluster with the default service role assigned. See Create an ACS cluster.
Configure kubectl to connect to your ACS cluster. See Obtain a cluster kubeconfig file and use kubectl to connect to the cluster.
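Before continuing, you can optionally confirm that kubectl can reach the cluster (these commands assume the kubeconfig from the step above is active):

```shell
kubectl cluster-info
kubectl get ns default
```

If either command fails, revisit the kubeconfig setup before proceeding.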
GPU instance specifications and cost estimation
Deploying the full DeepSeek-R1 model on ACS requires 16 GPUs for a single instance. We recommend the following resource configuration for single-instance deployment:
| Resource | Specification |
| --- | --- |
| GPU | 16 cards (96 GiB memory per card) |
| CPU | 64 vCPU |
| Memory | 512 GiB |
For information about selecting instance specifications, see GPU support in ACS. For billing information, see Billing overview.
ACS GPU instance specifications follow the ACS Pod specification logic.
ACS Pods provide 30 GiB of free ephemeral storage by default. The inference image used in this topic requires more storage space. To increase ephemeral storage, see Overview of ACS pod instances.
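As a rough sanity check on the 16 × 96 GiB configuration, you can estimate whether the quantized weights alone fit in aggregate GPU memory. The parameter count and byte-per-weight figures below are approximations we assume for illustration (DeepSeek-R1 has roughly 671B parameters, and INT8 stores about one byte per weight); the helper function is our own, not part of any ACS tooling:

```python
GIB = 1024**3

def weights_fit(num_params: float, bytes_per_param: float,
                gpus: int, gib_per_gpu: int) -> bool:
    """True if the quantized weights alone fit in aggregate GPU memory.

    Ignores KV cache, activations, and framework overhead, so this is
    only a lower bound on what the deployment actually needs.
    """
    weights_gib = num_params * bytes_per_param / GIB
    return weights_gib < gpus * gib_per_gpu

# ~671B parameters at ~1 byte each is ~625 GiB of weights,
# against 16 GPUs x 96 GiB = 1536 GiB of aggregate memory.
print(weights_fit(671e9, 1, 16, 96))  # True
```

The headroom beyond the weights is what vLLM uses for the KV cache and activations, which is why a 16-card instance is recommended rather than the bare minimum.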
Step 1: Prepare the DeepSeek-R1-GPTQ-INT8 model files
LLMs require substantial disk space due to their large parameter counts. We recommend using NAS or OSS storage volumes to persistently store model files. The following steps use OSS storage as an example.
Note: Submit a ticket to obtain the model files and YAML deployment configuration. In the configuration, replace the following placeholders:
- Model files: DeepSeek-R1-GPTQ-INT8
- GPU model: Replace the alibabacloud.com/gpu-model-series label with the actual GPU model supported by ACS
- Base image: Replace the container image with the actual image address
- Image pull secret: Create a Secret and replace the imagePullSecrets name with the actual Secret name
(Optional) Upload the model to OSS
If you downloaded the model files locally, create a directory in OSS and upload the model:
```shell
ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8
ossutil cp -r /mnt/models/DeepSeek-R1-GPTQ-INT8 oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8
```

For information about installing and using ossutil, see Install ossutil.
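After the upload finishes, you can list the directory to confirm the files landed where the PV created below expects them:

```shell
ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8/
```

The listing should show the model's weight shards, tokenizer, and config files.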
Create a PV and PVC
Create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) named llm-model for your cluster. For more information, see Use an ossfs 1.0 statically provisioned volume.
Console
Configure the PV with the following settings:
| Parameter | Value |
| --- | --- |
| Storage Volume Type | OSS |
| Name | llm-model |
| Access Credentials | AccessKey ID and AccessKey Secret for OSS access |
| Bucket ID | Select your OSS bucket |
| OSS Path | /models/DeepSeek-R1-GPTQ-INT8 |
Configure the PVC with the following settings:
| Parameter | Value |
| --- | --- |
| Storage Claim Type | OSS |
| Name | llm-model |
| Allocation Mode | Existing Storage Volume |
| Existing Storage Volume | Select the PV you created |
kubectl
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/DeepSeek-R1-GPTQ-INT8/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```

Step 2: Deploy the model using ACS GPU
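Before deploying, you can optionally check that the PVC from Step 1 is bound to the PV:

```shell
kubectl get pvc llm-model
```

The STATUS column should show Bound; if it shows Pending, verify the OSS credentials and bucket path in the PV definition.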
Run the following command to deploy the DeepSeek-R1-GPTQ-INT8 model inference service using the vLLM framework with RDMA acceleration. The inference service exposes an OpenAI-compatible HTTP API.
Note:
The max-model-len parameter sets the maximum token length the model can process. Increasing this value improves conversation quality but requires more GPU memory. For DeepSeek-R1-GPTQ-INT8, we recommend a maximum context length of around 128,000 tokens.
RDMA (Remote Direct Memory Access) enables zero-copy and kernel bypass, achieving lower latency, higher throughput, and lower CPU usage compared to TCP/IP. To use RDMA, add the label alibabacloud.com/hpn-type: rdma. Submit a ticket to confirm which GPU models support RDMA.
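To see why max-model-len drives GPU memory consumption, here is a back-of-the-envelope KV-cache estimate for a standard multi-head attention layout. The dimensions below are illustrative placeholders we chose, not DeepSeek-R1's actual architecture (which compresses its KV cache with multi-head latent attention):

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, dtype_bytes: int = 2) -> float:
    """Per-sequence KV-cache size in GiB: two tensors (K and V) per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * dtype_bytes) / 1024**3

# Illustrative dimensions only: 60 layers, 8 KV heads, head_dim 128, FP16.
print(round(kv_cache_gib(60, 8, 128, 128_000), 1))  # 29.3
```

Even with these modest placeholder dimensions, a single full-length 128,000-token sequence consumes tens of GiB, which is why a larger max-model-len demands correspondingly more GPU memory.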
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: default
  labels:
    app: deepseek-r1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    metadata:
      labels:
        app: deepseek-r1
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/gpu-model-series: <example-model>
        alibabacloud.com/hpn-type: "rdma"
    spec:
      imagePullSecrets:
        - name: <your-secret-name>
      containers:
        - name: llm-ds-r1
          image: <your-image-address>
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - "vllm serve /data/DeepSeek-R1-GPTQ-INT8 --port 8000 --trust-remote-code --served-model-name ds --max-model-len 128000 --quantization moe_wna16 --gpu-memory-utilization 0.98 --tensor-parallel-size 16"
          resources:
            limits:
              alibabacloud.com/gpu: "16"
              cpu: "64"
              memory: 512Gi
            requests:
              alibabacloud.com/gpu: "16"
              cpu: "64"
              memory: 512Gi
          volumeMounts:
            - name: llm-model
              mountPath: /data/DeepSeek-R1-GPTQ-INT8
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      volumes:
        - name: llm-model
          persistentVolumeClaim:
            claimName: llm-model
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1
spec:
  type: ClusterIP
  selector:
    app: deepseek-r1
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```

Apply the configuration:
```shell
kubectl apply -f deepseek-r1-deployment.yaml
```

Step 3: Verify the inference service
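Before sending requests, confirm that the Deployment has finished rolling out. The first start can take a while because the model weights are loaded from OSS:

```shell
kubectl rollout status deployment/deepseek-r1
kubectl logs deploy/deepseek-r1 --tail=20
```

The vLLM logs should eventually report that the API server is listening on port 8000.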
Set up port forwarding between the inference service and your local environment:
```shell
kubectl port-forward svc/deepseek-r1 8000:8000
```

Expected output:
```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```

Important: Port forwarding with kubectl is suitable for development and debugging only. It does not provide production-level reliability, security, or scalability. For production environments, see Quick Start for ALB Ingress.
Send a test inference request:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ds",
    "messages": [
      {
        "role": "user",
        "content": "Explain the concept of reinforcement learning in simple terms."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": 10
  }'
```

A successful response indicates that the inference service is running correctly.
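The same OpenAI-compatible endpoint can be called from Python. This sketch uses only the standard library and assumes the port-forward above is active; the helper names are our own, not part of vLLM or ACS:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "ds",
                       max_tokens: int = 1024) -> dict:
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
    }

def send_chat_request(prompt: str,
                      base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the vLLM server and return the reply text."""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the API follows the OpenAI schema, the official OpenAI Python SDK can also be pointed at the service by setting its base URL to http://localhost:8000/v1.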
Result
You have successfully deployed a DeepSeek-R1 full model inference service on ACS with GPU acceleration. The service exposes an OpenAI-compatible API that you can integrate with your applications.