This topic describes how to use KServe to deploy a production-ready DeepSeek model inference service in Alibaba Cloud Container Service for Kubernetes (ACK).
Background information
DeepSeek-R1 model
KServe
Arena
Prerequisites
You have created a Kubernetes cluster that contains GPUs. For more information, see Add a GPU node pool to a cluster.
You have connected to the cluster using kubectl. For more information, see Connect to a cluster using kubectl.
You have installed the ack-kserve component. For more information, see Install the ack-kserve component.
You have installed the Arena client. For more information, see Configure the Arena client.
GPU instance specifications and cost estimation
Model parameters are the main consumer of GPU memory during the inference phase. You can estimate the required GPU memory using the following formula:

Required GPU memory (GB) ≈ Number of model parameters (in billions) × Number of bytes per parameter

For example, consider a 7B model with a default precision of FP16. The model has 7 billion parameters, and each FP16 parameter occupies 2 bytes (16-bit floating-point number ÷ 8 bits per byte). Loading the model weights therefore requires approximately 7 × 2 = 14 GB of GPU memory.
In addition to the GPU memory required to load the model, you must also account for the KV cache size and peak GPU memory usage during computation, so a buffer is typically reserved. Therefore, we recommend that you use a GPU-accelerated instance with 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge. For more information about GPU-accelerated instance types and billing, see GPU-accelerated compute-optimized instance families and Elastic GPU Service billing.
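The estimate above can be reproduced with a quick calculation. The sketch below covers only the memory needed to hold the FP16 weights; it deliberately ignores the KV cache and runtime overhead, which is why the instance recommendation leaves extra headroom.

```shell
# Rough weight-memory estimate: parameters (billions) x bytes per parameter.
# FP16 stores each parameter in 2 bytes; DeepSeek-R1-Distill-Qwen-7B has ~7B parameters.
PARAMS_BILLIONS=7
BYTES_PER_PARAM=2
WEIGHT_MEM_GB=$((PARAMS_BILLIONS * BYTES_PER_PARAM))
echo "Approximate GPU memory for model weights: ${WEIGHT_MEM_GB} GB"
```

Because the KV cache grows with context length and request concurrency, the 24 GiB recommendation leaves roughly 10 GiB beyond the 14 GB of weights.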
Model deployment
Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files
Run the following command to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.
Note: Make sure that the git-lfs plug-in is installed. If it is not installed, run yum install git-lfs or apt-get install git-lfs. For more information about installation methods, see Install Git Large File Storage.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull

Create a directory in OSS and upload the model to OSS.
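Before uploading, you can confirm that git lfs pull actually fetched the weight files rather than leaving Git LFS pointer stubs behind. A minimal sketch (the helper name and the example file name are illustrative, not part of the official procedure): a pointer stub is a tiny text file whose first line starts with "version https://git-lfs", while a fully downloaded weight file is a large binary.

```shell
# Report whether a file is still a Git LFS pointer stub or a real download.
check_lfs_file() {
  if head -n 1 "$1" 2>/dev/null | grep -q '^version https://git-lfs'; then
    echo "pointer stub - run 'git lfs pull' again"
  else
    echo "real file"
  fi
}

# Illustrative usage inside the model directory (file name is an example):
# check_lfs_file model-00001-of-00002.safetensors
```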
Note: For more information about how to install and use ossutil, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B

Create a persistent volume (PV) and a persistent volume claim (PVC). Configure a PV and a PVC named llm-model for the target cluster. For more information, see Use ossfs 1.0 statically provisioned volumes.

Console example
The following table describes the basic configurations for the sample PV.
| Configuration item | Description |
| --- | --- |
| PV Type | OSS |
| Name | llm-model |
| Access Certificate | Configure the AccessKey ID and AccessKey secret used to access OSS. |
| Bucket ID | Select the OSS bucket that you created in the previous step. |
| OSS Path | Select the path where the model is stored, such as /models/DeepSeek-R1-Distill-Qwen-7B. |

The following table describes the basic configurations for the sample PVC.
| Configuration item | Description |
| --- | --- |
| PVC Type | OSS |
| Name | llm-model |
| Allocation Mode | Select an existing PV. |
| Existing Volumes | Click the link to select an existing PV, and then select the PV that you created. |
kubectl example
The following is a sample YAML file:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>         # The AccessKey ID used to access OSS.
  akSecret: <your-oss-sk>     # The AccessKey secret used to access OSS.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>   # The bucket name.
      url: <your-bucket-endpoint>  # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path>      # In this example, the path is /models/DeepSeek-R1-Distill-Qwen-7B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
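The manifests above still need to be applied to the cluster. A minimal sketch, assuming you saved them to a file named llm-model.yaml (the file name is an assumption):

```shell
# Apply the Secret, PV, and PVC, then confirm that the PVC binds to the PV.
kubectl apply -f llm-model.yaml
kubectl get pv llm-model    # STATUS should become Bound
kubectl get pvc llm-model   # STATUS should become Bound
```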
Step 2: Deploy the inference service
Run the following command to start the inference service named deepseek.
arena serve kserve \
  --name=deepseek \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
  --gpus=1 \
  --cpu=4 \
  --memory=12Gi \
  --data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
  "vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

The following table describes the parameters.
| Parameter | Required | Description |
| --- | --- | --- |
| --name | Yes | The name of the inference service to submit. The name must be globally unique. |
| --image | Yes | The image address of the inference service. |
| --gpus | No | The number of GPUs required by the inference service. Default value: 0. |
| --cpu | No | The number of CPUs required by the inference service. |
| --memory | No | The amount of memory required by the inference service. |
| --data | No | The model path of the service. In this topic, the model is stored in the llm-model PV that you created in the previous step, which is mounted to the /models/DeepSeek-R1-Distill-Qwen-7B directory in the container. |

Expected output:
inferenceservice.serving.kserve.io/deepseek created
INFO[0003] The Job deepseek has been submitted successfully
INFO[0003] You can run `arena serve get deepseek --type kserve -n default` to check the job status
Step 3: Verify the inference service
Run the following command to check the deployment status of the KServe inference service.
arena serve get deepseek

Expected output:

Name:       deepseek
Namespace:  default
Type:       KServe
Version:    1
Desired:    1
Available:  1
Age:        3m
Address:    http://deepseek-default.example.com
Port:       :80
GPU:        1

Instances:
  NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                 ------   ---  -----  --------  ---  ----
  deepseek-predictor-7cd4d568fd-fznfg  Running  3m   1/1    0         1    cn-beijing.172.16.1.77

The output indicates that the KServe inference service is deployed.
Run the following command to use the IP address of the NGINX Ingress gateway to access the inference service.
# Obtain the IP address of the NGINX Ingress controller.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')

# Obtain the hostname of the inference service.
SERVICE_HOSTNAME=$(kubectl get inferenceservice deepseek -o jsonpath='{.status.url}' | cut -d "/" -f 3)

# Send a request to the inference service.
curl -H "Host: $SERVICE_HOSTNAME" \
     -H "Content-Type: application/json" \
     http://$NGINX_INGRESS_IP:80/v1/chat/completions \
     -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:
{"id":"chatcmpl-0fe3044126252c994d470e84807d4a0a","object":"chat.completion","created":1738828016,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n\n</think>\n\nIt seems like you're testing or sharing some information. How can I assist you further? If you have any questions or need help with something, feel free to ask!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":48,"completion_tokens":39,"prompt_tokens_details":null},"prompt_logprobs":null}
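The response follows the OpenAI-compatible chat-completions schema, so the assistant's reply sits at choices[0].message.content. A small sketch of extracting it from a captured response (the RESPONSE value here is a shortened stand-in for the real service output):

```shell
# Extract the assistant message from a chat-completions JSON response.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hello"}}]}'
echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# prints: Hello
```

In practice you would capture the curl output into RESPONSE (for example, RESPONSE=$(curl -s ...)) and apply the same extraction.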
Observability
Observability for LLM inference services in a production environment is crucial for proactively finding and resolving issues. The vLLM framework provides many LLM inference metrics. For more information, see the Metrics document. KServe also provides metrics to help monitor the performance and health of model services. These capabilities are integrated into Arena. You can add the --enable-prometheus=true parameter when you submit the application to enable them.
arena serve kserve \
--name=deepseek \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
--gpus=1 \
--cpu=4 \
--memory=12Gi \
--enable-prometheus=true \
--data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
"vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

You can use a Grafana dashboard to monitor the LLM inference service deployed with vLLM. To do so, import the vLLM Grafana JSON model into Grafana and then create an observability dashboard for the LLM inference service. You can obtain the JSON model from the vLLM official website. The configured dashboard appears as follows:

Procedure for importing a Grafana dashboard
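After Prometheus support is enabled, you can also spot-check the metrics endpoint that vLLM exposes on its serving port. This sketch reuses the NGINX_INGRESS_IP and SERVICE_HOSTNAME variables from the verification step and assumes the gateway routes /metrics to the vLLM server; adjust the path or port if your routing differs.

```shell
# Scrape the Prometheus metrics exposed by vLLM (metric names are prefixed "vllm:").
curl -s -H "Host: $SERVICE_HOSTNAME" http://$NGINX_INGRESS_IP:80/metrics | grep '^vllm:' | head
```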
Elastic scaling
When you deploy and manage KServe model services, you may experience dynamic load fluctuations. KServe uses the Kubernetes Horizontal Pod Autoscaler (HPA) and the ack-alibaba-cloud-metrics-adapter component from ACK to automatically scale the number of model service pods based on CPU, memory, and GPU utilization, along with custom performance metrics, to ensure service stability and efficiency. For more information, see Configure auto scaling for a service.
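As a minimal illustration of the underlying HPA mechanism (not the recommended GPU-metric setup from the linked guide), you could attach a CPU-based HPA to the predictor Deployment. The Deployment name deepseek-predictor is inferred from the pod name in the verification output and should be confirmed with kubectl get deploy before use.

```shell
# Hypothetical CPU-based autoscaling for the predictor Deployment.
kubectl autoscale deployment deepseek-predictor --cpu-percent=80 --min=1 --max=3
kubectl get hpa deepseek-predictor
```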
Model acceleration
With the development of technology, the size of models used in AI applications is increasing. When you pull large files from storage services such as Object Storage Service (OSS) and File Storage NAS (NAS), issues such as high latency or cold starts may occur. You can use Fluid to significantly accelerate model loading speed and optimize the performance of inference services, especially KServe-based inference services. For more information, see Use Fluid for model acceleration.
Phased release
Phased release is a critical strategy in production environments to ensure business stability and minimize risks associated with changes. ACK supports various phased release policies, including traffic percentage-based and request header-based approaches. For more information, see Implement phased release for inference services.
GPU sharing inference
The DeepSeek-R1-Distill-Qwen-7B model requires only 14 GB of GPU memory. If you use a higher-spec GPU, consider using GPU sharing inference technology to improve GPU utilization. This technology partitions a GPU so that multiple inference services can share it, which improves overall GPU utilization. For more information, see Deploy GPU sharing inference services.