Container Service for Kubernetes: Deploy a DeepSeek distilled model inference service

Last Updated: Mar 26, 2026

Deploy a production-ready DeepSeek-R1-Distill-Qwen-7B inference service on Alibaba Cloud Container Service for Kubernetes (ACK) using KServe and Arena. This guide covers GPU sizing, model preparation, service deployment, verification, and observability setup.

Background

DeepSeek-R1

DeepSeek-R1 is DeepSeek's first-generation reasoning model. It improves the reasoning capabilities of large language models (LLMs) through large-scale reinforcement learning, and performs well across mathematical reasoning, programming competitions, knowledge-based tasks, creative writing, and general Q&A, rivaling closed-source models and approaching or matching the OpenAI o1 series on certain tasks.

DeepSeek also distills these reasoning capabilities into smaller models by fine-tuning Qwen and Llama. The distilled 14B model significantly outperforms the open-source QwQ-32B-Preview, and the 32B and 70B distilled models set new records for dense models on reasoning benchmarks.

For more information, see the DeepSeek AI GitHub repository.

KServe

KServe is an open-source, cloud-native model serving platform for Kubernetes. It supports multiple ML frameworks, provides elastic scaling, and uses declarative YAML APIs to simplify model deployment and management. For more information, see KServe.

Arena

Arena is a lightweight, Kubernetes-based solution for the full ML lifecycle — from data preparation and model development to training and inference. It integrates with Alibaba Cloud services and supports GPU sharing and CPFS. For more information, see the Arena GitHub repository.

Prerequisites

Before you begin, ensure that you have:

  - An ACK cluster that contains GPU-accelerated nodes. For sizing guidance, see the GPU sizing section below.
  - The Arena client installed and configured for the cluster.
  - kubectl configured to connect to the cluster.
  - An Object Storage Service (OSS) bucket for storing the model files.

GPU sizing

Model parameters are the main consumer of GPU memory during inference. Use the following formula to estimate the required GPU memory:

GPU memory = Number of parameters × Bytes per parameter

For a 7B model at FP16 precision: 7 × 10⁹ parameters × 2 bytes = 14 × 10⁹ bytes ≈ 13.04 GiB

Beyond loading the model weights, you need additional GPU memory for the KV cache and computation. Use a GPU instance with at least 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge.
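
The estimate above can be reproduced in a few lines. The following is a minimal sketch (the helper name is illustrative, not part of any SDK):

```python
def estimate_model_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the GPU memory (GiB) needed just to hold the model weights.

    num_params:      total parameter count, e.g. 7e9 for a 7B model
    bytes_per_param: 2 for FP16/BF16, 4 for FP32, 1 for INT8
    """
    return num_params * bytes_per_param / 2**30

# DeepSeek-R1-Distill-Qwen-7B at FP16: ~13.04 GiB for the weights alone,
# before accounting for the KV cache and computation overhead.
weights_gib = estimate_model_memory_gib(7e9, bytes_per_param=2)
print(f"weights: {weights_gib:.2f} GiB")
```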

For instance specifications and pricing, see GPU-accelerated compute optimized instance family and Elastic GPU Service billing.

Deploy the inference service

Step 1: Prepare model files

  1. Download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.

    Make sure git-lfs is installed before cloning. Install it with yum install git-lfs or apt-get install git-lfs. For more options, see Install Git Large File Storage.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
    cd DeepSeek-R1-Distill-Qwen-7B/
    git lfs pull
  2. Upload the model to an OSS bucket.

    For ossutil installation and usage, see Install ossutil.
    ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
    ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
  3. Create a persistent volume (PV) and persistent volume claim (PVC) named llm-model to mount the model into the cluster. For step-by-step instructions, see Use ossfs 1.0 statically provisioned volumes.

    Console

    Configure the PV with the following settings:

    Configuration item | Value
    PV type | OSS
    Name | llm-model
    Access certificate | AccessKey ID and AccessKey secret for OSS
    Bucket ID | The OSS bucket created in the previous step
    OSS path | /models/DeepSeek-R1-Distill-Qwen-7B

    Configure the PVC with the following settings:

    Configuration item | Value
    PVC type | OSS
    Name | llm-model
    Allocation mode | Select an existing PV
    Existing volumes | Select the PV created above

    Kubectl

    Apply the following YAML to create a Secret, PV, and PVC:

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak>       # AccessKey ID for OSS
      akSecret: <your-oss-sk>   # AccessKey secret for OSS
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <your-bucket-name>      # The bucket name
          url: <your-bucket-endpoint>     # Endpoint, e.g., oss-cn-hangzhou-internal.aliyuncs.com
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <your-model-path>         # e.g., /models/DeepSeek-R1-Distill-Qwen-7B/
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model

Step 2: Deploy the inference service

Run the following command to start the inference service.

arena serve kserve \
    --name=deepseek \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

Arena parameters

Parameter | Required | Description
--name | Yes | Name of the inference service. Must be globally unique.
--image | Yes | Container image for the inference service.
--gpus | No | Number of GPUs to allocate. Default: 0.
--cpu | No | Number of CPUs to allocate.
--memory | No | Amount of memory to allocate.
--data | No | Mounts a PVC into the container in the format <pvc-name>:<container-path>. Here, the llm-model PVC is mounted to /models/DeepSeek-R1-Distill-Qwen-7B.

vLLM serve parameters

Parameter | Description
--port 8080 | Port for the vLLM server to listen on.
--trust-remote-code | Allows loading custom model code from the model repository. Required for DeepSeek models.
--served-model-name deepseek-r1 | Model name exposed in the API. Used as the model field in API requests.
--max-model-len 32768 | Maximum token length for input + output. Set to 32,768 to balance context window and GPU memory.
--gpu-memory-utilization 0.95 | Fraction of GPU memory reserved for the model weights and KV cache. Set to 0.95 to maximize memory use while leaving headroom.
--enforce-eager | Disables CUDA graph capture and runs in eager execution mode. Reduces startup memory overhead, useful when GPU memory is tight.

Expected output:

inferenceservice.serving.kserve.io/deepseek created
INFO[0003] The Job deepseek has been submitted successfully
INFO[0003] You can run `arena serve get deepseek --type kserve -n default` to check the job status

Step 3: Verify the deployment

  1. Check the service status.

    arena serve get deepseek

    Expected output:

    Name:       deepseek
    Namespace:  default
    Type:       KServe
    Version:    1
    Desired:    1
    Available:  1
    Age:        3m
    Address:    http://deepseek-default.example.com
    Port:       :80
    GPU:        1
    
    
    Instances:
      NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                 ------   ---  -----  --------  ---  ----
      deepseek-predictor-7cd4d568fd-fznfg  Running  3m   1/1    0         1    cn-beijing.172.16.1.77

    The Available: 1 and Running status confirm the service is ready.

  2. Send a test request via the NGINX Ingress gateway.

    # Get the NGINX Ingress IP address
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the inference service hostname
    SERVICE_HOSTNAME=$(kubectl get inferenceservice deepseek -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a chat completion request
    curl -H "Host: $SERVICE_HOSTNAME" \
         -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/chat/completions \
         -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"chatcmpl-0fe3044126252c994d470e84807d4a0a","object":"chat.completion","created":1738828016,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n\n</think>\n\nIt seems like you're testing or sharing some information. How can I assist you further? If you have any questions or need help with something, feel free to ask!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":48,"completion_tokens":39,"prompt_tokens_details":null},"prompt_logprobs":null}

Observability

LLM inference services generate a rich set of metrics. vLLM exposes inference metrics such as token throughput and latency; KServe adds service health and performance metrics. Both are integrated into Arena and can be enabled with a single flag.

To enable Prometheus metrics, add --enable-prometheus=true to the arena serve kserve command:

arena serve kserve \
    --name=deepseek \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --enable-prometheus=true \
    --data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

For the full list of vLLM metrics, see the vLLM Metrics documentation.
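
With metrics enabled, vLLM serves samples in the Prometheus text exposition format at its /metrics endpoint. The sketch below parses a few lines of that format; the sample values are made up, and the metric names follow vLLM's naming scheme (verify them against your vLLM version's /metrics output):

```python
def parse_prometheus_text(text: str) -> dict:
    """Parse samples from Prometheus text exposition format.

    A deliberately simple parser: keeps the metric name (including any label
    set) as the key, and assumes label values contain no spaces.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE metadata lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative scrape output; values are invented for the example.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-r1"} 1.0
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="deepseek-r1"} 39.0
"""

metrics = parse_prometheus_text(sample)
print(metrics)
```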

Import a Grafana dashboard

Visualize inference metrics by importing the vLLM Grafana dashboard.

Import the dashboard

  1. Log on to the ARMS console.

  2. In the left navigation pane, click Integration Management.

  3. On the Integrated Environments tab, select Container Service, search for your ACK cluster by name, and click the target environment.


  4. On the Component Management tab, copy the Cluster ID, then click the link next to Dashboard Directory.


  5. To the right of the Dashboards tab, click Import.


  6. Copy the contents of the grafana.json file, paste it into the Import via panel json area, and click Load.

    You can also import by uploading the JSON file directly.


  7. Keep the default settings and click Import.

Verify the dashboard data

  1. Search for the data source using the copied Cluster ID or the Prometheus instance ID, and select the target data source.


  2. Send several requests to the inference service to simulate traffic, then verify metrics such as Token Throughput on the dashboard.


Production enhancements

Elastic scaling

If your inference workload has variable traffic, use KServe's Horizontal Pod Autoscaler (HPA) integration with the ack-alibaba-cloud-metrics-adapter component from ACK to automatically scale pods based on CPU, memory, GPU utilization, and custom metrics. This keeps the service stable during traffic spikes without over-provisioning.

For more information, see Configure auto scaling for a service.

Model acceleration

When the cluster pulls large model files from OSS or NAS, high latency and cold start delays can affect service availability. Use Fluid to cache model data close to the inference pods, significantly reducing load time.

For more information, see Use Fluid for model acceleration.

Phased release

When updating a model or configuration in production, use phased release to roll out changes to a subset of traffic first and validate stability before a full rollout. ACK supports both traffic percentage-based and request header-based phased release strategies.

For more information, see Implement phased release for inference services.

GPU sharing

The DeepSeek-R1-Distill-Qwen-7B model requires approximately 14 GB of GPU memory. If you are using a higher-spec GPU, consider GPU sharing to run multiple inference services on a single GPU and improve overall GPU utilization.

For more information, see Deploy GPU sharing inference services.

What's next