
Container Service for Kubernetes:Deploy a DeepSeek distilled model inference service on ACK

Last Updated: Feb 02, 2026

This topic describes how to use KServe to deploy a production-ready DeepSeek model inference service in Alibaba Cloud Container Service for Kubernetes (ACK).

Background information

DeepSeek-R1 model

DeepSeek-R1 is DeepSeek's first-generation reasoning model. It is designed to improve the reasoning capabilities of large language models (LLMs) through large-scale reinforcement learning. Experiments show that DeepSeek-R1 performs well on multiple tasks, such as mathematical reasoning and programming competitions. It not only surpasses other closed-source models but also approaches or exceeds the OpenAI o1 series on certain tasks. DeepSeek-R1 also excels in knowledge-based tasks and other broad task types, including creative writing and general Q&A. In addition, DeepSeek distills the reasoning capabilities into smaller models by fine-tuning existing models, such as Qwen and Llama. The distilled 14B model significantly surpasses the open-source QwQ-32B model, and the distilled 32B and 70B models set new records among dense models. For more information about the DeepSeek models, see the DeepSeek AI GitHub repository.

KServe

KServe is an open-source, cloud-native model service platform. It simplifies the process of deploying and running machine learning models on Kubernetes. It supports multiple machine learning frameworks and provides elastic scaling capabilities. KServe uses simple YAML files to provide declarative APIs for model deployment. This makes it easier to configure and manage model services. For more information about the KServe open source project, see KServe.

Arena

Arena is a lightweight, Kubernetes-based solution for machine learning. It supports the complete machine learning lifecycle, including data preparation, model development, model training, and model prediction, to improve the efficiency of data scientists. Arena is deeply integrated with Alibaba Cloud's basic cloud services and supports services such as GPU sharing and CPFS. You can run deep learning frameworks optimized by Alibaba Cloud to maximize the performance and cost-effectiveness of Alibaba Cloud's heterogeneous devices. For more information about Arena, see the Arena GitHub repository.

Prerequisites

GPU instance specifications and cost estimation

Model parameters are the main consumer of GPU memory during the inference phase. You can estimate the required GPU memory using the following formula:

GPU memory required (GB) ≈ Number of parameters (in billions) × Number of bytes per parameter

For example, consider a 7B model with a default precision of FP16. The number of model parameters is 7 billion, and each parameter occupies 2 bytes (a 16-bit floating-point number / 8 bits per byte). Loading the model weights therefore requires approximately 7 × 2 = 14 GB of GPU memory.

In addition to the GPU memory required to load the model, you must also account for the KV cache size and GPU utilization during computation, so a buffer is typically reserved. Therefore, we recommend that you use a GPU-accelerated instance with 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge. For more information about GPU-accelerated instance types and billing, see GPU-accelerated compute-optimized instance families and Elastic GPU Service billing.
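The estimate above can be sketched in a few lines of Python. The function covers only the model weights; the 1.5× headroom factor for the KV cache and runtime buffers is an illustrative assumption, not a vLLM rule:

```python
def estimate_weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Estimate the GPU memory (GB) needed just to load the model weights.

    bytes_per_param is 2 for FP16/BF16, 1 for INT8, and 4 for FP32.
    """
    return params_billions * bytes_per_param


# DeepSeek-R1-Distill-Qwen-7B at FP16: 7 billion parameters x 2 bytes = 14 GB.
weights_gb = estimate_weight_memory_gb(7, 2)
print(weights_gb)  # 14

# Reserve extra headroom for the KV cache and runtime buffers
# (the 1.5x factor is an assumption for illustration only).
print(weights_gb * 1.5)  # 21.0 -> fits on a 24 GiB GPU
```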

Model deployment

Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files

  1. Run the following command to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.

    Note

    Make sure that the git-lfs plug-in is installed. If it is not installed, you can run yum install git-lfs or apt-get install git-lfs. For more information about installation methods, see Install Git Large File Storage.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
    cd DeepSeek-R1-Distill-Qwen-7B/
    git lfs pull
  2. Create a directory in OSS and upload the model to OSS.

    Note

    For more information about how to install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
    ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
  3. Create a persistent volume (PV) and a persistent volume claim (PVC). Configure a PV and a PVC named llm-model for the target cluster. For more information, see Use ossfs 1.0 statically provisioned volumes.

    Console example

    The following table describes the basic configurations for the sample PV.

    Configuration item    Description
    PV Type               OSS
    Name                  llm-model
    Access Certificate    Configure the AccessKey ID and AccessKey secret used to access OSS.
    Bucket ID             Select the OSS bucket that you created in the previous step.
    OSS Path              Select the path where the model is stored, such as /models/DeepSeek-R1-Distill-Qwen-7B.

    The following table describes the basic configurations for the sample PVC.

    Configuration item    Description
    PVC Type              OSS
    Name                  llm-model
    Allocation Mode       Select an existing PV.
    Existing Volumes      Click the link, and then select the PV that you created in the previous step.

    kubectl example

    The following is a sample YAML file:

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak> # The AccessKey ID to access OSS.
      akSecret: <your-oss-sk> # The AccessKey secret to access OSS.
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi 
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <your-bucket-name> # The bucket name.
          url: <your-bucket-endpoint> # The Endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <your-model-path> # In this example, the path is /models/DeepSeek-R1-Distill-Qwen-7B/.
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model

Step 2: Deploy the inference service

  1. Run the following command to start the inference service named deepseek.

    arena serve kserve \
        --name=deepseek \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
        "vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

    The following table describes the parameters.

    Parameter    Required    Description
    --name       Yes         The name of the inference service to submit. The name must be globally unique.
    --image      Yes         The image address of the inference service.
    --gpus       No          The number of GPUs required by the inference service. Default value: 0.
    --cpu        No          The number of CPUs required by the inference service.
    --memory     No          The amount of memory required by the inference service.
    --data       No          The model data to mount. In this example, the llm-model PVC that you created in the previous step is mounted to the /models/DeepSeek-R1-Distill-Qwen-7B directory in the container.

    Expected output:

    inferenceservice.serving.kserve.io/deepseek created
    INFO[0003] The Job deepseek has been submitted successfully
    INFO[0003] You can run `arena serve get deepseek --type kserve -n default` to check the job status
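Under the hood, the arena command creates a KServe InferenceService object. The following manifest is a rough sketch of what an equivalent object could look like: the field layout follows the KServe v1beta1 custom-container API, and the exact object that arena generates may differ, so verify it with `kubectl get inferenceservice deepseek -o yaml`.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6
        command: ["sh", "-c"]
        args:
          - vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager
        resources:
          limits:
            nvidia.com/gpu: "1"  # mirrors --gpus=1
            cpu: "4"             # mirrors --cpu=4
            memory: 12Gi         # mirrors --memory=12Gi
        volumeMounts:
          - name: llm-model
            mountPath: /models/DeepSeek-R1-Distill-Qwen-7B
    volumes:
      - name: llm-model
        persistentVolumeClaim:
          claimName: llm-model  # the PVC created in Step 1
```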

Step 3: Verify the inference service

  1. Run the following command to check the deployment status of the KServe inference service.

    arena serve get deepseek

    Expected output:

    Name:       deepseek
    Namespace:  default
    Type:       KServe
    Version:    1
    Desired:    1
    Available:  1
    Age:        3m
    Address:    http://deepseek-default.example.com
    Port:       :80
    GPU:        1
    
    
    Instances:
      NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                 ------   ---  -----  --------  ---  ----
      deepseek-predictor-7cd4d568fd-fznfg  Running  3m   1/1    0         1    cn-beijing.172.16.1.77

    The output indicates that the KServe inference service is deployed.

  2. Run the following command to use the IP address of the NGINX Ingress gateway to access the inference service.

    # Obtain the IP address of the NGINX Ingress.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Obtain the hostname of the inference service.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice deepseek -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to access the inference service.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"chatcmpl-0fe3044126252c994d470e84807d4a0a","object":"chat.completion","created":1738828016,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n\n</think>\n\nIt seems like you're testing or sharing some information. How can I assist you further? If you have any questions or need help with something, feel free to ask!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":48,"completion_tokens":39,"prompt_tokens_details":null},"prompt_logprobs":null}
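The same request can be issued from Python using only the standard library. The sketch below builds the OpenAI-compatible request; the gateway address and hostname are placeholders that you must substitute with the values obtained from the kubectl commands above:

```python
import json
import urllib.request


def build_chat_request(base_url: str, host: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the vLLM service."""
    payload = {
        "model": "deepseek-r1",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # The Host header routes the request through the NGINX Ingress.
        headers={"Host": host, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholders: substitute the NGINX Ingress IP and the service hostname.
req = build_chat_request(
    "http://<NGINX_INGRESS_IP>", "deepseek-default.example.com", "Say this is a test!"
)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```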

Observability

Observability for LLM inference services in a production environment is crucial for proactively finding and resolving issues. The vLLM framework provides many LLM inference metrics. For more information, see the Metrics document. KServe also provides metrics to help monitor the performance and health of model services. These capabilities are integrated into Arena. You can add the --enable-prometheus=true parameter when you submit the application to enable them.

arena serve kserve \
    --name=deepseek \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --enable-prometheus=true \
    --data=llm-model:/models/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"
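Once metrics are enabled, the vLLM container exposes them in Prometheus text exposition format on its metrics endpoint. The following minimal parser illustrates that format; the sample lines show the `vllm:` metric-family naming convention and are illustrative, not a verbatim scrape:

```python
def parse_prometheus_text(text: str) -> dict:
    """Parse simple Prometheus exposition lines into {metric_name: value}.

    Handles plain `name value` and `name{labels} value` samples and skips
    comments (# HELP / # TYPE) and blank lines. Minimal sketch only; it does
    not handle label values that contain spaces.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the label set, if any
        metrics[name] = float(value)
    return metrics


# Illustrative sample; a real vLLM scrape contains many more metric families.
sample = """
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-r1"} 1.0
vllm:num_requests_waiting{model_name="deepseek-r1"} 0.0
"""
print(parse_prometheus_text(sample))
```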

You can use a Grafana dashboard to monitor the LLM inference service deployed with vLLM. To do so, import the vLLM Grafana JSON model into Grafana to create an observability dashboard for the LLM inference service. You can obtain the JSON model from the vLLM official website.

Procedure for importing a Grafana dashboard

Import the dashboard

  1. Log on to the ARMS console.

  2. In the left navigation pane, click Integration Management.

  3. On the Integrated Environments tab, select Container Service, search for the environment by ACK cluster name, and then click the target environment.


  4. On the Component Management tab, copy the Cluster ID, and then click the link next to Dashboard Directory.


  5. To the right of the Dashboards tab, click the Import button.


  6. Copy the contents of the grafana.json file, paste it into the Import via panel json area, and then click the Load button.

    Note

    You can also import the dashboard by uploading the JSON file.


  7. Keep the default settings and click Import to import the LLM inference service observability dashboard.

Verify the dashboard data

  1. Search for the data source using the copied cluster ID or the Prometheus instance ID, and then select the target data source.


  2. Send several requests to access the inference service to simulate service traffic and verify the data on the LLM inference service observability dashboard, such as Token Throughput.


Elastic scaling

When you deploy and manage KServe model services, you may experience dynamic load fluctuations. KServe uses the Kubernetes Horizontal Pod Autoscaler (HPA) and the ack-alibaba-cloud-metrics-adapter component from ACK to automatically scale the number of model service pods based on CPU, memory, and GPU utilization, along with custom performance metrics, to ensure service stability and efficiency. For more information, see Configure auto scaling for a service.

Model acceleration

With the development of technology, the size of models used in AI applications is increasing. When you pull large files from storage services such as Object Storage Service (OSS) and File Storage NAS (NAS), issues such as high latency or cold starts may occur. You can use Fluid to significantly accelerate model loading speed and optimize the performance of inference services, especially KServe-based inference services. For more information, see Use Fluid for model acceleration.

Phased release

Phased release is a critical strategy in production environments to ensure business stability and minimize risks associated with changes. ACK supports various phased release policies, including traffic percentage-based and request header-based approaches. For more information, see Implement phased release for inference services.

GPU sharing inference

The DeepSeek-R1-Distill-Qwen-7B model requires only 14 GB of GPU memory. If you use a higher-spec GPU, consider using GPU sharing inference technology to improve GPU utilization. This technology partitions a GPU so that multiple inference services can share it, which improves overall GPU utilization. For more information, see Deploy GPU sharing inference services.
