
Container Compute Service: Deploy a DeepSeek-R1 model inference service with ACS GPU

Last Updated: Jan 30, 2026

Container Compute Service (ACS) provides GPU-accelerated compute power without requiring deep knowledge of underlying hardware or GPU node management. ACS offers simple deployment and pay-as-you-go billing, making it ideal for large language model (LLM) inference tasks while effectively reducing costs. This topic describes how to deploy a production-ready DeepSeek-R1 full model inference service using ACS.

Background

DeepSeek-R1 model

DeepSeek-R1 is DeepSeek's first-generation reasoning model, designed to enhance reasoning capabilities in LLMs through large-scale reinforcement learning. Experimental results show that DeepSeek-R1 excels in mathematical reasoning, programming competitions, and similar tasks, not only outperforming other closed-source models but also matching or exceeding the OpenAI o1 series on certain tasks. DeepSeek-R1 also performs well in knowledge-based tasks and other task types, including creative writing and general question answering. For more information about DeepSeek models, see the DeepSeek AI GitHub repository.

vLLM

vLLM is an efficient, easy-to-use inference framework for LLMs that supports various popular models, including Qwen. vLLM achieves high inference efficiency through optimizations such as PagedAttention, continuous batching, and model quantization. For more information about vLLM, see the vLLM GitHub repository.

Container Compute Service (ACS)

ACS provides accessible, flexible, and elastic next-generation container compute power. As a Kubernetes-based container service, ACS offers general-purpose and heterogeneous compute power that conforms to container specifications. ACS delivers compute power in a serverless model, eliminating the need to manage underlying nodes or clusters. Through integrated scheduling, container runtime, storage, and networking capabilities, ACS reduces operational complexity while optimizing elasticity and flexibility. Pay-as-you-go billing and elastic scaling capabilities significantly reduce resource costs. For LLM inference scenarios, ACS data and image acceleration capabilities further optimize model startup time and resource costs.

Prerequisites

Before you begin:

  1. Create an ACS cluster with the default service role assigned. See Create an ACS cluster.

  2. Configure kubectl to connect to your ACS cluster. See Obtain a cluster kubeconfig file and use kubectl to connect to the cluster.

GPU instance specifications and cost estimation

Deploying the full DeepSeek-R1 model on ACS requires 16 GPUs. We recommend the following resource configuration for a single-instance deployment:

  • GPU: 16 cards (96 GiB of memory per card)

  • CPU: 64 vCPU

  • Memory: 512 GiB

For information about selecting instance specifications, see GPU support in ACS. For billing information, see Billing overview.

Note
  • ACS GPU instance specifications follow the ACS Pod specification logic.

  • ACS Pods provide 30 GiB of free ephemeral storage by default. The inference image used in this topic requires more storage space. To increase ephemeral storage, see Overview of ACS pod instances.
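As a rough sanity check on the sizing above: the INT8-quantized weights of DeepSeek-R1 (about 671 billion parameters, at roughly one byte per parameter) alone occupy on the order of 625 GiB, which is why a single GPU cannot hold the model. The sketch below is a back-of-the-envelope estimate; the parameter count comes from the public DeepSeek-R1 release, while the memory split is illustrative, not an official ACS figure:

```python
# Rough sizing estimate for serving DeepSeek-R1 quantized to INT8.
# The 671B parameter count is from the public DeepSeek-R1 release;
# the headroom split is an illustrative assumption, not an ACS figure.

GIB = 1024**3

params = 671e9             # total parameters in DeepSeek-R1
bytes_per_param = 1        # INT8 quantization: ~1 byte per weight
weights_gib = params * bytes_per_param / GIB

gpus = 16
mem_per_gpu_gib = 96
total_gpu_mem_gib = gpus * mem_per_gpu_gib

# What remains after loading weights must cover the KV cache,
# activations, and CUDA/runtime overhead.
headroom_gib = total_gpu_mem_gib - weights_gib

print(f"weights: ~{weights_gib:.0f} GiB")
print(f"aggregate GPU memory: {total_gpu_mem_gib} GiB")
print(f"headroom for KV cache and runtime: ~{headroom_gib:.0f} GiB")
```

With 16 x 96 GiB = 1536 GiB of aggregate GPU memory, roughly 900 GiB remains for the KV cache and runtime, which is what makes the 128,000-token context length used later in this topic feasible.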

Step 1: Prepare the DeepSeek-R1-GPTQ-INT8 model files

LLMs require substantial disk space due to their large parameter counts. We recommend using NAS or OSS storage volumes to persistently store model files. The following steps use OSS storage as an example.

Note: Submit a ticket to obtain the DeepSeek-R1-GPTQ-INT8 model files and the YAML deployment configuration. In the YAML examples in this topic, replace the following placeholders with your actual values:

  • GPU model: replace the value of the alibabacloud.com/gpu-model-series label with a GPU model supported by ACS

  • Base image: replace the container image with the actual image address

  • Image pull secret: create a Secret and replace the imagePullSecrets name with the actual Secret name

(Optional) Upload the model to OSS

If you downloaded the model files locally, create a directory in OSS and upload the model:

ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8
ossutil cp -r /mnt/models/DeepSeek-R1-GPTQ-INT8 oss://<your-bucket-name>/models/DeepSeek-R1-GPTQ-INT8

For information about installing and using ossutil, see Install ossutil.

Create a PV and PVC

Create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) named llm-model for your cluster. For more information, see Use an ossfs 1.0 statically provisioned volume.

Console

Configure the PV with the following settings:

  • Storage Volume Type: OSS

  • Name: llm-model

  • Access Credentials: the AccessKey ID and AccessKey secret used to access OSS

  • Bucket ID: select your OSS bucket

  • OSS Path: /models/DeepSeek-R1-GPTQ-INT8

Configure the PVC with the following settings:

  • Storage Claim Type: OSS

  • Name: llm-model

  • Allocation Mode: Existing Storage Volume

  • Existing Storage Volume: select the PV you created

kubectl

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/DeepSeek-R1-GPTQ-INT8/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Step 2: Deploy the model using ACS GPU

Deploy the DeepSeek-R1-GPTQ-INT8 model inference service using the vLLM framework with RDMA acceleration by applying the following Deployment and Service. The inference service exposes an OpenAI-compatible HTTP API.

Note:

  • The max-model-len parameter sets the maximum token length the model can process. Increasing this value improves conversation quality but requires more GPU memory. For DeepSeek-R1-GPTQ-INT8, we recommend a maximum context length of around 128,000 tokens.

  • RDMA (Remote Direct Memory Access) enables zero-copy and kernel bypass, achieving lower latency, higher throughput, and lower CPU usage compared to TCP/IP. To use RDMA, add the label alibabacloud.com/hpn-type: rdma. Submit a ticket to confirm which GPU models support RDMA.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: default
  labels:
    app: deepseek-r1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    metadata:
      labels:
        app: deepseek-r1
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/gpu-model-series: <example-model>
        alibabacloud.com/hpn-type: "rdma"
    spec:
      imagePullSecrets:
        - name: <your-secret-name>
      containers:
        - name: llm-ds-r1
          image: <your-image-address>
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - "vllm serve /data/DeepSeek-R1-GPTQ-INT8 --port 8000 --trust-remote-code --served-model-name ds --max-model-len 128000 --quantization moe_wna16 --gpu-memory-utilization 0.98 --tensor-parallel-size 16"
          resources:
            limits:
              alibabacloud.com/gpu: "16"
              cpu: "64"
              memory: 512Gi
            requests:
              alibabacloud.com/gpu: "16"
              cpu: "64"
              memory: 512Gi
          volumeMounts:
            - name: llm-model
              mountPath: /data/DeepSeek-R1-GPTQ-INT8
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      volumes:
        - name: llm-model
          persistentVolumeClaim:
            claimName: llm-model
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1
spec:
  type: ClusterIP
  selector:
    app: deepseek-r1
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Save the configuration as deepseek-r1-deployment.yaml and apply it:

kubectl apply -f deepseek-r1-deployment.yaml

Step 3: Verify the inference service

  1. Set up port forwarding between the inference service and your local environment:

    kubectl port-forward svc/deepseek-r1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000

    Important: Port forwarding with kubectl is suitable for development and debugging only. It does not provide production-level reliability, security, or scalability. For production environments, see Quick Start for ALB Ingress.

  2. Send a test inference request:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ds",
        "messages": [
          {
            "role": "user",
            "content": "Explain the concept of reinforcement learning in simple terms."
          }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10
      }'

    A successful response indicates that the inference service is running correctly.
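Because the service is OpenAI-compatible, any OpenAI-style client can call it, not just curl. The following is a minimal Python sketch using only the standard library against the port-forwarded endpoint; the endpoint URL and the served model name ds come from the steps above, and the payload mirrors the curl example:

```python
# Minimal OpenAI-compatible chat client for the port-forwarded service.
# Assumes `kubectl port-forward svc/deepseek-r1 8000:8000` is running.
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "ds") -> dict:
    """Build an OpenAI-style chat completion payload (mirrors the curl example)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST a chat completion request and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain the concept of reinforcement learning in simple terms."))
```

For application integration, the official OpenAI Python SDK also works against this endpoint by pointing its base URL at the service address.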

Result

You have successfully deployed a DeepSeek-R1 full model inference service on ACS with GPU acceleration. The service exposes an OpenAI-compatible API that you can integrate with your applications.