All Products
Search
Document Center

Container Compute Service:Build a QwQ-32B model inference service using ACS GPU computing power

Last Updated:Mar 26, 2026

Container Compute Service (ACS) lets you deploy LLM inference services without managing GPU-accelerated nodes or configuring underlying hardware. All configurations are ready to use out of the box, and billing is pay-as-you-go. This topic describes how to deploy a QwQ-32B inference service on ACS using vLLM, and how to access it through Open WebUI.

Background

QwQ-32B

Alibaba Cloud's QwQ-32B model features 32 billion parameters. In math and coding benchmarks (AIME 24/25 and LiveCodeBench) and general performance indicators, QwQ-32B matches the DeepSeek-R1 full version (671 billion parameters) and surpasses DeepSeek-R1-Distill-Qwen-32B developed based on Qwen2.5-32B. For more information, see QwQ-32B.

vLLM

vLLM is a high-performance LLM inference framework that supports most commonly used LLMs, including the Qwen series. It uses PagedAttention optimization, continuous batching, and model quantization to improve inference efficiency. See vLLM GitHub repository.

Open WebUI

Open WebUI is an extensible, self-hosted AI platform designed to run offline. It supports LLM frameworks such as Ollama and APIs compatible with OpenAI, and includes a built-in inference engine for Retrieval-Augmented Generation (RAG).

Prerequisites

Before you begin, make sure that you have:

GPU instance specification and estimated cost

GPU memory is consumed by model parameters during inference. Use the following formula to estimate requirements:

*GPU memory = 32 × 10^9 × 2 bytes ≈ 59.6 GiB*

The calculation uses the model's 32 billion parameters at 16-bit floating-point precision (2 bytes per value). Beyond loading the model weights, you also need GPU memory for the KV cache and GPU utilization headroom. The suggested specification is 1 GPU with 80 GiB of memory, 16 vCPUs, and 128 GiB of memory. For a full list, see the Table of suggested specifications and GPU models and specifications. For billing details, see Billing overview.

Make sure that the specification complies with ACS pod specification adjustment logic.
By default, an ACS pod provides 30 GiB of free ephemeral storage. The inference image inference-nv-pytorch:25.02-vllm0.7.2-sglang0.4.3.post2-pytorch2.5-cuda12.4-20250305-serverless is 9.5 GiB in size. If you need more storage space, customize the ephemeral storage size. See Add the EphemeralStorage.

Deploy QwQ-32B on ACS

Submit a ticket to copy the QwQ-32B model files (~120 GiB) directly to your OSS bucket, bypassing the 2–3 hour download and upload process. You can also use the ticket to check supported GPU models.
GPU model: Replace <example-model> in the alibabacloud.com/gpu-model-series: <example-model> label with your actual GPU model. See Specify GPU models and driver versions for ACS GPU-accelerated pods.
RDMA: RDMA (Remote Direct Memory Access) uses zero-copy and kernel bypass to reduce latency and CPU usage while increasing throughput compared to TCP/IP. Add the alibabacloud.com/hpn-type: "rdma" label to enable RDMA. For supported GPU models, submit a ticket or see High-performance RDMA networks.

Step 1: Prepare model data

Large model files require persistent storage to avoid re-downloading on pod restarts. Storing the QwQ-32B model (~120 GiB) in OSS separates storage from compute, simplifies model updates, and eliminates time-consuming downloads at startup. The model is mounted directly into the inference container at runtime.

  1. Download the QwQ-32B model.

    Check whether the git-lfs plugin is installed. If not, run yum install git-lfs or apt-get install git-lfs to install it. See Install git-lfs.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/QwQ-32B.git
    cd QwQ-32B
    git lfs pull
  2. Create an OSS directory and upload the model files. Uploading ~120 GiB typically takes 2–3 hours.

    To install and use ossutil, see Install ossutil.
    ossutil mkdir oss://<your-bucket-name>/models/QwQ-32B
    ossutil cp -r ./QwQ-32B oss://<your-bucket-name>/models/QwQ-32B
  3. Create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) named llm-model. For more information, see Mount a statically provisioned OSS volume.

    Use the console

    The following table lists the parameters for creating the PV.

    ParameterValue
    PV TypeOSS
    Volume Namellm-model
    Access CertificateThe AccessKey ID and AccessKey secret for accessing the OSS bucket
    Bucket IDThe OSS bucket you created
    OSS PathThe model path, such as /models/QwQ-32B

    The following table lists the parameters for creating the PVC.

    ParameterValue
    PVC TypeOSS
    Volume Namellm-model
    Allocation ModeExisting Volumes
    Existing VolumesSelect the PV you created

    Use kubectl

    Apply the following YAML manifest:

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak>      # The AccessKey ID used to access the OSS bucket.
      akSecret: <your-oss-sk>  # The AccessKey secret used to access the OSS bucket.
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <your-bucket-name>      # The name of the OSS bucket.
          url: <your-bucket-endpoint>     # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <your-model-path>         # The model path, such as /models/QwQ-32B/.
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model

Step 2: Deploy the model

Deploy the QwQ-32B inference service using vLLM. The Deployment mounts the OSS-backed PVC to /models/QwQ-32B and starts the vLLM server, which exposes an OpenAI-compatible HTTP API on port 8000.

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} is a public image address. To speed up image pulls, use VPC to accelerate the pulling of AI container images.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwq-32b
    alibabacloud.com/compute-class: gpu
    alibabacloud.com/gpu-model-series: <example-model>
    alibabacloud.com/hpn-type: "rdma"
  name: qwq-32b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/gpu-model-series: <example-model>
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager
        image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.02-vllm0.7.2-sglang0.4.3.post2-pytorch2.5-cuda12.4-20250305-serverless
        name: vllm
        ports:
        - containerPort: 8000
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "16"
            memory: 128G
        volumeMounts:
          - mountPath: /models/QwQ-32B
            name: model
          - mountPath: /dev/shm
            name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b-v1
spec:
  type: ClusterIP
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: qwq-32b
EOF

The key vLLM parameters used in the command above:

ParameterDescription
--port 8000Port on which the vLLM server listens
--trust-remote-codeAllows the model to run custom code from the model repository
--served-model-name qwq-32bThe model name used in API requests
--max-model-len 32768Maximum token sequence length (input + output). Increase for longer contexts, at the cost of higher GPU memory usage
--gpu-memory-utilization 0.95Fraction of GPU memory reserved for the model and KV cache. Reserving 5% prevents out-of-memory errors
--enforce-eagerDisables CUDA graph capture and runs in eager mode. Reduces startup time and memory overhead

Step 3: Deploy Open WebUI

Deploy Open WebUI to provide a browser-based chat interface connected to the vLLM inference service.

kubectl apply -f- << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
      - env:
        - name: ENABLE_OPENAI_API
          value: "True"
        - name: ENABLE_OLLAMA_API
          value: "False"
        - name: OPENAI_API_BASE_URL
          value: http://qwq-32b-v1:8000/v1
        - name: ENABLE_AUTOCOMPLETE_GENERATION
          value: "False"
        - name: ENABLE_TAGS_GENERATION
          value: "False"
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/open-webui:main
        name: openwebui
        ports:
        - containerPort: 8080
          protocol: TCP
        volumeMounts:
        - mountPath: /app/backend/data
          name: data-volume
      volumes:
      - emptyDir: {}
        name: data-volume
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
  labels:
    app: openwebui
spec:
  type: ClusterIP
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: openwebui
EOF

Step 4: Verify the inference service

  1. Set up port forwarding from your local machine to the Open WebUI service.

    Port forwarding with kubectl port-forward is intended for development and debugging only—it is not reliable, secure, or scalable in production. For production networking in ACK clusters, see Ingress management.
    kubectl port-forward svc/openwebui 8080:8080

    Expected output:

    Forwarding from 127.0.0.1:8080 -> 8080
    Forwarding from [::1]:8080 -> 8080
  2. Open http://localhost:8080 in a browser and log in to Open WebUI. The first login requires you to create an administrator username and password. After logging in, enter a prompt to verify that the model responds correctly.

    image

(Optional) Step 5: Run stress tests

The stress test dataset is downloaded from the internet. Make sure the pod has internet access. See Enable internet access for an ACS cluster or Mount an independent EIP for pods.
  1. Deploy the benchmark tool.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-benchmark
      labels:
        app: vllm-benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-benchmark
      template:
        metadata:
          labels:
            app: vllm-benchmark
        spec:
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
          containers:
          - name: vllm-benchmark
            image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-benchmark:v1
            command:
            - "sh"
            - "-c"
            - "sleep inf"
            volumeMounts:
            - mountPath: /models/QwQ-32B
              name: llm-model
    EOF
  2. Log in to the benchmark pod and download the test dataset.

    # Log in to the benchmark pod.
    PODNAME=$(kubectl get po -o custom-columns=":metadata.name"|grep "vllm-benchmark")
    kubectl exec -it $PODNAME -- bash
    
    # Download the stress test dataset.
    pip3 install modelscope
    modelscope download --dataset gliang1001/ShareGPT_V3_unfiltered_cleaned_split ShareGPT_V3_unfiltered_cleaned_split.json --local_dir /root/
  3. Run the stress test. The following command tests with input_length=4096, output_length=512, concurrency=8, and num_prompts=80.

    MetricValueDescription
    Request throughput0.17 req/sCompleted requests per second
    Output token throughput85.11 tok/sGenerated tokens per second
    Total token throughput790.18 tok/sInput + output tokens per second
    Mean TTFT10,315.97 msAverage latency before the first token is generated
    Mean TPOT71.03 msAverage time to generate each output token after the first
    Mean ITL71.02 msAverage time between consecutive tokens
    python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/QwQ-32B \
    --served-model-name qwq-32b \
    --trust-remote-code \
    --dataset-name random \
    --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
    --random-input-len 4096 \
    --random-output-len 512 \
    --random-range-ratio 1 \
    --num-prompts 80 \
    --max-concurrency 8 \
    --host qwq-32b-v1 \
    --port 8000 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    Starting initial single prompt test run...
    Initial test run completed. Starting main benchmark run...
    Traffic request rate: inf
    Burstiness factor: 1.0 (Poisson process)
    Maximum request concurrency: 8
    100%|██████████| 80/80 [07:44<00:00,  5.81s/it]
    ============ Serving Benchmark Result ============
    Successful requests:                     80
    Benchmark duration (s):                  464.74
    Total input tokens:                      327680
    Total generated tokens:                  39554
    Request throughput (req/s):              0.17
    Output token throughput (tok/s):         85.11
    Total Token throughput (tok/s):          790.18
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          10315.97
    Median TTFT (ms):                        12470.54
    P99 TTFT (ms):                           17580.34
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          71.03
    Median TPOT (ms):                        66.24
    P99 TPOT (ms):                           95.95
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           71.02
    Median ITL (ms):                         58.12
    P99 ITL (ms):                            60.26
    ==================================================

    Key metrics:

(Optional) Step 6: Clean up

Delete all workloads and storage resources when the inference service is no longer needed.

  1. Delete the inference workloads and services.

    kubectl delete deployment qwq-32b
    kubectl delete service qwq-32b-v1
    kubectl delete deployment openwebui
    kubectl delete service openwebui
    kubectl delete deployment vllm-benchmark
  2. Delete the PV and PVC.

    kubectl delete pvc llm-model
    kubectl delete pv llm-model

    Expected output:

    persistentvolumeclaim "llm-model" deleted
    persistentvolume "llm-model" deleted

What's next

References