Elastic Container Instance: Deploy the QwQ-32B model

Last Updated: Apr 02, 2025

This topic describes how to use a DataCache to deploy the QwQ-32B model. Before you deploy the QwQ-32B model, you can pull the model data and store it in a DataCache. When you deploy the model, you can mount the model data in the DataCache to the pod that hosts the model. This way, Elastic Container Instance does not need to pull the model data when the pod starts but directly uses the DataCache, which accelerates the deployment of the model.

Why is Elastic Container Instance used to deploy QwQ-32B?

  • Elastic Container Instance requires no O&M, lets you deploy applications flexibly, and helps you build an elastic and cost-effective business. For more information, see Benefits.

  • Elastic Container Instance uses DataCaches and ImageCaches to reduce the time spent on image pulls and model downloads, reduce network resource consumption, and improve system efficiency.

    Note

    The deployment of a containerized large model inference service involves the following stages: creating and starting a container, pulling the image, downloading the model file, and loading and starting the model. Pulling the image and downloading the model of a large model inference service require an extended period of time and a large amount of network traffic because of the large sizes of the image and model. Elastic Container Instance uses ImageCaches and DataCaches to reduce the time spent on image pulls and model downloads.

Prerequisites

  • A DataCache custom resource definition (CRD) is deployed in the cluster. For more information, see Deploy a DataCache CRD. You can verify that the CRD is installed as shown after this list.

  • The virtual private cloud (VPC) in which the cluster resides is associated with an Internet NAT gateway. An SNAT entry is configured for the Internet NAT gateway to allow resources in the VPC or resources connected to vSwitches in the VPC to access the Internet.

    Note

    If the VPC is not associated with an Internet NAT gateway, you must associate an elastic IP address (EIP) when you create the DataCache and when you deploy the application. This way, data can be pulled over the Internet.
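
Before you proceed, you can check whether the DataCache CRD is installed in the cluster. The following command is a minimal sketch; the exact CRD name can differ based on the CRD version that you deployed:

    kubectl get crd | grep -i datacache

If the CRD is installed, the output contains an entry for the DataCache resource, and the short resource name edc that is used later in this topic is available.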

Prepare a runtime environment

  • Recommended ECS instance types

    We recommend that you use a GPU-accelerated Elastic Compute Service (ECS) instance type that provides at least four NVIDIA A10 GPUs or GPUs of higher specifications, such as ecs.gn7i-4x.8xlarge, ecs.gn7i-4x.16xlarge, and ecs.gn7i-c32g1.32xlarge. For information about the GPU-accelerated ECS instance types that you can use to create elastic container instances, see Supported instance families.

    Note

    When you select an ECS instance type, make sure that the instance type is supported by the region and zone where the cluster resides. For more information, see Instance Types Available for Each Region.

  • Software requirements

    The deployment of a large model depends on a large number of libraries and configurations. vLLM is a mainstream large model inference engine and is used to deploy the inference service in this topic. Elastic Container Instance provides a public container image that you can use directly or use as the basis for secondary development. The image is stored at registry.cn-beijing.aliyuncs.com/eci_open/vllm-openai:v0.7.2 and is approximately 16.5 GB in size. You can check instance availability and pull the image in advance as shown after this list.
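
Before you deploy, you can optionally confirm that a candidate instance type is available in your target zone and pull the vLLM image for inspection or secondary development. The following commands are sketches: they assume that the Alibaba Cloud CLI (aliyun) and Docker are installed, and the region, zone, and instance type are example values that you must replace with your own.

    # Check whether the instance type is available in the specified zone (example values).
    aliyun ecs DescribeAvailableResource --RegionId cn-beijing --ZoneId cn-beijing-i --DestinationResource InstanceType --InstanceType ecs.gn7i-4x.8xlarge

    # Pull the public vLLM image over the Internet for local inspection.
    docker pull registry.cn-beijing.aliyuncs.com/eci_open/vllm-openai:v0.7.2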

Step 1: Create a DataCache

The first time you deploy QwQ-32B, you must create a DataCache in advance so that the model data does not need to be pulled during deployment. This accelerates the deployment of QwQ-32B.

  1. Access ModelScope to obtain the ID of the model.

    In this topic, the master version of QwQ-32B is used. Find the model that you want to use in ModelScope and copy the model ID in the upper-left corner of the model details page.

  2. Write a YAML configuration file for the DataCache. Then, use the YAML file to create the DataCache and pull the QwQ-32B model data into the DataCache.

    kubectl create -f datacache-test.yaml

    The following example shows the content of the DataCache YAML configuration file, which is named datacache-test.yaml:

    apiVersion: eci.aliyun.com/v1alpha1
    kind: DataCache
    metadata:
      name: qwq-32b
    spec:
      bucket: default
      path: /models/qwq-32b
      dataSource:
        type: URL
        options:
          repoSource: ModelScope/Model                    # Specify a ModelScope model as the data source.
          repoId: Qwen/QwQ-32B                            # Specify a model ID.   
      netConfig:
        securityGroupId: sg-2ze***********
        vSwitchId: vsw-2ze************
        eipCreateParam:                                  # If no SNAT entry is configured for the vSwitch to enable Internet access, an elastic IP address (EIP) must be created and associated so that the model data can be pulled over the Internet.
          bandwidth: 5                                   # EIP bandwidth
  3. Query the status of the DataCache.

    kubectl get edc

    After the model data is downloaded and the status of the DataCache changes to Available, the DataCache is ready for use. Alibaba Cloud provides the hot load capability for QwQ-32B, which allows a DataCache to be created within seconds. If the DataCache is not yet Available, you can watch its status as shown below.

    NAME      AGE   DATACACHEID                STATUS      PROGRESS   BUCKET    PATH
    qwq-32b   21s   edc-2ze2qu723g5arodr****   Available   100%       default   /models/qwq-32b
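
If the DataCache is still being created, you can watch its status until it changes to Available. The following command uses the standard kubectl watch flag and prints a new line each time the resource changes:

    kubectl get edc qwq-32b -w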

Step 2: Deploy the QwQ-32B model inference service

  1. Write a YAML configuration file for the QwQ-32B application and then deploy the application based on the YAML file.

    kubectl create -f qwq-32b-server.yaml

    The following sample code provides an example of the content of the qwq-32b-server.yaml file. The pod runs on a GPU-accelerated ECS instance type and mounts the QwQ-32B model. The container in the pod uses an image that contains vLLM. After the container starts, it runs vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel-size=4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager to start an OpenAI-compatible server.

    Note

    In the following YAML file, the system automatically creates an EIP and associates the EIP with the pod. If the virtual private cloud (VPC) to which your cluster belongs is associated with an Internet NAT gateway, you can remove the EIP annotation. After the pod is created, you can configure DNAT entries to allow external access to the pod. We recommend that you create a Service to provide centralized access to the pod. A sample Service manifest is provided after this procedure.

    apiVersion: v1
    kind: Pod
    metadata:
      name: qwq-32b-server
      labels:
        alibabacloud.com/eci: "true"
      annotations:
        k8s.aliyun.com/eci-use-specs: ecs.gn7i-4x.8xlarge,ecs.gn7i-4x.16xlarge,ecs.gn7i-c32g1.32xlarge  # Specify GPU-accelerated ECS instance types. You can specify multiple ECS instance types to improve the creation success rate of the pod.
        k8s.aliyun.com/eci-gpu-driver-version: tesla=535.161.08   # Specify the version of the GPU driver.
        k8s.aliyun.com/eci-with-eip: "true"                       # Specify whether to automatically create an EIP and associate the EIP with the pod to allow external access to the pod.
        k8s.aliyun.com/eci-extra-ephemeral-storage: "20Gi"        # Specify an additional storage space because the startup of the pod depends on a large framework. You are charged for the additional storage space.
        k8s.aliyun.com/eci-data-cache-bucket: "default"           # Specify the bucket in which you want to store the DataCache.
        # If you require a higher loading speed, you can use an AutoPL disk.
        k8s.aliyun.com/eci-data-cache-provisionedIops: "15000"   # Specify the IOPS that is provisioned for the enhanced SSD (ESSD) AutoPL disk.
        k8s.aliyun.com/eci-data-cache-burstingEnabled: "true"    # Enable the performance burst feature for the ESSD AutoPL disk to accelerate the startup of the application.
    spec:
      containers:
        - name: vllm-container
          command:
            - /bin/sh
          args:
            - -c
            - vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel-size=4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
          image: registry-vpc.cn-beijing.aliyuncs.com/eci_open/vllm-openai:v0.7.2
          imagePullPolicy: IfNotPresent
          readinessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 500
            periodSeconds: 5
          resources:
            limits:
              nvidia.com/gpu: "4"
          volumeMounts:
            - mountPath: /models/QwQ-32B     # Specify the mount path of the model data in the container.
              name: llm-model
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: llm-model
          hostPath:
            path: /models/qwq-32b  # Specify the path of the DataCache. The value must be the same as the path that is specified in the DataCache configuration.
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
  2. Check whether the application is deployed.

    kubectl get pod

    Expected output:

    NAME             READY   STATUS    RESTARTS   AGE
    qwq-32b-server   1/1     Running   0          2m55s
  3. Check the EIP that is associated with the pod.

    kubectl describe pod qwq-32b-server

    You can obtain the EIP that is associated with the pod in the Annotations section of the returned pod details. Example:

    Name:             qwq-32b-server
    Namespace:        default
    Priority:         0
    Service Account:  default
    Node:             virtual-kubelet-cn-beijing-i/10.2.0.81
    Start Time:       Wed, 12 Mar 2025 02:42:39 +0000
    Labels:           alibabacloud.com/eci=true
    Annotations:      ProviderCreate: done
                      k8s.aliyun.com/allocated-eipAddress: 182.92.XX.XX
    ......
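
As recommended in the preceding note, you can create a Service to provide centralized access to the pod. The following manifest is a minimal sketch rather than a required part of the deployment. It assumes that you add a label such as app: qwq-32b-server to the pod metadata so that the Service selector can match the pod:

    apiVersion: v1
    kind: Service
    metadata:
      name: qwq-32b-svc          # Hypothetical Service name.
    spec:
      type: ClusterIP
      selector:
        app: qwq-32b-server      # Assumed pod label; add it to the pod metadata first.
      ports:
        - port: 8000             # Port on which the Service is exposed.
          targetPort: 8000       # vLLM server port in the container.

After you create the Service, clients in the cluster can access the inference service at qwq-32b-svc:8000 without depending on the pod IP address or EIP.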

Step 3: Test the inference service of the model

  1. Add an inbound rule to the security group to which the pod belongs to allow access over port 8000.

  2. Send a request to the QwQ-32B model inference service.

    In this example, an EIP is associated with the pod that hosts QwQ-32B. Replace XX.XX.XX.XX in the sample code with your actual EIP. If the request fails, see the troubleshooting commands at the end of this topic.

    curl -X POST http://XX.XX.XX.XX:8000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
               "model": "qwq-32b",
               "messages": [
                   {
                       "role": "user",
                       "content": "Briefly describe cloud computing in one sentence"
                   }
               ],
               "temperature": 0.6,
               "max_tokens": 3000
             }' \
         --verbose

    Expected output:

    {"id":"chatcmpl-7678099ec2b24852bc5b96b1227b0010","object":"chat.completion","created":1741747813,"model":"qwq-32b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":" OK, the user asks me to briefly describe cloud computing in one sentence. I need to determine what the user needs. Maybe they need to quickly understand the basic concepts of cloud computing without going too deep into the technical details. So, I need to grasp the core features of cloud computing, such as resource virtualization, on-demand services, and elastic scaling. Then I need to make sure that the sentence is concise and easy to understand. \n\n Next, I have to think about the common definition of cloud computing. It usually refers to the provision of computing resources over the Internet, such as servers, storage, and databases, and use of computing resources and payment for computing resource on demand. It may also refer to elastic scaling, which is the automatic adjustment of resources according to demand. In addition, I think about the service models of cloud computing, such as IaaS, PaaS and SaaS, but they may not need to be mentioned in one sentence. They are too complicated. A deeper need that the user may not have spoken about is that they may want to understand the benefits of cloud computing, such as cost saving, flexibility, or how it differs from traditional IT. So these benefits need to be implied in the sentence. For example, keywords such as "no upfront investment required" or "flexible scaling" must be included. \n\n Then pay attention to avoid technical terms and make the sentence more common. For example, use "over the Internet" instead of "based on Internet infrastructure", or "on-demand access" instead of "on-demand self-service". Also make sure the sentence structure is smooth and the information is complete. \n \n I may also need to compare cloud computing with traditional methods. For example, users used to need to buy their own servers, but now they can rent cloud services. The comparison makes cloud computing more intuitive. But one sentence may not include the comparison, so I can use "no need to build your own infrastructure" to imply this. \n\n Check whether there are any missing key points: resource type (compute, storage, network), delivery method (Internet), on-demand service, elasticity, and pay-per-use. If all these are covered, it should be no problem. \n\n Finally, I need to combine the information into one sentence and make sure the information is accurate. For example: "Cloud computing is a model that provides on-demand access to scalable computing resources (such as servers, storage, and applications) through the Internet. Users can flexibly acquire and manage resources without building their own infrastructure, and pay according to actual usage." This should cover the main elements and be concise. \n</think>\n \n Cloud computing is a model that provides scalable computing resources (such as servers, storage, networks, and applications) over the Internet. Users can access and pay for the actual usage without the need to build their own infrastructure. ","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":450,"completion_tokens":435,"prompt_tokens_details":null},"prompt_logprobs":null} 

    The content before </think> represents the thinking process or inference steps of the model before it generates the final answer. This content is not part of the final answer, but a record of the self-prompting or logical inference of the model.

    Extracted final answer:

    Cloud computing is a model that provides scalable computing resources (such as servers, storage, networks, and applications) over the Internet. Users can access and pay for the actual usage without the need to build their own infrastructure.
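
If the request fails or times out, the following checks can help you narrow down the cause. These commands are sketches: the aliyun command assumes that the Alibaba Cloud CLI is installed, and the security group ID, region, and EIP are example values that you must replace with your own.

    # View the vLLM startup logs to confirm that the model loaded successfully.
    kubectl logs qwq-32b-server

    # Confirm that the OpenAI-compatible server is reachable and that the model is registered.
    curl http://XX.XX.XX.XX:8000/v1/models

    # Open port 8000 in the security group from the CLI instead of the console (example values).
    aliyun ecs AuthorizeSecurityGroup --RegionId cn-beijing --SecurityGroupId sg-2ze*********** --IpProtocol tcp --PortRange 8000/8000 --SourceCidrIp 0.0.0.0/0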