ACK Auto Mode clusters support Auto Mode node pools. By combining them with the on-demand elasticity of Knative Serving, you can deploy the Qwen3.5-4B large language model as an on-demand serverless inference service. After deployment, you do not need to manage GPU resources manually, which suits cost-sensitive model inference scenarios where operational complexity must stay low.
The workflow combines two mechanisms:
- The Auto Mode node pool manages GPU node creation and release.
- Knative Serving scales pods based on request concurrency (concurrency) or requests per second (rps).
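As a rough illustration of the second mechanism, the sketch below estimates how many pods a concurrency-based or rps-based autoscaler would aim for: the observed load divided by the per-pod target, rounded up and clamped to the minimum and maximum replica counts. The function name desired_replicas and the sample numbers are hypothetical, and the calculation is a deliberate simplification. The real Knative autoscaler also applies a target utilization percentage, averaging windows, and a panic mode.

```shell
# Simplified estimate of the replica count an autoscaler aims for.
# Assumption: desired = ceil(observed / target), clamped to [min, max].
# Target utilization, stable/panic windows, and scale-down delays
# are ignored in this sketch.
desired_replicas() {
  observed=$1   # total in-flight requests (concurrency) or requests per second (rps)
  target=$2     # per-pod target value
  min=$3        # minimum replicas (0 allows scale-to-zero)
  max=$4        # maximum replicas
  # Integer ceiling division.
  n=$(( (observed + target - 1) / target ))
  [ "$n" -lt "$min" ] && n=$min
  [ "$n" -gt "$max" ] && n=$max
  echo "$n"
}

desired_replicas 25 10 0 5   # 25 concurrent requests, target 10 per pod: prints 3
desired_replicas 0 10 0 5    # no traffic, minimum 0: prints 0
desired_replicas 90 10 0 5   # 9 pods wanted, capped by the maximum: prints 5
```

When the result drops to 0 and no pods remain on a GPU node, the Auto Mode node pool can release that node, which is where the cost saving comes from.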
Step 1: Create an ACK Auto Mode cluster and GPU node pool
1. Create a cluster
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, click Create Kubernetes Cluster. On the ACK Managed Cluster page, enable Auto Mode.
  After you enable this mode, the page displays the three core capabilities of Auto Mode:
  - Fully managed operations: fully managed control plane, automatic version upgrades, and maintenance-free nodes with auto-healing.
  - Automatic node scaling: on-demand elastic scaling, automatic instance type matching, and optimized resource costs.
  - Highly optimized node operating system: container-optimized OS for fast startup, immutable file system, and security best practices by default.
- Configure the settings and click Create Kubernetes Cluster.
  For details, see Create an ACK Auto Mode cluster.
2. Create a GPU node pool
- On the ACK Clusters page, click the name of your cluster. In the left navigation pane, go to the Node Pools page.
- On the Node Pools page, click Create Node Pool and configure the node pool in the Create Node Pool dialog box.
  Key parameters (see Create a node pool for all options):
  - Configure Managed Node Pool: Select intelligent management mode.
  - Instance-related configurations: For Instance Configuration Mode, select Specify Instance Type. Then select an instance type that provides a GPU such as V100, A10, or T4.
  - Node Labels: Add the label ack.aliyun.com/nvidia-driver-version:550.144.03 to set the NVIDIA driver version to 550.144.03.
  - Container Image Acceleration: Enable this option to reduce the time needed to pull the model image.
3. Deploy Knative components
Step 2: Prepare model files and upload to OSS
Download Qwen3.5-4B from ModelScope to a temporary ECS instance, upload it to OSS with ossutil, and mount the bucket path as a persistent volume so that pods do not download the model again each time they restart.
Before you begin:
- An OSS bucket is created.
- ossutil is installed and configured on the temporary ECS instance.
1. Download Qwen3.5-4B model files
Run the following commands on the temporary ECS instance.
- Install Git.
    # You can run 'yum install git' or 'apt install git' to install it.
    sudo yum install git
- Install Git LFS (Large File Storage).
    # You can run 'yum install git-lfs' or 'apt install git-lfs' to install it.
    sudo yum install git-lfs
- Clone the Qwen3.5-4B repository from ModelScope, skipping LFS files.
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen3.5-4B.git
- Enter the directory and pull the LFS-managed files.
    cd Qwen3.5-4B
    git lfs pull
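Because the clone skips LFS files, it can be worth confirming that git lfs pull replaced every pointer stub before you upload anything. The following is a minimal sketch of one way to check; the function name find_lfs_pointers is hypothetical. It relies on the fact that a Git LFS pointer is a small text file whose first line begins with "version https://git-lfs.github.com/spec/v1".

```shell
# List files under a directory that are still Git LFS pointer stubs.
# Pointer stubs are tiny text files; real model weights are large
# binary files, so only files under 1024 bytes are inspected.
find_lfs_pointers() {
  dir=$1
  # Skip the repository metadata in .git and print small regular files.
  find "$dir" -path "$dir/.git" -prune -o -type f -size -1024c -print |
  while IFS= read -r f; do
    # Check only the first line of each candidate file.
    if head -n 1 "$f" | grep -q '^version https://git-lfs.github.com/spec/v1'; then
      echo "$f"
    fi
  done
}

# Run from the parent directory of the clone.
# No output means that no pointer stubs were found.
if [ -d ./Qwen3.5-4B ]; then
  find_lfs_pointers ./Qwen3.5-4B
fi
```

If the function prints any paths, rerun git lfs pull inside the repository before you continue.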
2. Upload model files to OSS
- Create a model directory in your OSS bucket. Replace <Your-Bucket-Name> with your bucket name.
    ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen3.5-4B
- Upload the model files to OSS.
    ossutil cp -r ./Qwen3.5-4B oss://<Your-Bucket-Name>/models/Qwen3.5-4B
3. Configure an OSS storage volume
- Choose an authentication method (RRSA or AccessKey) and prepare the access credentials.
  This topic uses AccessKey authentication. For other methods, see Use an ossfs 2.0 static persistent volume.
- Store the AccessKey pair as a Kubernetes Secret that the PV uses to access OSS. Replace <yourAccessKeyID> and <yourAccessKeySecret> with your credentials. The Secret must be in the same namespace as the application.
    kubectl create -n default secret generic oss-secret --from-literal='akId=<yourAccessKeyID>' --from-literal='akSecret=<yourAccessKeySecret>'
- Create a PV and a PVC to mount the OSS model directory in read-only mode. This example uses an ossfs 2.0 static persistent volume.
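The PV and PVC step could look roughly like the following sketch, which writes a manifest to a local file. Everything in it is an assumption to check against Use an ossfs 2.0 static persistent volume: the resource names qwen-model-pv and qwen-model-pvc, the file name, the 20Gi capacity, and the <your-oss-endpoint> placeholder are hypothetical, and the CSI attribute names follow the ossfs 2.0 static volume format as commonly documented. The Secret name oss-secret and the path /models/Qwen3.5-4B come from the earlier steps.

```shell
# Write a read-only PV/PVC pair for the OSS model directory to a file.
# Names, capacity, and the endpoint placeholder are illustrative only.
cat > qwen-model-pv.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: qwen-model-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: qwen-model-pv        # assumed to match the PV name
    nodePublishSecretRef:
      name: oss-secret                 # Secret created in the previous step
      namespace: default
    volumeAttributes:
      fuseType: ossfs2                 # assumed selector for ossfs 2.0
      bucket: <Your-Bucket-Name>
      path: /models/Qwen3.5-4B
      url: <your-oss-endpoint>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen-model-pvc
  namespace: default
spec:
  storageClassName: ""                 # bind statically, skip any default class
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 20Gi
  volumeName: qwen-model-pv
EOF
echo "wrote qwen-model-pv.yaml"
```

After replacing the placeholders, you could apply the file with kubectl apply -f qwen-model-pv.yaml and confirm with kubectl get pvc -n default that the claim is bound.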
Step 3: Deploy and verify Knative service
1. Create a Knative Service
- On the ACK Clusters page, click the name of your cluster. In the left navigation pane, open the Knative page.
- On the Service Management tab, click Create from Template. Set Sample Template to Custom and deploy the Knative Service.
  Parameters and descriptions:
  - autoscaling.knative.dev/metric: The autoscaling metric. Valid values:
    - concurrency (default): Scale by concurrency.
    - rps: Scale by requests per second.
  - autoscaling.knative.dev/target: The target metric value that triggers autoscaling.
  - autoscaling.knative.dev/minScale: The minimum number of replicas. Integer ≥ 0. Set to 0 to enable scale-to-zero.
  - autoscaling.knative.dev/maxScale: The maximum number of replicas. Limits scale-out.
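To show where these annotations go, the sketch below writes a minimal Knative Service manifest to a local file. It is an illustration under assumptions, not the template from this topic: the image placeholder <your-inference-image>, port 8000, the annotation values, the file name, and the PVC name qwen-model-pvc are hypothetical, and the image is assumed to start an OpenAI-compatible server that loads the model from the mounted path. The service name qwen in the default namespace is chosen to match the sample domain qwen.default.example.com.

```shell
# Write a Knative Service manifest that carries the autoscaling annotations.
# Image, port, annotation values, and PVC name are placeholders.
cat > qwen-ksvc.yaml <<'EOF'
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/minScale: "0"   # allow scale-to-zero
        autoscaling.knative.dev/maxScale: "3"
    spec:
      containers:
        - name: inference
          image: <your-inference-image>
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model
              mountPath: /models/Qwen3.5-4B
              readOnly: true
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwen-model-pvc
            readOnly: true
EOF
echo "wrote qwen-ksvc.yaml"
```

The annotations sit under spec.template.metadata, so they apply to each revision. Their values must be quoted strings. Upstream Knative gates PVC volumes behind a feature flag, so the volume section may need that feature enabled in your cluster.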
2. Verify service deployment
- On the Service Management tab, verify that the service is ready. Note the default domain name and the access gateway address.
  Note: Send requests to the access gateway (format: alb-xxx.aliyuncsslb.com) with the Host header set to the service domain (format: qwen.default.example.com).
- Send a test request to the inference service. Replace xx.40.85.xx with your access gateway address and qwen.default.example.com with your default domain name.
    curl http://xx.40.85.xx:80/v1/chat/completions \
      -H "Host: qwen.default.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen3.5-4B",
        "messages": [
          {
            "role": "user",
            "content": [
              { "type": "text", "text": "Tell me about Hangzhou" }
            ]
          }
        ],
        "max_tokens": 200
      }'
  Expected output:
    {
      "id": "chatcmpl-20dfb4c8-d1ab-48bc-9f1a-78b84c6c8adf",
      "object": "chat.completion",
      "created": 1772602897,
      "model": "Qwen3.5-4B",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Hangzhou, abbreviated as 'Hang', is a sub-provincial city located in Zhejiang Province, China..."
          },
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 14,
        "completion_tokens": 200,
        "total_tokens": 214
      }
    }
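If the minimum replica count is 0, the first request after an idle period has to wait for a pod, and possibly a GPU node, to start, so it may take much longer than later requests or time out. The sketch below is one hedged way to handle that from a client script: a small retry helper wrapped around the test request. The function name retry_until_ok, the attempt count, and the delay are hypothetical choices, not values from this topic.

```shell
# Retry a command until it succeeds or the attempt budget is used up.
# Usage: retry_until_ok <attempts> <delay-seconds> <command> [args...]
retry_until_ok() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the wrapped command; stop at the first success.
    if "$@"; then
      return 0
    fi
    echo "attempt $i of $attempts failed" >&2
    # Wait between attempts, but not after the final one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (commented out because it needs a reachable gateway):
# retry_until_ok 10 30 curl -sf --max-time 60 \
#   http://xx.40.85.xx:80/v1/chat/completions \
#   -H "Host: qwen.default.example.com" \
#   -H "Content-Type: application/json" \
#   -d '{"model": "Qwen3.5-4B", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'
```

In the example, the -f option makes curl return a nonzero exit code on HTTP errors, which is what lets the helper treat them as failed attempts.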