Container Service for Kubernetes (ACK) managed Pro clusters provide a streamlined environment for deploying Large Language Model (LLM) inference services, abstracting away the complexities of managing underlying hardware and dependencies. This allows you to quickly validate a model's inference capabilities without the common challenges of insufficient local GPU resources or complex environment setup.
Prerequisites
You have an ACK managed Pro cluster running Kubernetes version 1.22 or later. The cluster must also include GPU-accelerated nodes, each with at least 16 GB of GPU memory.
The NVIDIA driver version is 535 or later. This topic uses an example where the `ack.aliyun.com/nvidia-driver-version` label is added to the GPU node pool with the value `550.144.03`.
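To confirm that a node pool meets these requirements, you can list the GPU nodes together with the driver-version label. This is an optional check; the node name in the second command is a placeholder.

```shell
# List nodes with the NVIDIA driver-version label shown as an extra column.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version

# Confirm that a node advertises GPU resources (requires the NVIDIA device plugin).
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```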
Option 1: Quick deployment for testing
Use Arena to quickly deploy qwen/Qwen1.5-4B-Chat. This method is suitable for test scenarios and takes about 15 minutes.
Install the Arena client.
Deploy a custom service using Arena, specifying the container image for the deployment with the `--image` flag. See Appendix: Command parameter reference for the full list.

This method downloads the ModelScope model files into the container. Ensure that your GPU node has at least 30 GB of available disk space to accommodate the model.

```shell
arena serve custom \
  --name=modelscope \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
  "MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"
```

To pull the model from Hugging Face instead, see How can I pull model files from a Hugging Face repository? in the FAQ.
The following output indicates that the Kubernetes resources for the `modelscope-v1` inference service have been created:

```text
service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0002] The Job modelscope has been submitted successfully
INFO[0002] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
```

Check the service status. It may take several minutes for the model to download, during which the pod will be in the `ContainerCreating` state.

```shell
arena serve get modelscope
```

Once the pod status is `Running`, the `modelscope` inference service is ready.
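If you prefer to watch the underlying Kubernetes objects directly, kubectl works as well; the deployment name `modelscope-v1-custom-serving` comes from the submission output above.

```shell
# Watch the pod transition from ContainerCreating to Running.
kubectl get pods | grep modelscope-v1

# Follow the container logs to observe the model download and server startup.
kubectl logs deploy/modelscope-v1-custom-serving -f
```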
Option 2: Production-ready deployment with persistent storage
For production environments, it is best practice to pre-load your model files onto a persistent storage volume, such as Object Storage Service (OSS). This method avoids repeatedly downloading large model files (over 10 GB) each time a pod starts, thereby greatly reducing cold-start times, lowering network bandwidth costs, and improving service stability.
Step 1: Prepare the model data
Download the model files from ModelScope.
Install Git and the Git Large File Storage (LFS) extension.
macOS

Install Git. The officially maintained macOS Git installer is available from the official Git website. Alternatively, install it with Homebrew:

```shell
brew install git
```

Install the Git LFS extension to pull large files.

```shell
brew install git-lfs
```
Windows
Install Git.
Download and install a suitable version from the official Git website.
Install the Git LFS extension to pull large files. Git LFS is bundled with Git for Windows, so download and use the latest version of Git for Windows.
Linux
Install Git.
The following command is for Red Hat-based Linux distributions. For other systems, see the official Git website.
```shell
yum install git
```

Install the Git LFS extension to pull large files.

```shell
yum install git-lfs
```
Download the Qwen1.5-4B-Chat model.
```shell
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
cd Qwen1.5-4B-Chat
git lfs pull
```
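Optionally, verify that the LFS-tracked weight files were actually downloaded rather than left as small pointer files. The file patterns below are assumptions about how the Qwen1.5-4B-Chat repository stores its weights.

```shell
# The weight files should total several GB; LFS pointer files are only ~130 bytes.
du -sh .
ls -lh *.safetensors *.bin 2>/dev/null
```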
Upload the Qwen1.5-4B-Chat model files to an OSS bucket.
Install and configure ossutil to manage OSS resources.
Create a bucket.
To accelerate model pulling, create the bucket in the same region as your cluster.
```shell
ossutil mb oss://<Your-Bucket-Name>
```

Create a folder named Qwen1.5-4B-Chat in your bucket.

```shell
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```

Upload the model files.

```shell
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```
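You can verify the upload before configuring storage; `ossutil ls` lists the objects under the Qwen1.5-4B-Chat prefix (the bucket name is a placeholder).

```shell
# List the uploaded model files and their sizes.
ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```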
Configure a persistent volume (PV).
Log on to the ACK console, then click the target cluster. In the left navigation pane, choose Volumes > Persistent Volumes.

On the Persistent Volumes page, click Create. In the Create PV dialog box, configure the following parameters, and click Create. (A kubectl-based equivalent is sketched after this list.)

- PV Type: select OSS
- Volume Name: llm-model
- Capacity: 20Gi
- Access Mode: ReadOnlyMany
- Access Certificate: select Create Secret
- Optional Parameters: -o umask=022 -o max_stat_cache_size=0 -o allow_other
- Bucket ID: click Select Bucket and select the bucket that you created
- OSS Path: /Qwen1.5-4B-Chat
- Endpoint: select Public Endpoint
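If you manage cluster resources declaratively, the console form above corresponds roughly to a statically provisioned OSS volume served by the ACK OSS CSI driver. The following is a minimal sketch, not the exact manifest the console generates: the Secret name oss-secret (holding your AccessKey pair), the alicloud-pvname label, and the endpoint value are assumptions you should adapt to your environment.

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model    # used by the PVC selector in the next step
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret            # assumed Secret containing the AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "<Your-Bucket-Name>"
      url: "oss-<region>.aliyuncs.com"   # public endpoint of the bucket's region
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/Qwen1.5-4B-Chat"
EOF
```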
Configure a persistent volume claim (PVC).

On the cluster details page, choose Volumes > Persistent Volume Claims. On the Persistent Volume Claims page, configure the following parameters. (A kubectl-based equivalent is sketched after this list.)

- PVC Type: select OSS
- Name: llm-model
- Allocation Mode: select Existing Volumes
- Existing Volumes: select the llm-model PV created in the previous step
- Capacity: 20Gi
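The claim can likewise be created with kubectl; the selector below binds it to the PV through the alicloud-pvname label used in the PV sketch above (adjust it if you created the PV through the console).

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF

# The claim should report STATUS=Bound once it has matched the PV.
kubectl get pvc llm-model
```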
Step 2: Deploy the inference service
Install the Arena client.
Deploy the service using Arena. This command is similar to the testing deployment but includes the `--data` flag to mount the PVC containing the model files. See Appendix: Command parameter reference for the full list.

```shell
arena serve custom \
  --name=modelscope \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --data=llm-model:/Qwen1.5-4B-Chat \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
  "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"
```

The following output indicates that the inference service has been submitted:

```text
service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0001] The Job modelscope has been submitted successfully
INFO[0001] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
```

Check the service status.
```shell
arena serve get modelscope
```

Once the pod is in the `Running` state, the inference service is ready.
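As an optional sanity check, confirm that the OSS volume is mounted at the path the start command expects; the deployment name comes from the submission output above.

```shell
# The model files uploaded to OSS should be visible at the mount path inside the pod.
kubectl exec deploy/modelscope-v1-custom-serving -- ls /Qwen1.5-4B-Chat
```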
Validate the inference service
Set up port forwarding to the inference service.
Important: Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments; it is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about production-grade networking solutions in ACK clusters, see Ingress management.
```shell
kubectl port-forward svc/modelscope-v1 8000:8000
```

Expected output:

```text
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```

In a new terminal, send a sample inference request.
```shell
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "What is artificial intelligence? Artificial intelligence is",
    "parameters": {
      "stream": false,
      "temperature": 0.9,
      "seed": 10
    }
  }'
```

A successful response contains the model's generated text:
{"model_name":"/Qwen1.5-4B-Chat","text_output":"What is artificial intelligence? Artificial intelligence is a branch of computer science that studies how to make computers have intelligent behavior."}
(Optional) Clean up the environment
When you're finished, delete the inference service to release the resources.
Delete the deployed inference service.
```shell
arena serve del modelscope
```

Delete the created PV and PVC.

```shell
kubectl delete pvc llm-model
kubectl delete pv llm-model
```
Appendix: Command parameter reference
| Parameter | Description | Example |
| --- | --- | --- |
| `serve custom` | Arena subcommand. Deploys a custom model service instead of using one of Arena's preset serving types. | (N/A) |
| `--name` | Service name. Specifies a unique name for the service to be deployed. The name is used for subsequent management operations, such as viewing logs and deleting the service. | `--name=modelscope` |
| `--version` | Service version. Specifies a version number for the service to facilitate operations such as version management and phased releases. | `--version=v1` |
| `--gpus` | GPU resources. The number of GPUs allocated to each service instance (pod). This parameter is required if the model needs GPUs for inference. | `--gpus=1` |
| `--replicas` | Replica count. The number of service instances (pods) to run. Increasing the number of replicas can improve the service's concurrent processing capability and availability. | `--replicas=1` |
| `--restful-port` | RESTful port. The port on which the service exposes its RESTful API to receive inference requests. | `--restful-port=8000` |
| `--readiness-probe-action` | Readiness probe type. Sets the check method for the Kubernetes readiness probe, which determines whether the container is ready to receive traffic. | `--readiness-probe-action="tcpSocket"` |
| `--readiness-probe-action-option` | Probe type options. Provides specific parameters for the chosen probe type. For the tcpSocket probe, this is the port to check. | `--readiness-probe-action-option="port: 8000"` |
| `--readiness-probe-option` | Other probe options. Sets additional parameters for the readiness probe's behavior. This parameter can be used multiple times. The example sets the initial delay and check interval. | `--readiness-probe-option="initialDelaySeconds: 30"` `--readiness-probe-option="periodSeconds: 30"` |
| `--data` | Volume mount. Mounts a PVC to a specified path in the container. The format is `<pvc-name>:<path-in-container>`. | `--data=llm-model:/Qwen1.5-4B-Chat` |
| `--image` | Container image. The full URL of the container image used to deploy the service. This defines the core runtime environment for the service. | `--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1` |
| (start command) | Start command. The command to execute after the container starts. The example sets the MODEL_ID environment variable and runs server.py to start the inference service. | `"MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"` |
FAQ
How can I pull model files from a Hugging Face repository?
Ensure the container runtime environment can access the Hugging Face repository.
Deploy a custom service using Arena, specifying the container image for the deployment with the `--image` flag. See Appendix: Command parameter reference for the full list.

This method downloads the Hugging Face model files directly into the container. Ensure that your GPU node has at least 30 GB of available disk space to accommodate the model.

```shell
arena serve custom \
  --name=huggingface \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
  "MODEL_ID=Qwen/Qwen1.5-4B-Chat MODEL_SOURCE=Huggingface python3 server.py"
```

The following output indicates that the Kubernetes resources for the `huggingface-v1` inference service have been created:

```text
service/huggingface-v1 created
deployment.apps/huggingface-v1-custom-serving created
INFO[0003] The Job huggingface has been submitted successfully
INFO[0003] You can run `arena serve get huggingface --type custom-serving -n default` to check the job status
```
References
To specify a driver version, see Specify an NVIDIA driver version for nodes by adding a label.
To use mature inference service frameworks, such as vLLM and Triton, in a production environment, see Deploy a Qwen model inference service using vLLM and Deploy a Qwen model inference service using Triton.