
Container Service for Kubernetes: Quickly deploy LLM inference services on ACK

Last Updated: Nov 07, 2025

Container Service for Kubernetes (ACK) managed Pro clusters provide a streamlined environment for deploying Large Language Model (LLM) inference services, abstracting away the complexities of managing underlying hardware and dependencies. This allows you to quickly validate a model's inference capabilities without the common challenges of insufficient local GPU resources or complex environment setup.

Prerequisites

  • You have an ACK managed Pro cluster running Kubernetes version 1.22 or later. The cluster must also include GPU-accelerated nodes, each with at least 16 GB of GPU memory.

  • The NVIDIA driver version is 535 or later. This topic uses an example where the ack.aliyun.com/nvidia-driver-version label is added to the GPU node pool with the value 550.144.03.
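
You can check the GPU nodes from the command line before you deploy. The commands below are a sketch: nvidia.com/gpu is the extended resource name registered by the NVIDIA device plugin, and the -L flag simply adds the driver-version label mentioned above as an output column.

  # Show the number of allocatable GPUs on each node (non-GPU nodes show <none>).
  kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
  # Show the NVIDIA driver version label configured on the node pool's nodes.
  kubectl get nodes -L ack.aliyun.com/nvidia-driver-version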

Option 1: Quick deployment for testing

Use Arena to quickly deploy qwen/Qwen1.5-4B-Chat. This method is suitable for test scenarios and takes about 15 minutes.

  1. Install the Arena client.

  2. Deploy a custom service using Arena, specifying the container image for the deployment with the --image flag. See Appendix: Command parameter reference for the full list.

    This method downloads the ModelScope model files into the container. Ensure that your GPU node has at least 30 GB of available disk space to accommodate the model.
    arena serve custom \
        --name=modelscope \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
        "MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"
    To pull model files from a Hugging Face repository instead, see How can I pull model files from a Hugging Face repository? in the FAQ.

    The following output indicates that the Kubernetes resources for the modelscope-v1 inference service have been created:

    service/modelscope-v1 created
    deployment.apps/modelscope-v1-custom-serving created
    INFO[0002] The Job modelscope has been submitted successfully
    INFO[0002] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
  3. Check the service status.

    It may take several minutes for the model to download, during which the pod will be in the ContainerCreating state.
    arena serve get modelscope

    Once the pod status is Running, the modelscope inference service is ready.
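
    If you prefer to inspect the underlying Kubernetes objects directly, you can also check the pod and its logs. The deployment name comes from the output shown in the previous step.

    # List the pods that belong to the modelscope-v1 deployment.
    kubectl get pods -o wide | grep modelscope-v1
    # Follow the container logs to watch the model download and server startup.
    kubectl logs deployment/modelscope-v1-custom-serving -f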

Option 2: Production-ready deployment with persistent storage

For production environments, it is best practice to pre-load your model files onto a persistent storage volume, such as Object Storage Service (OSS). This method avoids repeatedly downloading large model files (over 10 GB) each time a pod starts, thereby greatly reducing cold-start times, lowering network bandwidth costs, and improving service stability.

Step 1: Prepare the model data

  1. Download the model files from ModelScope.

    1. Install Git and the Git Large File Storage (LFS) extension.

      macOS
      1. Install Git.

        You can install Git with Homebrew, or download the installer from the official Git website.
        brew install git
      2. Install the Git LFS extension to pull large files.

        brew install git-lfs
      Windows
      1. Install Git.

        Download and install a suitable version from the official Git website.

      2. Install the Git LFS extension to pull large files. Git LFS is bundled with recent versions of Git for Windows, so installing the latest Git for Windows also installs Git LFS.

      Linux
      1. Install Git.

        The following command is for Red Hat-based Linux distributions. For other systems, see the official Git website.

        yum install git
      2. Install the Git LFS extension to pull large files.

        yum install git-lfs
    2. Download the Qwen1.5-4B-Chat model.

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
      cd Qwen1.5-4B-Chat
      git lfs pull
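
      Before uploading, optionally confirm that the weight files were fully downloaded rather than left as small LFS pointer files (the exact file names depend on the model repository):

      # Run inside the Qwen1.5-4B-Chat directory. The total size should be several gigabytes;
      # LFS pointer files are only a few hundred bytes.
      du -sh .
      ls -lh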
  2. Upload the Qwen1.5-4B-Chat model files to an OSS bucket.

    1. Install and configure ossutil to manage OSS resources.

    2. Create a bucket.

      To accelerate model pulling, create the bucket in the same region as your cluster.
      ossutil mb oss://<Your-Bucket-Name>
    3. Create a folder named Qwen1.5-4B-Chat in your bucket.

      ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
    4. Upload the model files.

      ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
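
      Optionally, list the uploaded objects to confirm that the model files are in place:

      ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat/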
  3. Configure a persistent volume (PV).

    1. Log on to the ACK console, then click the target cluster. In the left navigation pane, choose Volumes > Persistent Volumes.

    2. On the Persistent Volumes page, click Create. In the Create PV dialog box, configure the parameters, and click Create.

      • PV Type: select OSS

      • Volume Name: llm-model

      • Capacity: 20Gi

      • Access Mode: ReadOnlyMany

      • Access Certificate: select Create Secret

      • Optional Parameters: -o umask=022 -o max_stat_cache_size=0 -o allow_other

      • Bucket ID: click Select Bucket

      • OSS Path: /Qwen1.5-4B-Chat

      • Endpoint: select Public Endpoint
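
      If you manage credentials from the command line, you can also create the AccessKey Secret yourself and select it instead of using the Create Secret option. The following is only a sketch: the Secret name oss-secret is hypothetical, and akId/akSecret are the key names conventionally read by the ACK OSS CSI plugin; verify them against your cluster's CSI documentation before relying on this.

      # Hypothetical Secret holding the OSS AccessKey pair; replace the placeholders.
      kubectl create secret generic oss-secret \
        --from-literal=akId='<AccessKey-ID>' \
        --from-literal=akSecret='<AccessKey-Secret>'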

  4. Configure a persistent volume claim (PVC).

    1. On the cluster details page, choose Volumes > Persistent Volume Claims.

    2. On the Persistent Volume Claims page, click Create. In the Create PVC dialog box, configure the parameters, and click Create.

      • PVC Type: select OSS

      • Name: llm-model

      • Allocation Mode: select Existing Volumes

      • Existing Volumes: select the llm-model PV created in the previous step

      • Capacity: 20Gi
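
      After the PV and PVC are created, you can confirm from the command line that they are bound before deploying the service:

      # Both objects should report a STATUS of Bound.
      kubectl get pv llm-model
      kubectl get pvc llm-model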

Step 2: Deploy the inference service

  1. Install the Arena client.

  2. Deploy the service using Arena. This command is similar to the testing deployment but includes the --data flag to mount the PVC containing the model files. See Appendix: Command parameter reference for the full list.

    arena serve custom \
        --name=modelscope \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --data=llm-model:/Qwen1.5-4B-Chat \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
        "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"

    The following output indicates that the inference service has been submitted:

    service/modelscope-v1 created
    deployment.apps/modelscope-v1-custom-serving created
    INFO[0001] The Job modelscope has been submitted successfully
    INFO[0001] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
  3. Check the service status.

    arena serve get modelscope

    Once the pod status is Running, the inference service is ready.
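
    To confirm that the model files are read from the mounted volume rather than downloaded into the container, you can check that the serving pod references the llm-model claim:

    # The serving pod's volume list should include the llm-model PersistentVolumeClaim.
    kubectl get pods -o yaml | grep -B 2 "claimName: llm-model"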

Validate the inference service

  1. Set up port forwarding to the inference service.

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or extensible in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.

    kubectl port-forward svc/modelscope-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. In a new terminal, send a sample inference request.

    curl -X POST http://localhost:8000/generate \
      -H "Content-Type: application/json" \
      -d '{
        "text_input": "What is artificial intelligence? Artificial intelligence is", 
        "parameters": {
          "stream": false, 
          "temperature": 0.9, 
          "seed": 10
        }
      }'

    A successful response will contain the model's generated text:

    {"model_name":"/Qwen1.5-4B-Chat","text_output":"What is artificial intelligence? Artificial intelligence is a branch of computer science that studies how to make computers have intelligent behavior."}

(Optional) Clean up the environment

When you're finished, delete the inference service to release the resources.

  • Delete the deployed inference service.

    arena serve del modelscope
  • Delete the created PV and PVC.

    kubectl delete pvc llm-model
    kubectl delete pv llm-model
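
  • (Optional) Confirm that no Arena serving jobs remain. The deleted service should no longer appear in the list.

    arena serve list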

Appendix: Command parameter reference

The following list describes the parameters used in the arena serve custom command, each with an example value.

  • serve custom: Arena subcommand. Deploys a custom model service instead of using a preset type such as tfserving or triton.

  • --name: Service name. Specifies a unique name for the service to be deployed. The name will be used for subsequent management operations, such as viewing logs and deleting the service. Example: modelscope

  • --version: Service version. Specifies a version number for the service to facilitate operations such as version management and phased releases. Example: v1

  • --gpus: GPU resources. The number of GPUs allocated to each service instance (pod). This parameter is required if the model needs GPUs for inference. Example: 1

  • --replicas: Replica count. The number of service instances (pods) to run. Increasing the number of replicas can improve the service's concurrent processing capability and availability. Example: 1

  • --restful-port: RESTful port. The port on which the service will expose its RESTful API to receive inference requests. Example: 8000

  • --readiness-probe-action: Readiness probe type. Sets the check method for the Kubernetes readiness probe, which determines if the container is ready to receive traffic. Example: tcpSocket

  • --readiness-probe-action-option: Probe type options. Provides specific parameters for the chosen probe type. For tcpSocket, it specifies the port to check. Example: port: 8000

  • --readiness-probe-option: Other probe options. Sets additional parameters for the readiness probe's behavior. This parameter can be used multiple times. Example: initialDelaySeconds: 30 and periodSeconds: 30, which set the initial delay and check interval.

  • --data: Volume mount. Mounts a PVC to a specified path in the container. The format is PVC name:mount path. This is commonly used to mount model files. Example: llm-model:/Qwen1.5-4B-Chat

  • --image: Container image. The full URL of the container image used to deploy the service. This defines the core runtime environment for the service. Example: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1

  • [COMMAND]: Start command. The command to execute after the container starts. Example: "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py", which sets the MODEL_ID environment variable and starts the server.py script.

FAQ

How can I pull model files from a Hugging Face repository?

  1. Ensure the container runtime environment can access the Hugging Face repository.
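
    For example, you can run a short-lived pod in the cluster and check that it can reach huggingface.co. This is a minimal sketch: the curlimages/curl image is just one convenient choice, and a proxy or mirror may be required depending on your network environment.

    # Run a one-off pod that sends a HEAD request to the Hugging Face endpoint, then cleans itself up.
    kubectl run hf-check --rm -it --restart=Never \
      --image=curlimages/curl --command -- curl -sI https://huggingface.co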

  2. Deploy a custom service using Arena, specifying the container image for the deployment with the --image flag. See Appendix: Command parameter reference for the full list.

    This method downloads the Hugging Face model files directly into the container. Ensure that your GPU node has at least 30 GB of available disk space to accommodate the model.
    arena serve custom \
        --name=huggingface \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
        "MODEL_ID=Qwen/Qwen1.5-4B-Chat MODEL_SOURCE=Huggingface python3 server.py"

    The following output indicates that the Kubernetes resources for the huggingface-v1 inference service have been created:

    service/huggingface-v1 created
    deployment.apps/huggingface-v1-custom-serving created
    INFO[0003] The Job huggingface has been submitted successfully 
    INFO[0003] You can run `arena serve get huggingface --type custom-serving -n default` to check the job status 
