
Container Service for Kubernetes:Quickly deploy a large language model inference service in ACK

Last Updated:Mar 26, 2026

ACK managed Pro clusters give you a ready-to-use environment for running large language model (LLM) inference services: no local GPU hardware required, no complex dependency setup. This guide covers two deployment paths: a quick option for validating a model in about 15 minutes, and a production-grade option that pre-loads model files onto persistent storage to reduce cold-start times and bandwidth costs.

Prerequisites

Before you begin, ensure that you have:

  • An ACK managed Pro cluster running Kubernetes 1.22 or later

  • At least one GPU-accelerated node with 16 GB or more of GPU memory

  • NVIDIA driver version 535 or later installed on the GPU node pool (this guide uses 550.144.03, set via the ack.aliyun.com/nvidia-driver-version label)

  • The Arena client installed

Choose a deployment path

|               | Option 1: Quick test                          | Option 2: Production                         |
| ------------- | --------------------------------------------- | -------------------------------------------- |
| Setup time    | ~15 minutes                                   | Longer (model pre-upload required)           |
| Model storage | Downloaded into the container at startup      | Pre-loaded on Object Storage Service (OSS)   |
| Cold start    | Slow: model re-downloads on every pod restart | Fast: model is already on the mounted volume |
| Best for     | Validating inference capabilities             | Stable, repeatable production workloads      |

Option 1: Quick deployment for testing

Use Arena to deploy qwen/Qwen1.5-4B-Chat from ModelScope. The container downloads the model at startup, so the GPU node needs at least 30 GB of free disk space.

  1. Run the Arena command to deploy the inference service:

    arena serve custom \
        --name=modelscope \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
        "MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"
    To pull model files from a Hugging Face repository, see Pull models from Hugging Face.

    The following output confirms the Kubernetes resources for modelscope-v1 were created:

    service/modelscope-v1 created
    deployment.apps/modelscope-v1-custom-serving created
    INFO[0002] The Job modelscope has been submitted successfully
    INFO[0002] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
  2. Check the service status. The pod stays in ContainerCreating while the model downloads. Depending on network conditions, this can take 5–15 minutes:

    arena serve get modelscope

    Once the pod status shows Running, the inference service is ready.

Option 2: Production-ready deployment with persistent storage

Pre-loading model files on OSS avoids re-downloading files larger than 10 GB every time a pod restarts. This reduces cold-start times, lowers bandwidth costs, and improves service stability.

Step 1: Download the model files

  1. Install Git and Git Large File Storage (LFS).

    macOS:

    brew install git
    brew install git-lfs

    Windows: Download and install Git from the official Git website. Git LFS is bundled with Git for Windows; download the latest version.

    Linux (Red Hat-based):

    yum install git
    yum install git-lfs

    For other Linux distributions, see the official Git website.

  2. Clone the Qwen1.5-4B-Chat model repository and pull the large files:

    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
    cd Qwen1.5-4B-Chat
    git lfs pull
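Before moving on, it can be worth confirming that `git lfs pull` actually fetched the model weights rather than leaving ~130-byte LFS pointer files behind. A minimal sketch (the helper name `check_lfs_pulled` is illustrative, not part of this guide):

```shell
# check_lfs_pulled DIR: list weight files under DIR that are suspiciously
# small (under 1024 bytes) and therefore likely still unpulled LFS pointers.
check_lfs_pulled() {
  find "$1" -type f \( -name '*.safetensors' -o -name '*.bin' \) -size -1024c
}

# Usage: check_lfs_pulled ./Qwen1.5-4B-Chat
# No output means every weight file was fully downloaded.
```

Note the byte-exact `-1024c` size test; `find -size -1k` rounds sizes up to whole blocks and would only match empty files.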

Step 2: Upload model files to OSS

  1. Install and configure ossutil.

  2. Create a bucket. To reduce model pull latency, create the bucket in the same region as your cluster:

    ossutil mb oss://<your-bucket-name>
  3. Create a folder in the bucket for the model files:

    ossutil mkdir oss://<your-bucket-name>/Qwen1.5-4B-Chat
  4. Upload the model files:

    ossutil cp -r ./Qwen1.5-4B-Chat oss://<your-bucket-name>/Qwen1.5-4B-Chat
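After the upload finishes, a quick sanity check is to compare the local file count against the object count that `ossutil ls oss://<your-bucket-name>/Qwen1.5-4B-Chat` reports. A small sketch (the helper name `count_model_files` is illustrative):

```shell
# count_model_files DIR: count regular files under DIR, for comparison
# against the object count reported by `ossutil ls` after the upload.
count_model_files() {
  find "$1" -type f | wc -l
}

# Usage: count_model_files ./Qwen1.5-4B-Chat
```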

Step 3: Configure a persistent volume (PV)

  1. Log on to the ACK console and click the target cluster. In the left navigation pane, choose Volumes > Persistent Volumes.

  2. Click Create. In the Create PV dialog box, set the following parameters and click Create:

    | Parameter           | Value                                                |
    | ------------------- | ---------------------------------------------------- |
    | PV type             | OSS                                                  |
    | Volume name         | llm-model                                            |
    | Capacity            | 20Gi                                                 |
    | Access mode         | ReadOnlyMany                                         |
    | Access certificate  | Select Create Secret                                 |
    | Optional parameters | -o umask=022 -o max_stat_cache_size=0 -o allow_other |
    | Bucket ID           | Click Select Bucket and select your bucket           |
    | OSS path            | /Qwen1.5-4B-Chat                                     |
    | Endpoint            | Select Public Endpoint                               |
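If you prefer kubectl over the console, the console steps above correspond to a statically provisioned OSS volume. The manifest below is a sketch based on the ACK OSS CSI plugin; the driver name, `volumeAttributes` fields, and the `akId`/`akSecret` secret keys are assumptions to verify against your cluster's documentation, and the placeholders must be replaced before applying:

```yaml
# Sketch only: verify field names against the ACK CSI plugin docs
# for your cluster version before applying.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: "<your-access-key-id>"         # assumed key names for the OSS CSI plugin
  akSecret: "<your-access-key-secret>"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model            # must match metadata.name
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "<your-bucket-name>"
      url: "oss-<region>.aliyuncs.com" # public endpoint of the bucket's region
      path: "/Qwen1.5-4B-Chat"
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
```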

Step 4: Configure a persistent volume claim (PVC)

  1. In the left navigation pane, choose Volumes > Persistent Volume Claims.

  2. On the Persistent Volume Claims page, set the following parameters and click Create:

    | Parameter        | Value                                              |
    | ---------------- | -------------------------------------------------- |
    | PVC type         | OSS                                                |
    | Name             | llm-model                                          |
    | Allocation mode  | Select Existing Volumes                            |
    | Existing volumes | Select the llm-model PV created in the previous step |
    | Capacity         | 20Gi                                               |
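The kubectl equivalent of this console step is a claim that binds statically to the llm-model PV. A sketch, assuming the PV carries the `alicloud-pvname: llm-model` label:

```yaml
# Sketch only: static binding to the llm-model PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""   # empty string disables dynamic provisioning
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```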

Step 5: Deploy the inference service

Run the Arena command to deploy the service. The --data flag mounts the PVC containing the pre-loaded model files. Because the model is already on the mounted volume, the pod starts without downloading anything:

arena serve custom \
    --name=modelscope \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --data=llm-model:/Qwen1.5-4B-Chat \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
    "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"

The following output confirms the inference service was submitted:

service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0001] The Job modelscope has been submitted successfully
INFO[0001] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status

Check the service status:

arena serve get modelscope

Once the pod status shows Running, the inference service is ready.

Validate the inference service

  1. Set up port forwarding to the inference service:

    Important

    kubectl port-forward is for development and debugging only. It is not reliable, secure, or scalable in production. For production networking, see Ingress management.

    kubectl port-forward svc/modelscope-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. In a new terminal, send a test inference request:

    curl -X POST http://localhost:8000/generate \
      -H "Content-Type: application/json" \
      -d '{
        "text_input": "What is artificial intelligence? Artificial intelligence is",
        "parameters": {
          "stream": false,
          "temperature": 0.9,
          "seed": 10
        }
      }'

    A successful response contains the model's generated text:

    {"model_name":"/Qwen1.5-4B-Chat","text_output":"What is artificial intelligence? Artificial intelligence is a branch of computer science that studies how to make computers have intelligent behavior."}
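If you want only the generated text rather than the raw JSON, the response can be post-processed with a one-liner (assumes python3 on the client machine; jq works equally well if installed):

```shell
# Extract the text_output field from an inference response.
# The sample response is hard-coded so the snippet runs standalone;
# in practice, pipe the curl output in instead.
response='{"model_name":"/Qwen1.5-4B-Chat","text_output":"Artificial intelligence is a branch of computer science."}'
echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["text_output"])'
```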

(Optional) Clean up

Delete the inference service and storage resources when you're done:

# Delete the inference service
arena serve del modelscope

# Delete the PVC and PV (Option 2 only)
kubectl delete pvc llm-model
kubectl delete pv llm-model

FAQ

How can I pull model files from Hugging Face instead of ModelScope?

Make sure the container runtime can reach the Hugging Face repository, then set MODEL_SOURCE=Huggingface in the Arena command. The GPU node needs at least 30 GB of free disk space to accommodate the downloaded files:

arena serve custom \
    --name=huggingface \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
    "MODEL_ID=Qwen/Qwen1.5-4B-Chat MODEL_SOURCE=Huggingface python3 server.py"

The following output confirms the resources were created:

service/huggingface-v1 created
deployment.apps/huggingface-v1-custom-serving created
INFO[0003] The Job huggingface has been submitted successfully
INFO[0003] You can run `arena serve get huggingface --type custom-serving -n default` to check the job status

Appendix: command parameter reference

| Parameter | Description | Example |
| --- | --- | --- |
| serve custom | Arena subcommand. Deploys a custom model service rather than a preset type such as tfserving or triton. | - |
| --name | Service name. A unique identifier used for subsequent operations such as checking logs and deleting the service. | modelscope |
| --version | Service version. A version label for the service, useful for version management and phased releases. | v1 |
| --gpus | GPU count. The number of GPUs allocated to each pod. Required when the model needs GPUs for inference. | 1 |
| --replicas | Replica count. The number of pods to run. More replicas increase concurrent throughput and availability. | 1 |
| --restful-port | RESTful API port. The port on which the service exposes its RESTful API to receive inference requests. | 8000 |
| --readiness-probe-action | Readiness probe type. The check method used by the Kubernetes readiness probe to determine whether the container is ready to receive traffic. | tcpSocket |
| --readiness-probe-action-option | Probe type options. Parameters for the chosen probe type. For tcpSocket, specifies the port to check. | port: 8000 |
| --readiness-probe-option | Additional probe settings. Extra parameters for the readiness probe. This flag can be repeated. Sets the initial delay and check interval. | initialDelaySeconds: 30, periodSeconds: 30 |
| --data | Volume mount. Mounts a PVC at a specified path inside the container, in the format <pvc-name>:<mount-path>. Used to mount pre-loaded model files. | llm-model:/Qwen1.5-4B-Chat |
| --image | Container image. The full URL of the container image that defines the runtime environment for the service. | kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 |
| [COMMAND] | Startup command. The command to run after the container starts. Sets the MODEL_ID environment variable and launches server.py. | "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py" |


What's next