Container Service for Kubernetes: Deploy a Qwen2 model inference service using TensorRT-LLM

Last Updated: Mar 26, 2026

Deploy the Qwen2-1.5B-Instruct model as an inference service on ACK using TensorRT-LLM and Triton Inference Server. This tutorial uses an A10 GPU and walks through the full pipeline: downloading the model from ModelScope, converting it to a TensorRT engine, pre-loading the engine into cache with Fluid, and serving it through Triton.

Background

Qwen2-1.5B-Instruct

Qwen2-1.5B-Instruct is a Transformer-based large language model (LLM) with 1.5 billion parameters, trained on web text, professional books, and code. For more information, see the Qwen2 GitHub repository.

Triton Inference Server

Triton Inference Server is an open source inference serving framework from NVIDIA. It supports multiple ML backends — including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM — and is optimized for real-time, batch, and audio and video streaming inference workloads. For more information, see the Triton Inference Server GitHub repository.

TensorRT-LLM

TensorRT-LLM is an open source library from NVIDIA that compiles LLMs into TensorRT engines optimized for NVIDIA GPUs. It supports both Tensor Parallelism and Pipeline Parallelism, and integrates with Triton as the TensorRT-LLM backend. For more information, see the TensorRT-LLM GitHub repository.

Prerequisites

Before you begin, make sure you have:

  • An ACK Managed Cluster Pro Edition (version 1.22 or later) with A10 GPU nodes. GPU nodes must use driver version 525. To pin the driver to version 525.105.17, add the label ack.aliyun.com/nvidia-driver-version:525.105.17 to the GPU node pool. For more information, see Create an ACK managed cluster and Customize the GPU driver version of a node by specifying a version number.

  • The cloud-native AI suite installed with the ack-fluid component deployed.

    • If the cloud-native AI suite is not yet installed: deploy Fluid and enable data caching acceleration. See Deploy the cloud-native AI suite.

    • If the cloud-native AI suite is already installed: go to the Marketplace page in the ACK console and deploy the ack-fluid component.

    Important

    If you have open source Fluid installed, uninstall it before deploying ack-fluid.

  • The latest version of the Arena client. See Configure the Arena client.

  • Object Storage Service (OSS) activated with a bucket created. See Activate OSS and Create buckets.

Step 1: Create a Dataset and a JindoRuntime

A Dataset describes the remote storage location of the model files. A JindoRuntime provides a caching layer in front of OSS so that subsequent reads hit memory rather than the network. Together, they significantly reduce model loading time during inference.

  1. Create a Secret to store your OSS credentials.

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: fluid-oss-secret
    stringData:
      fs.oss.accessKeyId: <Your AccessKey ID>
      fs.oss.accessKeySecret: <Your AccessKey secret>
    EOF

    Replace <Your AccessKey ID> and <Your AccessKey secret> with your credentials. To get an AccessKey pair, see Obtain an AccessKey. Expected output:

    secret/fluid-oss-secret created
  2. Create a file named dataset.yaml with the following content. This configuration creates a Dataset pointing to the OSS path where the model is stored, and a JindoRuntime that caches model data in 20 GiB of shared memory across two replicas.

    # Dataset: describes the remote data source and mount configuration.
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: qwen2-oss
    spec:
      mounts:
      - mountPoint: oss://<oss_bucket>/qwen2-1.5b   # Replace with your OSS path.
        name: qwen2
        path: /
        options:
          fs.oss.endpoint: <oss_endpoint>            # Replace with your OSS endpoint.
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: fluid-oss-secret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: fluid-oss-secret
                key: fs.oss.accessKeySecret
      accessModes:
        - ReadWriteMany
    ---
    # JindoRuntime: starts a JindoFS cluster that provides caching services.
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: qwen2-oss
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            volumeType: emptyDir
            path: /dev/shm
            quota: 20Gi
            high: "0.95"
            low: "0.7"
      fuse:
        properties:
          fs.oss.read.buffer.size: "8388608"              # 8 MiB read buffer
          fs.oss.download.thread.concurrency: "200"
          fs.oss.read.readahead.max.buffer.count: "200"
          fs.oss.read.sequence.ambiguity.range: "2147483647"
        args:
          - -oauto_cache
          - -oattr_timeout=1
          - -oentry_timeout=1
          - -onegative_timeout=1

    For more information about Dataset and JindoRuntime configuration, see Accelerate access to OSS files using JindoFS.

  3. Apply the configuration.

    kubectl apply -f dataset.yaml

    Expected output:

    dataset.data.fluid.io/qwen2-oss created
    jindoruntime.data.fluid.io/qwen2-oss created
  4. Verify that the Dataset is bound and ready.

    kubectl get dataset qwen2-oss

    Expected output:

    NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s

    The PHASE: Bound status confirms the Dataset is ready.
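To script this check instead of reading the table by eye, one option is to fetch the Dataset as JSON and inspect its phase. The following is a minimal sketch and not part of the original procedure. It assumes `kubectl` is on the PATH and that the value shown in the PHASE column is stored under `status.phase`. The helper names `dataset_phase`, `fetch_dataset`, and `wait_until_bound` are hypothetical.

```python
import json
import subprocess
import time


def dataset_phase(dataset: dict) -> str:
    """Return the phase of a Dataset object, or "" if none is reported yet."""
    # Assumption: the PHASE column is backed by the status.phase field.
    return dataset.get("status", {}).get("phase", "")


def fetch_dataset(name: str, namespace: str = "default") -> dict:
    """Fetch a Dataset as a dict by shelling out to kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "dataset", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_until_bound(name: str, timeout_s: int = 300, interval_s: int = 5) -> bool:
    """Poll until the Dataset reports Bound or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if dataset_phase(fetch_dataset(name)) == "Bound":
            return True
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("bound" if wait_until_bound("qwen2-oss") else "timed out")
```

The timeout and polling interval are arbitrary illustrative values; adjust them to your cluster.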

Step 2: Create a Dataflow

The Dataflow automates the full model preparation pipeline:

  1. Download — pull Qwen2-1.5B-Instruct from ModelScope into the OSS-backed Dataset.

  2. Convert and build — convert the model checkpoint to float16 format and compile it into a TensorRT engine.

  3. Warm up cache — pre-load the engine and Triton backend configuration into the JindoRuntime memory cache.

The convert-and-build step is the most time-consuming. It requires one GPU and up to 30 GiB of memory, and typically takes 15–20 minutes. See Troubleshooting if it fails.
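The three operations run strictly in sequence because each later operation names its predecessor in a runAfter field. As a rough illustration of that chaining, the sketch below derives an execution order from such links. It is a model of the ordering only, not how Fluid's controller is implemented, and the function name `execution_order` is hypothetical.

```python
def execution_order(run_after: dict) -> list:
    """Order steps so that each one follows the step named in its runAfter.

    run_after maps a step name to its predecessor, or to None for the first step.
    """
    order = []
    remaining = dict(run_after)
    while remaining:
        # A step is ready when it has no predecessor or its predecessor already ran.
        ready = [s for s, dep in remaining.items() if dep is None or dep in order]
        if not ready:
            raise ValueError("cycle or missing predecessor in runAfter chain")
        for step in sorted(ready):
            order.append(step)
            del remaining[step]
    return order


# The three operations defined in this tutorial's dataflow.yaml.
DATAFLOW = {
    "step1-download-model": None,
    "step2-trtllm-convert": "step1-download-model",
    "step3-warmup-cache": "step2-trtllm-convert",
}

if __name__ == "__main__":
    for step in execution_order(DATAFLOW):
        print(step)
```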
  1. Create a file named dataflow.yaml with the following content.

    # Step 1: Download Qwen2-1.5B-Instruct from ModelScope.
    apiVersion: data.fluid.io/v1alpha1
    kind: DataProcess
    metadata:
      name: step1-download-model
    spec:
      dataset:
        name: qwen2-oss
        namespace: default
        mountPath: /mnt/models/
      processor:
        script:
          image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base
          imageTag: ubuntu22.04
          imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
          command:
          - bash
          source: |
            #!/bin/bash
            echo "Downloading model..."
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
                echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct exists. Skipping model download."
            else
                apt update && apt install -y git git-lfs
                git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
                mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
            fi
          env:
          - name: MODEL_MOUNT_PATH
            value: "/mnt/models"
    ---
    # Step 2: Convert the model checkpoint and build the TensorRT engine.
    apiVersion: data.fluid.io/v1alpha1
    kind: DataProcess
    metadata:
      name: step2-trtllm-convert
    spec:
      runAfter:
        kind: DataProcess
        name: step1-download-model
        namespace: default
      dataset:
        name: qwen2-oss
        namespace: default
        mountPath: /mnt/models/
      processor:
        script:
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build
          imageTag: 24.07-trtllm-python-py3
          imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
          command:
          - bash
          source: |
            #!/bin/bash
            set -ex
    
            cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
                echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt exists. Skipping checkpoint conversion."
            else
                echo "Converting checkpoint..."
                python3 convert_checkpoint.py --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct --output_dir /root/Qwen2-1.5B-Instruct-ckpt --dtype float16
    
                echo "Writing TensorRT-LLM model checkpoint to OSS bucket..."
                mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
            fi
    
            sleep 2
    
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
                echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine exists. Skipping engine build."
            else
                echo "Building TensorRT-LLM engine..."
                trtllm-build --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
                --gemm_plugin float16 \
                --paged_kv_cache enable \
                --output_dir /root/Qwen2-1.5B-Instruct-engine
    
                echo "Writing TensorRT-LLM engine to OSS bucket..."
                mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
            fi
    
            if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
                echo "Directory ${MODEL_MOUNT_PATH}/tensorrtllm_backend exists. Skipping tensorrtllm_backend configuration."
            else
                echo "Configuring model..."
                cd /tensorrtllm_backend
                cp all_models/inflight_batcher_llm/ qwen2_ifb -r
                export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
                export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine
    
                python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
                python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
                python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
                python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt triton_max_batch_size:8
                python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
    
                echo "Writing TensorRT-LLM configuration to OSS bucket..."
                mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
                mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
            fi
          env:
          - name: MODEL_MOUNT_PATH
            value: "/mnt/models"
          resources:
            requests:
              cpu: 2
              memory: 10Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 12
              memory: 30Gi
              nvidia.com/gpu: 1
    ---
    # Step 3: Pre-load the engine and backend configuration into the JindoRuntime cache.
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: step3-warmup-cache
    spec:
      runAfter:
        kind: DataProcess
        name: step2-trtllm-convert
        namespace: default
      dataset:
        name: qwen2-oss
        namespace: default
      loadMetadata: true
      target:
      - path: /Qwen2-1.5B-Instruct-engine
      - path: /tensorrtllm_backend
  2. Create the Dataflow.

    kubectl create -f dataflow.yaml

    Expected output:

    dataprocess.data.fluid.io/step1-download-model created
    dataprocess.data.fluid.io/step2-trtllm-convert created
    dataload.data.fluid.io/step3-warmup-cache created
  3. Monitor progress and wait for all steps to complete.

    kubectl get dataprocess

    Expected output (both steps show Complete):

    NAME                   DATASET     PHASE      AGE   DURATION
    step1-download-model   qwen2-oss   Complete   23m   3m2s
    step2-trtllm-convert   qwen2-oss   Complete   23m   19m58s

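    step3-warmup-cache is a DataLoad, not a DataProcess, so it does not appear in this output; run kubectl get dataload step3-warmup-cache to check it. If a step fails or gets stuck, see Troubleshooting.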
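If you want to check the phases from a script, the tabular output can be parsed directly. The sketch below is one way to do that, using the expected output above as sample input. It assumes whitespace-separated columns with no empty cell before PHASE, and the helper names `parse_phases` and `all_complete` are hypothetical.

```python
def parse_phases(table: str) -> dict:
    """Map each resource name to its PHASE from `kubectl get` tabular output."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    name_col, phase_col = header.index("NAME"), header.index("PHASE")
    phases = {}
    for line in lines[1:]:
        cols = line.split()
        phases[cols[name_col]] = cols[phase_col]
    return phases


def all_complete(phases: dict) -> bool:
    """True only if at least one step is listed and every step is Complete."""
    return bool(phases) and all(p == "Complete" for p in phases.values())


# Sample copied from the expected output of `kubectl get dataprocess`.
SAMPLE = """\
NAME                   DATASET     PHASE      AGE   DURATION
step1-download-model   qwen2-oss   Complete   23m   3m2s
step2-trtllm-convert   qwen2-oss   Complete   23m   19m58s
"""

if __name__ == "__main__":
    phases = parse_phases(SAMPLE)
    print(phases, all_complete(phases))
```

In practice you would feed the function the captured output of the kubectl command rather than a fixed string.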

Step 3: Deploy the inference service

The three most important parameters for this deployment are:

  • --gpus=1 — allocates one GPU per pod for model inference

  • --data=qwen2-oss:/mnt/models — mounts the Fluid-managed PersistentVolumeClaim (PVC) containing the TensorRT engine and Triton configuration

  • --image — specifies the Triton container image with TensorRT-LLM backend support

For descriptions of all parameters, see Appendix: Arena command parameter reference.

  1. Deploy the inference service using Arena.

    arena serve custom \
      --name=qwen2-chat \
      --version=v1 \
      --gpus=1 \
      --replicas=1 \
      --restful-port=8000 \
      --readiness-probe-action="tcpSocket" \
      --readiness-probe-action-option="port: 8000" \
      --readiness-probe-option="initialDelaySeconds: 30" \
      --readiness-probe-option="periodSeconds: 30" \
      --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
      --data=qwen2-oss:/mnt/models \
      "tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"

    Expected output:

    service/qwen2-chat-v1 created
    deployment.apps/qwen2-chat-v1-custom-serving created
    INFO[0003] The Job qwen2-chat has been submitted successfully
    INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status
  2. Wait for the service to become ready.

    arena serve get qwen2-chat

    Expected output:

    Name:       qwen2-chat
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        1m
    Address:    192.XX.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                           ------   ---  -----  --------  ---  ----
      qwen2-chat-v1-custom-serving-657869c698-hl665  Running  1m   1/1    0         1    ap-southeast-1.192.XX.XX.XX

    Proceed to the next step once Available: 1 and the pod status shows READY 1/1.
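The readiness probe configured in the arena command is a tcpSocket check on port 8000, first run after 30 seconds and repeated every 30 seconds. What such a check amounts to can be sketched in a few lines. This illustrates the probe's logic only and is not the kubelet's implementation; the function name `tcp_ready` and the one-second timeout are assumptions made for the example.

```python
import socket


def tcp_ready(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Hypothetical target: adjust host and port to an address you can reach.
    print(tcp_ready("127.0.0.1", 8000))
```

A TCP check only shows that the port accepts connections. Triton also exposes an HTTP readiness endpoint at /v2/health/ready, which could serve as a stricter probe if you need one.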

Step 4: Validate the inference service

  1. Set up port forwarding to access the service locally.

    Important

    Port forwarding is for development and debugging only. It does not provide production-level reliability, security, or scalability. For production networking, see Manage Ingresses.

    kubectl port-forward svc/qwen2-chat-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Send a test inference request.

    curl -X POST localhost:8000/v2/models/ensemble/generate \
      -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

    Expected output:

    {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is a type of artificial intelligence that allows computer systems to learn from data without being explicitly programmed."}
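The same request can be sent from Python using only the standard library. The sketch below mirrors the curl command above: the payload fields, including pad_id and end_id, are copied from that command as-is, and the base URL assumes the port forwarding from the previous step is still running. The function names `build_payload`, `extract_text`, and `generate` are hypothetical.

```python
import json
import urllib.request


def build_payload(prompt: str, max_tokens: int = 20) -> dict:
    """Build the same request body as the curl example."""
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
        "pad_id": 2,
        "end_id": 2,
    }


def extract_text(response_body: str) -> str:
    """Pull the generated text out of the JSON response."""
    return json.loads(response_body)["text_output"]


def generate(prompt: str, base_url: str = "http://localhost:8000",
             timeout_s: float = 60.0) -> str:
    """POST to the ensemble model's generate endpoint and return the text."""
    req = urllib.request.Request(
        f"{base_url}/v2/models/ensemble/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return extract_text(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(generate("What is machine learning?"))
```

Keeping payload construction and response parsing in separate functions makes them easy to check without a running service.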

(Optional) Step 5: Clean up the environment

To delete the inference service, run:

arena serve delete qwen2-chat

Troubleshooting

The engine build runs out of GPU memory

Symptom: step2-trtllm-convert fails with an out-of-memory (OOM) error during the trtllm-build phase.

Check the pod logs for the specific error:

kubectl logs -l fluid.io/dataprocess=step2-trtllm-convert --tail=50

The step2-trtllm-convert job requests one GPU, and an A10 provides 24 GiB of GPU memory. If the GPU node is shared with other workloads, free up GPU memory before retrying. To retry from the failed step without re-running the download, delete only step2-trtllm-convert and re-create it. The script skips any stage whose output directory already exists.
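To see which stages a retry would actually run, it helps to model the directory checks in the step2-trtllm-convert script. The sketch below lists the three marker directories the script tests for and returns the stages still pending. The stage labels and the function name `stages_to_run` are hypothetical; the directory names come from the script.

```python
# Marker directory checked by each stage of the step2-trtllm-convert script, in order.
STAGES = [
    ("checkpoint conversion", "Qwen2-1.5B-Instruct-ckpt"),
    ("engine build", "Qwen2-1.5B-Instruct-engine"),
    ("backend configuration", "tensorrtllm_backend"),
]


def stages_to_run(existing_dirs: set) -> list:
    """Return the stages that would still run, given directories already in the Dataset."""
    return [stage for stage, marker in STAGES if marker not in existing_dirs]


if __name__ == "__main__":
    # Example: the download and checkpoint conversion finished, the engine build failed.
    print(stages_to_run({"Qwen2-1.5B-Instruct", "Qwen2-1.5B-Instruct-ckpt"}))
```

Because the script tests only whether a directory exists, a directory left half-written by an interrupted move would also cause its stage to be skipped. If you suspect that, remove the directory from the bucket before retrying.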

The OSS path is not found

Symptom: step1-download-model fails, or step2-trtllm-convert cannot find the model files at ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct.

Verify the Dataset mount is correct:

kubectl describe dataset qwen2-oss

Check that mountPoint in dataset.yaml matches your OSS bucket path (format: oss://<bucket-name>/qwen2-1.5b), and that fs.oss.endpoint is set to the endpoint for the region where your bucket is located.
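A quick sanity check on the mountPoint string can catch the most common mistakes, such as a placeholder that was never replaced or a wrong scheme. The following is a small sketch under that assumption. The function name `parse_oss_mount_point` and the bucket name `my-bucket` are hypothetical, and the check validates only the string format, not whether the bucket exists.

```python
from urllib.parse import urlparse


def parse_oss_mount_point(mount_point: str) -> tuple:
    """Split oss://<bucket>/<path> into (bucket, path); raise ValueError if malformed."""
    if "<" in mount_point or ">" in mount_point:
        raise ValueError("a placeholder such as <oss_bucket> was not replaced")
    parsed = urlparse(mount_point)
    if parsed.scheme != "oss" or not parsed.netloc:
        raise ValueError(f"expected oss://<bucket>/<path>, got {mount_point!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    print(parse_oss_mount_point("oss://my-bucket/qwen2-1.5b"))
```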

A DataProcess job is stuck in a non-Complete phase

Symptom: kubectl get dataprocess shows a step in Pending or Running state for an unusually long time.

Describe the job to check for scheduling issues or errors:

kubectl describe dataprocess step2-trtllm-convert

Check the Events section for messages such as insufficient GPU resources, image pull failures, or node taints preventing scheduling. The engine build step (step2-trtllm-convert) typically takes 15–20 minutes; allow that time before investigating.

Appendix: Arena command parameter reference

The following table describes all parameters used in the arena serve custom command in Step 3.

| Parameter | Description | Example |
| --- | --- | --- |
| serve custom | Arena subcommand. Deploys a custom model service instead of a preset type such as tfserving or triton. | (none) |
| --name | Service name. A unique name for the service, used in subsequent management operations such as viewing logs and deleting the service. | qwen2-chat |
| --version | Service version. A version label for the service, used in version management and phased releases. | v1 |
| --gpus | GPU count. The number of GPUs allocated to each pod. Required when the model needs GPUs for inference. | 1 |
| --replicas | Replica count. The number of service pods to run. Increasing replicas improves concurrent throughput and availability. | 1 |
| --restful-port | RESTful port. The port on which the service exposes its REST API for inference requests. | 8000 |
| --readiness-probe-action | Readiness probe type. Sets the check method for the Kubernetes readiness probe, which determines when the container is ready to receive traffic. | tcpSocket |
| --readiness-probe-action-option | Probe options. Parameters for the probe type. For tcpSocket, specifies the port to check. | port: 8000 |
| --readiness-probe-option | Additional probe settings. Can be specified multiple times. Sets the initial delay and check interval. | initialDelaySeconds: 30, periodSeconds: 30 |
| --data | Volume mount. Mounts a PVC to a path in the container. Format: PVC-name:mount-path. Used to mount model files from the Fluid-managed Dataset. | qwen2-oss:/mnt/models |
| --image | Container image. The full URL of the container image. Defines the runtime environment for the service. | kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 |
| [COMMAND] | Start command. The command to run after the container starts. Launches the Triton server with the model repository and port configuration. | "tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_" |