Container Service for Kubernetes: Securely Deploy vLLM Inference Services in ACK Heterogeneous Confidential Computing Clusters

Last Updated: Mar 26, 2026

Large Language Model (LLM) inference involves sensitive data and proprietary model weights. Running LLMs in untrusted environments risks exposing both. ACK Confidential AI (ACK-CAI) addresses this by integrating Intel® Trust Domain Extensions (TDX) and NVIDIA GPU trusted execution environment (TEE) hardware technologies to deliver end-to-end security for model inference.

With ACK-CAI, you can deploy vLLM inference services in ACK heterogeneous confidential computing clusters with the following protections:

  • Hardware-level isolation: Intel TDX and NVIDIA GPU TEE build a hardware-enforced trusted execution environment, protecting model weights and inference data during computation.

  • Remote attestation-based key distribution: The Trustee service cryptographically verifies the runtime environment before distributing model decryption keys—keys are released only to verified, trusted environments.

  • End-to-end encryption: A Trusted Network Gateway (TNG) establishes an encrypted channel from client to server, protecting inference requests and responses in transit.

  • Non-intrusive integration: A Kubernetes webhook automatically injects security components into pods based on annotations, with no changes required to your application code or images.

How it works

ACK-CAI provides confidential computing capabilities by injecting a set of sidecar containers called Trustiflux into application pods. Security is enforced through remote attestation, ensuring that models and inference data are accessible only inside a verified trusted environment.

Core components

Component | Role
ACK heterogeneous confidential computing cluster | A Kubernetes cluster built on TDX confidential instances with GPU confidential computing capabilities
Trustee remote attestation service | Verifies the trustworthiness of the runtime environment and distributes model decryption keys after successful verification
Attestation Agent (AA) | Runs inside the Trustiflux sidecar; performs remote attestation and retrieves decryption keys
Confidential Data Hub (CDH) | Runs inside the Trustiflux sidecar; decrypts ciphertext model data
Trusted Network Gateway Server (TNG Server) | Runs inside the Trustiflux sidecar; establishes and manages secure communication channels
Cachefs | Runs inside the Trustiflux sidecar; decrypts the encrypted model files and mounts them into the inference container
Trusted Network Gateway Client (TNG Client) | Runs on the client machine; establishes a secure channel to the cluster
Inference service | The container that runs vLLM and serves model inference requests

Core security mechanisms

The solution enforces security through two mechanisms:

Remote attestation-based encrypted model distribution:

  1. When the pod starts, the Attestation Agent (AA) in the sidecar sends an attestation request to the Trustee service.

  2. Trustee verifies the trustworthiness of both the CPU (TDX) and GPU confidential environments.

  3. After successful verification, Trustee distributes the model decryption key to the pod.

  4. CDH and Cachefs use the key to decrypt the encrypted model files and mount them into the inference container.

Remote attestation-based end-to-end encrypted inference:

  1. The client-side inference program sends requests through the local TNG Client.

  2. Requests remain encrypted throughout transmission, preventing man-in-the-middle attacks.

  3. The TNG Server sidecar decrypts incoming requests, which are then processed by the inference service.

  4. The TNG Server encrypts the inference results and returns them to the client.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud account with permissions to create ACK clusters and Elastic Compute Service (ECS) instances

  • A dedicated server (ECS instance or on-premises) to deploy the Trustee remote attestation service, with public network access and port 8081 open

  • A machine with kubectl and Helm installed, with access to the ACK cluster API server

  • Docker installed on the client machine used to access the inference service

  • (Optional) An Object Storage Service (OSS) bucket in the China (Beijing) region for storing encrypted model files

Deployment overview

Deploying a secure vLLM inference service involves six steps across different environments.

Tip: Step 1 (encrypting the model) is optional. If you want to test the solution quickly, skip to Step 2 and use the pre-encrypted sample models stored in a public-read OSS bucket.

Step | Purpose | Environment
Step 1: Prepare encrypted models | Encrypt the inference model and upload it to OSS for secure static storage | A dedicated data preparation server
Step 2: Deploy the Trustee remote attestation service | Deploy the root-of-trust service that verifies environments and distributes decryption keys | A dedicated standalone server outside the ACK cluster
Step 3: Configure the ACK confidential computing cluster | Create the Kubernetes cluster and add confidential computing GPU nodes | Alibaba Cloud console (ACK, ECS) and the ecs.gn8v-tee instance shell
Step 4: Deploy ACK-CAI components | Install the CAI components that inject security capabilities into pods | ACK console
Step 5: Deploy the vLLM model inference service | Deploy the vLLM service using Helm, with confidential computing enabled via annotation | A machine with kubectl and Helm configured
Step 6: Access the inference service securely | Start the TNG client to access the deployed service through an encrypted channel | Client environment

Step 1: Prepare encrypted models

This step encrypts your model and uploads the ciphertext to OSS, preparing it for secure remote distribution.

Run these commands on a temporary ECS instance in the same region as your OSS bucket to maximize upload speed over the private network.

Download a model

If you already have a model, skip to Encrypt the model.

This example uses Qwen2.5-3B-Instruct and requires Python 3.9 or later. Install the ModelScope tool and download the model:

pip3 install modelscope importlib-metadata
modelscope download --model Qwen/Qwen2.5-3B-Instruct

The model downloads to ~/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct/.

Encrypt the model

ACK-CAI uses Gocryptfs encryption (based on AES-256-GCM). Only Gocryptfs v2.4.0 with default encryption parameters is supported.

  1. Install Gocryptfs using one of the following methods:

    Method 1 (recommended): Install from a yum repository

    If you are using Alibaba Cloud Linux 3 or AnolisOS 23, you can install gocryptfs directly from a yum repository.

    Alibaba Cloud Linux 3
    sudo yum install gocryptfs -y
    AnolisOS 23
    sudo yum install anolis-epao-release -y
    sudo yum install gocryptfs -y

    Method 2: Download the precompiled binary

    # Download Gocryptfs v2.4.0
    wget https://github.jobcher.com/gh/https://github.com/rfjakob/gocryptfs/releases/download/v2.4.0/gocryptfs_v2.4.0_linux-static_amd64.tar.gz
    
    # Extract and install
    tar xf gocryptfs_v2.4.0_linux-static_amd64.tar.gz
    sudo install -m 0755 ./gocryptfs /usr/local/bin
  2. Create the encryption key file. This key is uploaded to Trustee in a later step and used to decrypt the model at runtime. The example below uses 0Bn4Q1wwY9fN3P—use a randomly generated strong key in production.

    cat << EOF > ~/cachefs-password
    0Bn4Q1wwY9fN3P
    EOF
  3. Encrypt the model directory:

    1. Set the plaintext model path. Replace the path below if your model is stored elsewhere.

      PLAINTEXT_MODEL_PATH=~/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct/
    2. Initialize Gocryptfs and encrypt the model. After this completes, the encrypted model is stored in ~/mount/cipher.

      mkdir -p ~/mount
      cd ~/mount
      mkdir -p cipher plain

      # Install the FUSE runtime dependency
      sudo yum install -y fuse

      # Initialize the Gocryptfs encrypted directory
      cat ~/cachefs-password | gocryptfs -init cipher

      # Mount the encrypted directory
      cat ~/cachefs-password | gocryptfs cipher plain

      # Copy the model into the mounted plaintext directory
      cp -r ${PLAINTEXT_MODEL_PATH}/. ~/mount/plain
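
Step 2 above recommends a randomly generated key for production. The following is a minimal sketch of one way to produce such a key. The function name gen_model_key is hypothetical and not part of ACK-CAI; it only assumes that head, base64, and tr are available on the data preparation server.

```shell
# Hypothetical helper (not part of ACK-CAI): print a random 44-character password.
gen_model_key() {
  # Read 32 bytes from the kernel random source and base64-encode them.
  # The trailing newline is stripped so the same bytes can later be
  # written to the Trustee key file without modification.
  head -c 32 /dev/urandom | base64 | tr -d '\n'
}

# Example usage, run before "gocryptfs -init":
#   (umask 077 && gen_model_key > ~/cachefs-password)
```

The umask in the usage line keeps the key file readable only by its owner. Whatever key you generate must be the same value you import into Trustee in Step 2 of the deployment.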

Upload the model to OSS

Create an OSS bucket in the same region where you plan to deploy the confidential computing cluster. Create a directory such as oss://examplebucket/qwen-encrypted/ to store the encrypted model. For setup instructions, see Get started with OSS.

Because model files are large, use ossbrowser to upload the ~/mount/cipher directory to OSS.
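
Before uploading, it can help to confirm that the cipher directory was initialized. With default parameters, gocryptfs -init writes gocryptfs.conf and gocryptfs.diriv into the encrypted directory. The sketch below checks for both files; check_cipher_dir is a hypothetical helper, and whether Cachefs needs both files at runtime is an assumption rather than something this guide states.

```shell
# Hypothetical pre-upload check: verify the gocryptfs metadata files exist.
check_cipher_dir() {
  dir="$1"
  for meta in gocryptfs.conf gocryptfs.diriv; do
    if [ ! -f "$dir/$meta" ]; then
      echo "missing $dir/$meta" >&2
      return 1
    fi
  done
  echo "ok: $dir contains the gocryptfs metadata files"
}

# Example usage:
#   fusermount -u ~/mount/plain    # optional: unmount the plaintext view first
#   check_cipher_dir ~/mount/cipher
```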

Step 2: Deploy the Trustee remote attestation service

Following the zero trust principle, every confidential computing environment must pass verification before gaining access to sensitive resources such as model decryption keys. Trustee acts as the root of trust: it verifies the runtime environment and distributes keys only after successful attestation.

Important

Deploy Trustee on a standalone server outside the ACK cluster. The Trustee owner must maintain full control over the deployment environment—if a cloud provider controls the host, the trust guarantee is weakened.

Choose a deployment option

Option | Description | When to use
ECS instance | Create an ECS instance in the same VPC as the ACK cluster to run Trustee | Standard production deployments requiring efficient private network communication
On-premises server | Deploy Trustee in your own data center, connected to the cloud VPC via a leased line or VPN | High-security scenarios where you require full control over the root-of-trust environment

Before proceeding, make sure the Trustee server has public network access and that port 8081 is open.

Install and start Trustee

Trustee is packaged in RPM format and available in the official yum repositories for Alibaba Cloud Linux 3.x and Anolis OS 8.x and later. After installation, systemd manages the service automatically.

  1. Install Trustee:

    Note: For production deployments, configure HTTPS for the Trustee service.

    yum install trustee-1.5.2

    Trustee starts automatically and listens on port 8081. Access it at http://<trustee-ip>:8081/api, where <trustee-ip> is the IP address of the Trustee server.

  2. Verify that all service components are healthy. Install the jq tool first if needed (sudo yum install -y jq).

    # Replace <trustee-ip> with the Trustee server IP address
    curl http://<trustee-ip>:8081/api/services-health | jq

    All four components should show "status": "ok":

    {
      "gateway": {
        "status": "ok",
        "timestamp": "2025-08-26T13:46:13+08:00"
      },
      "kbs": {
        "status": "ok",
        "timestamp": "2025-08-26T13:46:13+08:00"
      },
      "as": {
        "status": "ok",
        "timestamp": "2025-08-26T13:46:13+08:00"
      },
      "rvps": {
        "status": "ok",
        "timestamp": "2025-08-26T13:46:13+08:00"
      }
    }
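
The health check above can be scripted so that it fails when any component is not ok. The following is a minimal sketch: trustee_healthy is a hypothetical helper that counts "status": "ok" entries in the response text instead of parsing the JSON, so it assumes the response shape shown above.

```shell
# Hypothetical helper: read the services-health JSON on stdin and succeed
# only if the number of "status": "ok" entries equals the expected count.
trustee_healthy() {
  expected="${1:-4}"
  ok_count=$(grep -o '"status"[[:space:]]*:[[:space:]]*"ok"' | wc -l | tr -d ' ')
  [ "$ok_count" -eq "$expected" ]
}

# Example usage (replace <trustee-ip> first):
#   curl -fsS "http://<trustee-ip>:8081/api/services-health" | trustee_healthy \
#     && echo "Trustee is healthy"
```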

Common Trustee service management commands

The Trustee service is managed by systemd:

  • Start: systemctl start trustee

  • Stop: systemctl stop trustee

  • Restart: systemctl restart trustee

  • Check status: systemctl status trustee

Import the model decryption key

After Trustee is running, import the model decryption key. Trustee maps file paths to key IDs, so the key you store at a given path is addressable by a corresponding URI.

  1. Create the key directory and write the key:

    sudo mkdir -p /opt/trustee/kbs/repository/default/aliyun/
    # Replace <model decryption key> with your actual key (e.g., 0Bn4Q1wwY9fN3P)
    sudo sh -c 'echo -n "<model decryption key>" > /opt/trustee/kbs/repository/default/aliyun/model-decryption-key'
  2. Confirm the key ID. The key stored at .../aliyun/model-decryption-key maps to the following URI in Trustee:

    kbs:///default/aliyun/model-decryption-key

    Use this URI in the model-decryption-key-id field when configuring ACK-CAI in Step 5.
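
The path-to-URI mapping described above can be sketched as a small string transformation. The helper name kbs_uri_for_path is hypothetical, and the repository root is taken from the mkdir command in this section.

```shell
# Repository root used by the key import commands in this section.
KBS_REPO_ROOT="/opt/trustee/kbs/repository"

# Hypothetical helper: derive the kbs:/// URI for a key file under the repository root.
kbs_uri_for_path() {
  case "$1" in
    "$KBS_REPO_ROOT"/*)
      # Strip the repository root and keep <repository>/<group>/<key>.
      printf 'kbs:///%s\n' "${1#"$KBS_REPO_ROOT"/}"
      ;;
    *)
      echo "path is not under $KBS_REPO_ROOT: $1" >&2
      return 1
      ;;
  esac
}

# Example usage:
#   kbs_uri_for_path /opt/trustee/kbs/repository/default/aliyun/model-decryption-key
```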

Step 3: Configure the ACK confidential computing cluster

This step creates an ACK cluster and adds ecs.gn8v-tee instances—which have both Intel TDX and NVIDIA GPU TEE capabilities enabled by default—as worker nodes.

gn8v-tee instance types have CPU and GPU confidential computing enabled by default. No additional configuration is needed to enable the confidential VM feature.
  1. Create an ACK managed cluster Pro edition in the China (Beijing) region. For instructions, see Create an ACK managed cluster.

  2. Create a node pool to manage the confidential computing instances. For instructions, see Create and manage node pools. Use the following settings:

    Setting | Value
    vSwitch | Select the virtual switch in China (Beijing) Zone L
    Scaling mode | Keep the default (do not enable automatic elastic scaling)
    Instance type | ecs.gn8v-tee.4xlarge or above
    Operating system | Alibaba Cloud Linux 3.2104 LTS 64-bit
    System disk | 100 GiB or more
    Expected number of nodes | 0 (default)
    Node labels | Key: ack.aliyun.com/nvidia-driver-version, Value: 550.144.03
  3. Create an ecs.gn8v-tee instance to use as a worker node. For instructions, see Custom purchase instances. Use the following settings:

    Setting | Value
    Region | China (Beijing)
    Network and zone | Same VPC as the cluster, Zone L
    Instance type | ecs.gn8v-tee.4xlarge or above
    Image | Alibaba Cloud Linux 3.2104 LTS 64-bit
  4. Log in to the ecs.gn8v-tee instance and install the NVIDIA drivers and CUDA toolkit. For instructions, see Install NVIDIA drivers and CUDA toolkit.

  5. Manually add the instance to the node pool. For instructions, see Add existing nodes.
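
The Helm chart in Step 5 schedules the pod only on nodes whose instance-type label matches a fixed list. The following sketch checks a node's instance type against that list. The helper name is hypothetical, and the list is copied from the nodeAffinity rule in the Step 5 chart, so update it if you change the chart.

```shell
# Instance types accepted by the nodeAffinity rule in the Step 5 chart.
TEE_INSTANCE_TYPES="ecs.gn8v-tee.4xlarge ecs.gn8v-tee.6xlarge ecs.gn8v-tee-8x.16xlarge ecs.gn8v-tee-8x.48xlarge"

# Hypothetical helper: succeed if the given instance type is in the list.
is_schedulable_tee_type() {
  for t in $TEE_INSTANCE_TYPES; do
    [ "$t" = "$1" ] && return 0
  done
  return 1
}

# Example usage:
#   kubectl get nodes -L node.kubernetes.io/instance-type   # show each node's type
#   is_schedulable_tee_type ecs.gn8v-tee.4xlarge && echo "matches the chart's affinity list"
```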

Step 4: Deploy ACK-CAI components

ACK-CAI uses a Kubernetes webhook controller to automatically inject Trustiflux sidecar containers into pods. These sidecars provide remote attestation, model decryption, and secure communication—all without modifying your application.

  1. Log in to the Container Service Management Console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click your cluster name. In the left navigation pane, click Applications > Helm.

  3. Click Create and follow the on-screen prompts to install the latest version of ACK-CAI. After installation, verify the deployment status in the Helm chart list.

Step 5: Deploy the vLLM model inference service

This step deploys the vLLM inference service using Helm. The deployment uses an annotation on the pod to signal ACK-CAI to inject the Trustiflux sidecar, enabling confidential computing protection.

Run these commands on a machine with kubectl and Helm configured and connected to the ACK cluster API server. You can also use Workbench or CloudShell.
  1. Create the Helm chart directory:

    mkdir -p ack-cai-vllm-demo
    cd ack-cai-vllm-demo
  2. Initialize the Helm chart. The chart schedules the vLLM pod on confidential computing GPU nodes and uses the CSI plugin to mount the encrypted model from OSS.

    Helm chart initialization script:

    # Create the templates directory and vLLM manifest
    mkdir -p ./templates
    cat <<EOF >templates/vllm.yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-oss
      namespace: {{ .Release.Namespace }}
      labels:
        alicloud-pvname: pv-oss
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: pv-oss
        volumeAttributes:
          bucket: {{ .Values.oss.bucket }}
          path: {{ .Values.oss.path }}
          url: {{ .Values.oss.url }}
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
        nodePublishSecretRef:
          name: oss-secret
          namespace: {{ .Release.Namespace }}
    
    ---
    
    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
      namespace: {{ .Release.Namespace }}
    stringData:
      akId: {{ .Values.oss.akId }}
      akSecret: {{ .Values.oss.akSecret }}
    
    ---
    
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-oss
      namespace: {{ .Release.Namespace }}
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 5Gi
      selector:
        matchLabels:
          alicloud-pvname: pv-oss
    
    ---
    
    apiVersion: v1
    kind: Service
    metadata:
      name: cai-vllm-svc
      namespace: {{ .Release.Namespace }}
      {{- if .Values.loadbalancer}}
      {{- if .Values.loadbalancer.aclId }}
      annotations:
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-acl-status: "on"
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-acl-id: {{ .Values.loadbalancer.aclId }}
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-acl-type: "white"
      {{- end }}
      {{- end }}
      labels:
        app: cai-vllm
    spec:
      ports:
      - port: 8080
        protocol: TCP
        targetPort: 8080
      selector:
        app: cai-vllm
      type: LoadBalancer
    
    ---
    
    apiVersion: v1
    kind: Pod
    metadata:
      name: cai-vllm
      namespace: {{ .Release.Namespace }}
      labels:
        app: cai-vllm
        trustiflux.alibaba.com/confidential-computing-mode: "ACK-CAI"
      annotations:
        trustiflux.alibaba.com/ack-cai-options: |
    {{ .Values.caiOptions | indent 6 }}
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - ecs.gn8v-tee.4xlarge
                  - ecs.gn8v-tee.6xlarge
                  - ecs.gn8v-tee-8x.16xlarge
                  - ecs.gn8v-tee-8x.48xlarge
      containers:
        - name: inference-service
          image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/llm-inference:vllm0.5.4-deepgpu-llm24.7-pytorch2.4.0-cuda12.4-ubuntu22.04
          command:
            - bash
          args: ["-c", "vllm serve /tmp/model --port 8080 --host 0.0.0.0 --served-model-name qwen2.5-3b-instruct --device cuda --dtype auto"]
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1  # Request 1 GPU card for this container
          volumeMounts:
            - name: pvc-oss
              mountPath: "/tmp/model"
    
      volumes:
        - name: pvc-oss
          persistentVolumeClaim:
            claimName: pvc-oss
    
    EOF
    
    # Create the Helm chart description file
    cat <<EOF > ./Chart.yaml
    apiVersion: v2
    name: vllm
    description: A test based on vllm for ack-cai
    type: application
    version: 0.1.0
    appVersion: "0.1.0"
    EOF
    
    # Create an empty values file
    touch values.yaml
  3. Edit values.yaml with your environment configuration. Replace <trustee-ip> with the Trustee server address and fill in your OSS details.

    caiOptions: |
      {
          "cipher-text-volume": "pvc-oss",
          "model-decryption-key-id" : "kbs:///default/aliyun/model-decryption-key",
          "trustee-address": "http://<trustee-ip>:8081/api"
      }
    oss:
      bucket: "conf-ai"                          # OSS bucket name where the encrypted model is stored
      path: "/qwen2.5-3b-gocryptfs/"             # Path to the encrypted model within the bucket
      url: "https://oss-cn-beijing-internal.aliyuncs.com"   # OSS endpoint
      akId: "xxxxx"                              # Alibaba Cloud AccessKey ID
      akSecret: "xxxxx"                          # Alibaba Cloud AccessKey Secret
  4. Deploy the vLLM service:

    helm install vllm . -n default
  5. Verify that the ACK-CAI sidecar containers are injected into the pod:

    kubectl get pod cai-vllm -n default -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{range .state.running}Running{end}{.state.*.reason}{"\n"}{end}{range .status.containerStatuses[*]}{.name}{"\t"}{range .state.running}Running{end}{.state.*.reason}{"\n"}{end}'

    The output should list all five containers. Wait until they all show Running:

    cai-sidecar-attestation-agent   Running
    cai-sidecar-confidential-data-hub       Running
    cai-sidecar-tng Running
    cai-sidecar-cachefs     Running
    inference-service       Running
  6. Get the vLLM service address:

    kubectl get service cai-vllm-svc -o jsonpath='http://{.status.loadBalancer.ingress[0].ip}:{.spec.ports[0].port}{"\n"}'

    The output returns a URL in the format http://<vllm-ip>:8080. Save this address for Step 6.
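
The container check in step 5 can be automated. The sketch below reads the name and state lines printed by that kubectl command and succeeds only when the expected number of containers all report Running. The helper name all_containers_running is hypothetical.

```shell
# Hypothetical helper: read "<name><TAB><state>" lines on stdin and succeed only
# if exactly the expected number of containers are listed and all are Running.
all_containers_running() {
  expected="${1:-5}"
  awk -v expected="$expected" '
    NF >= 2 { total++; if ($2 == "Running") running++ }
    END { exit !(total == expected && running == expected) }
  '
}

# Example usage:
#   <kubectl get pod ... command from step 5> | all_containers_running 5 \
#     && echo "all containers are running"
```

As an alternative, kubectl wait --for=condition=Ready pod/cai-vllm -n default --timeout=10m blocks until the pod reports Ready, although it does not confirm that the sidecars were injected.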

Step 6: Access the inference service securely

Before sending inference requests, start a TNG Client on your machine to establish an encrypted channel. The TNG Client creates a local proxy on port 41000 that encrypts all outbound requests and decrypts incoming responses.

  1. Start the TNG Client:

    Replace <trustee-ip> with the Trustee server address.
    docker run -d \
        --network=host \
        confidential-ai-registry.cn-shanghai.cr.aliyuncs.com/product/tng:2.2.4 \
        tng launch --config-content '
          {
            "add_ingress": [
              {
                "http_proxy": {
                  "proxy_listen": {
                    "host": "0.0.0.0",
                    "port": 41000
                  }
                },
                "encap_in_http": {},
                "verify": {
                  "as_addr": "http://<trustee-ip>:8081/api/attestation-service/",
                  "policy_ids": [
                    "default"
                  ]
                }
              }
            ]
          }
    '
  2. Send a request through the TNG proxy. Replace <vllm-ip>:<port> with the service address obtained in Step 5.

    # Route requests through the local TNG proxy
    export http_proxy=http://127.0.0.1:41000
    
    curl http://<vllm-ip>:<port>/v1/completions \
      -H "Content-type: application/json" \
      -d '{
        "model": "qwen2.5-3b-instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
        }'
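
Exporting http_proxy affects every later HTTP request in the shell session. A minimal alternative sketch is to pass the proxy per request with curl's --proxy option, so that only inference calls go through TNG. The wrapper name tng_curl and the variables TNG_PROXY and CURL_BIN are hypothetical.

```shell
# Local TNG Client proxy address from step 1 (override through the environment if needed).
TNG_PROXY="${TNG_PROXY:-http://127.0.0.1:41000}"
# Binary to run; overridable so the wrapper can be dry-run with "echo".
CURL_BIN="${CURL_BIN:-curl}"

# Hypothetical wrapper: send a single request through the TNG proxy.
tng_curl() {
  "$CURL_BIN" --proxy "$TNG_PROXY" "$@"
}

# Example usage (replace <vllm-ip>:<port> first):
#   tng_curl "http://<vllm-ip>:<port>/v1/completions" \
#     -H "Content-type: application/json" \
#     -d '{"model": "qwen2.5-3b-instruct", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```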

Reference

caiOptions configuration

caiOptions is a JSON object passed as a pod annotation. ACK-CAI's admission webhook parses these parameters and configures the injected Trustiflux sidecar containers accordingly, enabling transparent model decryption, remote attestation, and encrypted networking.

Full configuration example:

{
  "cipher-text-volume": "pvc-oss",
  "model-decryption-key-id": "kbs:///default/aliyun/model-decryption-key",
  "trustee-address": "http://<trustee-ip>:8081/api",
  "aa-version": "1.3.1",
  "cdh-version": "1.3.1",
  "tng-version": "2.2.4",
  "cachefs-version": "1.0.7-2.6.1",
  "tdx-ra-enable": true,
  "gpu-ra-enable": true,
  "tng-http-secure-ports": [
    {
      "port": 8080
    }
  ]
}

Parameter | Required | Description
cipher-text-volume | Required | The PersistentVolumeClaim (PVC) name storing the encrypted model. ACK-CAI decrypts data from this PVC inside the trusted environment.
model-decryption-key-id | Required | The KBS URI of the model decryption key. Format: kbs:///<repository>/<group>/<key>.
trustee-address | Required | The Trustee service URL, used for remote attestation and key retrieval.
aa-version | Optional | The version of the Attestation Agent (AA) component.
cdh-version | Optional | The version of the Confidential Data Hub (CDH) component.
tng-version | Optional | The version of the Trusted Network Gateway (TNG) component.
cachefs-version | Optional | The version of the Cachefs component.
tdx-ra-enable | Optional | Enable remote attestation for the CPU (Intel TDX). Default: true. Setting this to false disables CPU environment verification and removes the hardware-level trust guarantee for the compute environment.
gpu-ra-enable | Optional | Enable remote attestation for the GPU. Default: true. Setting this to false disables GPU environment verification and removes the hardware-level trust guarantee for GPU workloads.
tng-http-secure-ports | Optional | Configure TNG to apply TLS encryption to traffic on specific HTTP ports. Accepts an array of port rules.
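
The three required parameters can be checked before deployment. The sketch below is a textual check only: it looks for each quoted key name and does not parse the JSON or validate the values. The helper name check_cai_options is hypothetical.

```shell
# Required caiOptions keys, taken from the parameter table above.
REQUIRED_CAI_KEYS="cipher-text-volume model-decryption-key-id trustee-address"

# Hypothetical helper: read a caiOptions JSON object on stdin and fail
# if any required key name does not appear in it.
check_cai_options() {
  json=$(cat)
  for key in $REQUIRED_CAI_KEYS; do
    case "$json" in
      *"\"$key\""*) ;;
      *)
        echo "caiOptions is missing required key: $key" >&2
        return 1
        ;;
    esac
  done
}

# Example usage, assuming the JSON object is saved in a file named cai-options.json:
#   check_cai_options < cai-options.json && echo "required keys present"
```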

tng-http-secure-ports example:

"tng-http-secure-ports": [
  {
    "port": 8080,
    "allow-insecure-request-regexes": [
      "/api/builtin/.*"
    ]
  }
]
  • port: The HTTP service port that TNG encrypts.

  • allow-insecure-request-regexes: An array of path regex patterns. Requests whose paths match any pattern bypass TNG encryption.

Encrypted model sample files

The following pre-encrypted models are available for testing in a public-read OSS bucket. Gocryptfs-encrypted models use the encryption password 0Bn4Q1wwY9fN3P.

Encrypted model file information

Model | Encryption method | OSS region | OSS endpoint | Storage path
Qwen3-32B (Gocryptfs) | Gocryptfs | cn-beijing | oss-cn-beijing-internal.aliyuncs.com | conf-ai:/qwen3-32b-gocryptfs/
Qwen3-32B (SAM) | SAM | cn-beijing | oss-cn-beijing-internal.aliyuncs.com | conf-ai:/qwen3-32b-sam/
Qwen2.5-3B-Instruct (Gocryptfs) | Gocryptfs | cn-beijing | oss-cn-beijing-internal.aliyuncs.com | conf-ai:/qwen2.5-3b-gocryptfs/
Qwen2.5-3B-Instruct (SAM) | SAM | cn-beijing | oss-cn-beijing-internal.aliyuncs.com | conf-ai:/qwen2.5-3b-sam/