Large Language Model (LLM) inference involves sensitive data and proprietary model weights. Running LLMs in untrusted environments risks exposing both. ACK Confidential AI (ACK-CAI) addresses this by integrating Intel® Trust Domain Extensions (TDX) and NVIDIA GPU trusted execution environment (TEE) hardware technologies to deliver end-to-end security for model inference.
With ACK-CAI, you can deploy vLLM inference services in ACK heterogeneous confidential computing clusters with the following protections:
- Hardware-level isolation: Intel TDX and NVIDIA GPU TEE build a hardware-enforced trusted execution environment, protecting model weights and inference data during computation.
- Remote attestation-based key distribution: The Trustee service cryptographically verifies the runtime environment before distributing model decryption keys. Keys are released only to verified, trusted environments.
- End-to-end encryption: A Trusted Network Gateway (TNG) establishes an encrypted channel from client to server, protecting inference requests and responses in transit.
- Non-intrusive integration: A Kubernetes webhook automatically injects security components into pods based on annotations, with no changes required to your application code or images.
How it works
ACK-CAI provides confidential computing capabilities by injecting a set of sidecar containers called Trustiflux into application pods. Security is enforced through remote attestation, ensuring that models and inference data are accessible only inside a verified trusted environment.
Prerequisites
Before you begin, make sure you have:
- An Alibaba Cloud account with permissions to create ACK clusters and Elastic Compute Service (ECS) instances
- A dedicated server (ECS instance or on-premises) to deploy the Trustee remote attestation service, with public network access and port 8081 open
- A machine with kubectl and Helm installed, with access to the ACK cluster API server
- Docker installed on the client machine used to access the inference service
- (Optional) An Object Storage Service (OSS) bucket in the China (Beijing) region for storing encrypted model files
Deployment overview
Deploying a secure vLLM inference service involves six steps across different environments.
Tip: Step 1 (encrypting the model) is optional. If you want to test the solution quickly, skip to Step 2 and use the pre-encrypted sample models stored in a public-read OSS bucket.
| Step | Purpose | Environment |
|---|---|---|
| Step 1: Prepare encrypted models | Encrypt the inference model and upload it to OSS for secure static storage | A dedicated data preparation server |
| Step 2: Deploy the Trustee remote attestation service | Deploy the root-of-trust service that verifies environments and distributes decryption keys | A dedicated standalone server outside the ACK cluster |
| Step 3: Configure the ACK confidential computing cluster | Create the Kubernetes cluster and add confidential computing GPU nodes | Alibaba Cloud console (ACK, ECS) and the ecs.gn8v-tee instance shell |
| Step 4: Deploy ACK-CAI components | Install the CAI components that inject security capabilities into pods | ACK console |
| Step 5: Deploy the vLLM model inference service | Deploy the vLLM service using Helm, with confidential computing enabled via annotation | A machine with kubectl and Helm configured |
| Step 6: Access the inference service securely | Start the TNG client to access the deployed service through an encrypted channel | Client environment |
Step 1: Prepare encrypted models
This step encrypts your model and uploads the ciphertext to OSS, preparing it for secure remote distribution.
Run these commands on a temporary ECS instance in the same region as your OSS bucket to maximize upload speed over the private network.
Download a model
If you already have a model, skip to Encrypt the model.
This example uses Qwen2.5-3B-Instruct, which requires Python 3.9 or later. Install the ModelScope tool and download the model:
pip3 install modelscope importlib-metadata
modelscope download --model Qwen/Qwen2.5-3B-Instruct
The model downloads to ~/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct/.
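The Python 3.9 requirement can be checked before installing ModelScope. The following is a minimal sketch; the helper name `version_at_least` is hypothetical and not part of ModelScope or ACK-CAI. It compares version components numerically, because a plain string comparison would wrongly rank 3.10 below 3.9.

```shell
# Return success if version "$1" (for example 3.10.2) is at least MAJOR.MINOR "$2".
# Components are compared as numbers, so 3.10 ranks above 3.9.
version_at_least() {
  have_major=${1%%.*}
  have_rest=${1#*.}
  have_minor=${have_rest%%.*}
  want_major=${2%%.*}
  want_minor=${2#*.}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Usage: compare the local interpreter against the 3.9 minimum.
if command -v python3 >/dev/null 2>&1; then
  py_version=$(python3 -c 'import platform; print(platform.python_version())')
  if version_at_least "$py_version" 3.9; then
    echo "Python $py_version is new enough"
  else
    echo "Python $py_version is older than 3.9" >&2
  fi
fi
```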
Encrypt the model
ACK-CAI uses Gocryptfs encryption (based on AES-256-GCM). Only Gocryptfs v2.4.0 with default encryption parameters is supported.
- Install Gocryptfs using one of the following methods:
Method 1 (recommended): Install from a yum repository
If you are using Alibaba Cloud Linux 3 or AnolisOS 23, you can install gocryptfs directly from a yum repository.
Alibaba Cloud Linux 3
sudo yum install gocryptfs -y
AnolisOS 23
sudo yum install anolis-epao-release -y
sudo yum install gocryptfs -y
Method 2: Download the precompiled binary
# Download Gocryptfs v2.4.0
wget https://github.jobcher.com/gh/https://github.com/rfjakob/gocryptfs/releases/download/v2.4.0/gocryptfs_v2.4.0_linux-static_amd64.tar.gz
# Extract and install
tar xf gocryptfs_v2.4.0_linux-static_amd64.tar.gz
sudo install -m 0755 ./gocryptfs /usr/local/bin
- Create the encryption key file. This key is uploaded to Trustee in a later step and used to decrypt the model at runtime. The example below uses 0Bn4Q1wwY9fN3P; use a randomly generated strong key in production.
cat << EOF > ~/cachefs-password
0Bn4Q1wwY9fN3P
EOF
- Encrypt the model directory:
  - Set the plaintext model path. Replace the path below if your model is stored elsewhere.
PLAINTEXT_MODEL_PATH=~/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct/
  - Initialize Gocryptfs and encrypt the model. After this completes, the encrypted model is stored in ~/mount/cipher.
mkdir -p ~/mount
cd ~/mount
mkdir -p cipher plain
# Install the FUSE runtime dependency
sudo yum install -y fuse
# Initialize the Gocryptfs encrypted directory
cat ~/cachefs-password | gocryptfs -init cipher
# Mount the encrypted directory
cat ~/cachefs-password | gocryptfs cipher plain
# Copy the model into the mounted plaintext directory
cp -r ${PLAINTEXT_MODEL_PATH}/. ~/mount/plain
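For production, the sample key should be replaced with a random one. The following is a minimal sketch of one way to generate such a key file; the function name `generate_key_file` and the scratch output path are hypothetical, and the 32-byte key length is an assumption rather than an ACK-CAI requirement.

```shell
# Sketch: write a random key to a password file.
# 32 random bytes are base64-encoded into one 44-character line.
generate_key_file() {
  key_file=$1
  (
    # umask 077 makes a newly created file readable only by its owner.
    umask 077
    key=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
    printf '%s\n' "$key" > "$key_file"
  )
}

# Usage: writes to a scratch path here rather than ~/cachefs-password.
generate_key_file "${TMPDIR:-/tmp}/cachefs-password.example"
```

The same key value must later be imported into Trustee, so keep the file until that step is done.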
Upload the model to OSS
Create an OSS bucket in the same region where you plan to deploy the confidential computing cluster. Create a directory such as oss://examplebucket/qwen-encrypted/ to store the encrypted model. For setup instructions, see Get started with OSS.
Because model files are large, use ossbrowser to upload the ~/mount/cipher directory to OSS.
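Before uploading, it can help to confirm that the directory being uploaded is the ciphertext directory and not the mounted plaintext view. The sketch below relies on the fact that `gocryptfs -init` with default parameters writes `gocryptfs.conf` and `gocryptfs.diriv` into the encrypted directory; the function name `is_gocryptfs_dir` is hypothetical.

```shell
# Succeed only if the directory contains the metadata files that
# gocryptfs -init creates with default parameters.
is_gocryptfs_dir() {
  [ -f "$1/gocryptfs.conf" ] && [ -f "$1/gocryptfs.diriv" ]
}

# Usage: check the ciphertext directory created in the previous step.
if is_gocryptfs_dir "$HOME/mount/cipher"; then
  echo "cipher directory looks ready for upload"
else
  echo "gocryptfs metadata not found in $HOME/mount/cipher" >&2
fi
```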
Step 2: Deploy the Trustee remote attestation service
Following the zero trust principle, every confidential computing environment must pass verification before gaining access to sensitive resources such as model decryption keys. Trustee acts as the root of trust: it verifies the runtime environment and distributes keys only after successful attestation.
Deploy Trustee on a standalone server outside the ACK cluster. The Trustee owner must maintain full control over the deployment environment—if a cloud provider controls the host, the trust guarantee is weakened.
Choose a deployment option
| Option | Description | When to use |
|---|---|---|
| ECS instance | Create an ECS instance in the same VPC as the ACK cluster to run Trustee | Standard production deployments requiring efficient private network communication |
| On-premises server | Deploy Trustee in your own data center, connected to the cloud VPC via a leased line or VPN | High-security scenarios where you require full control over the root-of-trust environment |
Before proceeding, make sure the Trustee server has public network access and that port 8081 is open.
Install and start Trustee
Trustee is packaged in RPM format and available in the official yum repositories for Alibaba Cloud Linux 3.x and Anolis OS 8.x and later. After installation, systemd manages the service automatically.
- Install Trustee:
yum install trustee-1.5.2
Trustee starts automatically and listens on port 8081. Access it at http://<trustee-ip>:8081/api, where <trustee-ip> is the IP address of the Trustee server.
For production deployments, configure HTTPS for the Trustee service.
- Verify that all service components are healthy. Install the jq tool first if needed (sudo yum install -y jq).
# Replace <trustee-ip> with the Trustee server IP address
curl http://<trustee-ip>:8081/api/services-health | jq
All four components should show "status": "ok":
{
  "gateway": {
    "status": "ok",
    "timestamp": "2025-08-26T13:46:13+08:00"
  },
  "kbs": {
    "status": "ok",
    "timestamp": "2025-08-26T13:46:13+08:00"
  },
  "as": {
    "status": "ok",
    "timestamp": "2025-08-26T13:46:13+08:00"
  },
  "rvps": {
    "status": "ok",
    "timestamp": "2025-08-26T13:46:13+08:00"
  }
}
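The health check can also be scripted, for example in a deployment pipeline. The following is a rough sketch, assuming the response keeps the shape shown above; the function name `all_components_ok` is hypothetical, and it counts `"status": "ok"` entries with grep instead of parsing the JSON, so it runs without extra tools.

```shell
# Read a services-health JSON document on stdin and succeed only if
# exactly four components report "status": "ok".
all_components_ok() {
  ok_count=$(grep -o '"status"[[:space:]]*:[[:space:]]*"ok"' | wc -l | tr -d ' ')
  [ "$ok_count" -eq 4 ]
}

# Usage (replace <trustee-ip> before running):
#   curl -s http://<trustee-ip>:8081/api/services-health | all_components_ok \
#     && echo "Trustee is healthy"
```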
Import the model decryption key
After Trustee is running, import the model decryption key. Trustee maps file paths to key IDs, so the key you store at a given path is addressable by a corresponding URI.
- Create the key directory and write the key:
sudo mkdir -p /opt/trustee/kbs/repository/default/aliyun/
# Replace <model decryption key> with your actual key (e.g., 0Bn4Q1wwY9fN3P)
sudo sh -c 'echo -n "<model decryption key>" > /opt/trustee/kbs/repository/default/aliyun/model-decryption-key'
- Confirm the key ID. The key stored at .../aliyun/model-decryption-key maps to the following URI in Trustee:
kbs:///default/aliyun/model-decryption-key
Use this URI in the model-decryption-key-id field when configuring ACK-CAI in Step 5.
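The path-to-URI mapping can be illustrated with a small helper. This is a sketch under the assumption that every key lives under the repository root used above, /opt/trustee/kbs/repository; the function name `kbs_uri_for_path` is hypothetical and not a Trustee command.

```shell
# Repository root taken from the key import command above.
KBS_REPOSITORY_ROOT=/opt/trustee/kbs/repository

# Print the kbs:/// URI for a key file stored under the repository root.
kbs_uri_for_path() {
  case $1 in
    "$KBS_REPOSITORY_ROOT"/*)
      printf 'kbs:///%s\n' "${1#"$KBS_REPOSITORY_ROOT"/}"
      ;;
    *)
      echo "not under $KBS_REPOSITORY_ROOT: $1" >&2
      return 1
      ;;
  esac
}

# Usage: derive the URI for the key written in the previous step.
kbs_uri_for_path /opt/trustee/kbs/repository/default/aliyun/model-decryption-key
```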
Step 3: Configure the ACK confidential computing cluster
This step creates an ACK cluster and adds ecs.gn8v-tee instances as worker nodes.
gn8v-tee instance types have both CPU (Intel TDX) and GPU (NVIDIA GPU TEE) confidential computing enabled by default. No additional configuration is needed to enable the confidential VM feature.
- Create an ACK managed cluster Pro edition in the China (Beijing) region. For instructions, see Create an ACK managed cluster.
- Create a node pool to manage the confidential computing instances. For instructions, see Create and manage node pools. Use the following settings:
| Setting | Value |
|---|---|
| vSwitch | Select the virtual switch in China (Beijing) Zone L |
| Scaling mode | Keep the default (do not enable automatic elastic scaling) |
| Instance type | ecs.gn8v-tee.4xlarge or above |
| Operating system | Alibaba Cloud Linux 3.2104 LTS 64-bit |
| System disk | 100 GiB or more |
| Expected number of nodes | 0 (default) |
| Node labels | Key: ack.aliyun.com/nvidia-driver-version, Value: 550.144.03 |
- Create an ecs.gn8v-tee instance to use as a worker node. For instructions, see Custom purchase instances. Use the following settings:
| Setting | Value |
|---|---|
| Region | China (Beijing) |
| Network and zone | Same VPC as the cluster, Zone L |
| Instance type | ecs.gn8v-tee.4xlarge or above |
| Image | Alibaba Cloud Linux 3.2104 LTS 64-bit |
- Log in to the ecs.gn8v-tee instance and install the NVIDIA drivers and CUDA toolkit. For instructions, see Install NVIDIA drivers and CUDA toolkit.
- Add the instance to the node pool using manual addition. For instructions, see Add existing nodes.
Step 4: Deploy ACK-CAI components
ACK-CAI uses a Kubernetes webhook controller to automatically inject Trustiflux sidecar containers into pods. These sidecars provide remote attestation, model decryption, and secure communication—all without modifying your application.
- Log in to the Container Service Management Console. In the left navigation pane, click Clusters.
- On the Clusters page, click your cluster name. In the left navigation pane, click Applications > Helm.
- Click Create and follow the on-screen prompts to install the latest version of ACK-CAI. After installation, verify the deployment status in the Helm chart list.
Step 5: Deploy the vLLM model inference service
This step deploys the vLLM inference service using Helm. The deployment uses an annotation on the pod to signal ACK-CAI to inject the Trustiflux sidecar, enabling confidential computing protection.
Run these commands on a machine with kubectl and Helm configured and connected to the ACK cluster API server. You can also use Workbench or CloudShell.
- Create the Helm chart directory:
mkdir -p ack-cai-vllm-demo
cd ack-cai-vllm-demo
- Initialize the Helm chart. The chart schedules the vLLM pod on confidential computing GPU nodes and uses the CSI plugin to mount the encrypted model from OSS.
- Edit values.yaml with your environment configuration. Replace <trustee-ip> with the Trustee server address and fill in your OSS details.
caiOptions: |
  {
    "cipher-text-volume": "pvc-oss",
    "model-decryption-key-id" : "kbs:///default/aliyun/model-decryption-key",
    "trustee-address": "http://<trustee-ip>:8081/api"
  }
oss:
  bucket: "conf-ai"                                  # OSS bucket name where the encrypted model is stored
  path: "/qwen2.5-3b-gocryptfs/"                     # Path to the encrypted model within the bucket
  url: "https://oss-cn-beijing-internal.aliyuncs.com" # OSS endpoint
  akId: "xxxxx"                                      # Alibaba Cloud AccessKey ID
  akSecret: "xxxxx"                                  # Alibaba Cloud AccessKey Secret
- Deploy the vLLM service:
helm install vllm . -n default
- Verify that the ACK-CAI sidecar containers are injected into the pod:
kubectl get pod cai-vllm -n default -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{range .state.running}Running{end}{.state.*.reason}{"\n"}{end}{range .status.containerStatuses[*]}{.name}{"\t"}{range .state.running}Running{end}{.state.*.reason}{"\n"}{end}'
The output should list all five containers. Wait until they all show Running:
cai-sidecar-attestation-agent       Running
cai-sidecar-confidential-data-hub   Running
cai-sidecar-tng                     Running
cai-sidecar-cachefs                 Running
inference-service                   Running
- Get the vLLM service address:
kubectl get service cai-vllm-svc -o jsonpath='http://{.status.loadBalancer.ingress[0].ip}:{.spec.ports[0].port}{"\n"}'
The output returns a URL in the format http://<vllm-ip>:8080. Save this address for Step 6.
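The container status check above can be turned into a pass/fail test for scripts. The following is a minimal sketch, assuming the tab-separated name and state lines produced by the kubectl jsonpath command in this step; the function name `all_containers_running` is hypothetical.

```shell
# Read "name<TAB>state" lines on stdin and succeed only if there are
# exactly five containers and every one of them reports Running.
all_containers_running() {
  awk -F '\t' '
    NF == 0 { next }                      # skip blank lines
    { total++ }
    $2 != "Running" { not_running++ }
    END { exit !(total == 5 && not_running == 0) }
  '
}

# Usage: pipe the output of the kubectl jsonpath command into the function,
# for example inside a loop that sleeps between attempts:
#   kubectl get pod cai-vllm -n default -o jsonpath='...' | all_containers_running
```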
Step 6: Access the inference service securely
Before sending inference requests, start a TNG Client on your machine to establish an encrypted channel. The TNG Client creates a local proxy on port 41000 that encrypts all outbound requests and decrypts incoming responses.
- Start the TNG Client. Replace <trustee-ip> with the Trustee server address.
docker run -d \
  --network=host \
  confidential-ai-registry.cn-shanghai.cr.aliyuncs.com/product/tng:2.2.4 \
  tng launch --config-content '
  {
    "add_ingress": [
      {
        "http_proxy": {
          "proxy_listen": {
            "host": "0.0.0.0",
            "port": 41000
          }
        },
        "encap_in_http": {},
        "verify": {
          "as_addr": "http://<trustee-ip>:8081/api/attestation-service/",
          "policy_ids": [
            "default"
          ]
        }
      }
    ]
  }
  '
- Send a request through the TNG proxy. Replace <vllm-ip>:<port> with the service address obtained in Step 5.
# Route requests through the local TNG proxy
export http_proxy=http://127.0.0.1:41000
curl http://<vllm-ip>:<port>/v1/completions \
  -H "Content-type: application/json" \
  -d '{
    "model": "qwen2.5-3b-instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
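When the Trustee address changes between environments, the TNG Client configuration can be generated rather than edited by hand. The following is a sketch that reproduces the configuration shown in this step with the Trustee address as a parameter; the function name `tng_client_config` is hypothetical, and 192.0.2.10 is a documentation-only placeholder address.

```shell
# Print the TNG Client configuration from this step, with the Trustee
# address passed as the first argument.
tng_client_config() {
  cat <<EOF
{
  "add_ingress": [
    {
      "http_proxy": { "proxy_listen": { "host": "0.0.0.0", "port": 41000 } },
      "encap_in_http": {},
      "verify": {
        "as_addr": "http://$1:8081/api/attestation-service/",
        "policy_ids": [ "default" ]
      }
    }
  ]
}
EOF
}

# Usage: print the configuration for a placeholder Trustee address.
tng_client_config 192.0.2.10

# The output can then be passed to the container, for example:
#   docker run -d --network=host <tng-image> \
#     tng launch --config-content "$(tng_client_config <trustee-ip>)"
```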
Reference
caiOptions configuration
caiOptions is a JSON object passed as a pod annotation. ACK-CAI's admission webhook parses these parameters and configures the injected Trustiflux sidecar containers accordingly, enabling transparent model decryption, remote attestation, and encrypted networking.
Full configuration example:
{
"cipher-text-volume": "pvc-oss",
"model-decryption-key-id": "kbs:///default/aliyun/model-decryption-key",
"trustee-address": "http://<trustee-ip>:8081/api",
"aa-version": "1.3.1",
"cdh-version": "1.3.1",
"tng-version": "2.2.4",
"cachefs-version": "1.0.7-2.6.1",
"tdx-ra-enable": true,
"gpu-ra-enable": true,
"tng-http-secure-ports": [
{
"port": 8080
}
]
}
| Parameter | Required | Description |
|---|---|---|
| cipher-text-volume | Required | The PersistentVolumeClaim (PVC) name storing the encrypted model. ACK-CAI decrypts data from this PVC inside the trusted environment. |
| model-decryption-key-id | Required | The KBS URI of the model decryption key. Format: kbs:///<repository>/<group>/<key>. |
| trustee-address | Required | The Trustee service URL, used for remote attestation and key retrieval. |
| aa-version | Optional | The version of the Attestation Agent (AA) component. |
| cdh-version | Optional | The version of the Confidential Data Hub (CDH) component. |
| tng-version | Optional | The version of the Trusted Network Gateway (TNG) component. |
| cachefs-version | Optional | The version of the Cachefs component. |
| tdx-ra-enable | Optional | Enable remote attestation for the CPU (Intel TDX). Default: true. Setting this to false disables CPU environment verification and removes the hardware-level trust guarantee for the compute environment. |
| gpu-ra-enable | Optional | Enable remote attestation for the GPU. Default: true. Setting this to false disables GPU environment verification and removes the hardware-level trust guarantee for GPU workloads. |
| tng-http-secure-ports | Optional | Configure TNG to apply TLS encryption to traffic on specific HTTP ports. Accepts an array of port rules. |
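A quick pre-deployment check can catch a caiOptions value that omits one of the three required parameters. The following is a rough sketch; the function name `missing_cai_options` is hypothetical, and it performs a plain substring search for each quoted key name rather than parsing the JSON, so it does not validate the values.

```shell
# Print the name of each required caiOptions key that does not appear
# (as a quoted string) in the JSON text passed as the first argument.
missing_cai_options() {
  options_json=$1
  for key in cipher-text-volume model-decryption-key-id trustee-address; do
    case $options_json in
      *"\"$key\""*) ;;          # key name found
      *) echo "$key" ;;         # key name missing
    esac
  done
}

# Usage: an incomplete configuration reports the two missing keys.
missing_cai_options '{ "cipher-text-volume": "pvc-oss" }'
```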
`tng-http-secure-ports` example:
"tng-http-secure-ports": [
{
"port": 8080,
"allow-insecure-request-regexes": [
"/api/builtin/.*"
]
}
]
- port: The HTTP service port that TNG encrypts.
- allow-insecure-request-regexes: An array of path regex patterns. Requests whose paths match any pattern bypass TNG encryption.
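The effect of allow-insecure-request-regexes can be illustrated with a small matcher. This is only a sketch: it assumes each pattern must match the whole request path and uses POSIX extended regular expressions through grep, whereas the source does not specify TNG's regex dialect or anchoring. The function name `bypasses_encryption` is hypothetical.

```shell
# Succeed if the request path (first argument) fully matches any of the
# patterns given as the remaining arguments.
bypasses_encryption() {
  request_path=$1
  shift
  for pattern in "$@"; do
    if printf '%s\n' "$request_path" | grep -Eq "^(${pattern})\$"; then
      return 0
    fi
  done
  return 1
}

# Usage: with the rule from the example above, a built-in API path would
# bypass encryption while an inference path would not.
bypasses_encryption /api/builtin/health '/api/builtin/.*' && echo "bypasses TNG encryption"
bypasses_encryption /v1/completions '/api/builtin/.*' || echo "encrypted by TNG"
```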
Encrypted model sample files
The following pre-encrypted models are available for testing in a public-read OSS bucket. Gocryptfs-encrypted models use the encryption password 0Bn4Q1wwY9fN3P.