AI applications often run on GPU nodes and process sensitive data. Their unique model file formats, such as pickle, and complex software supply chains make them targets for data theft, computing power abuse, and arbitrary code execution. You can build a defense-in-depth system for your AI applications by ensuring trusted sources during the build phase, implementing the principle of least privilege during deployment, and continuously monitoring applications at runtime.
Security risks and attack paths
Introduction to security risks
Compared to traditional applications, AI applications present unique security challenges. The key risks include the following:
Risks to core assets: Data and models
Data breaches and poisoning: AI applications are often granted access to highly sensitive training data. If a container or pod is compromised, an attacker can directly read credentials from mounted volumes or environment variables, leading to a sensitive data breach. Attackers can also poison the data by tampering with the training dataset, which compromises model integrity and business reliability.
Model file execution risks: Serialization formats such as pickle can execute arbitrary Python code during deserialization. If an AI application loads a model file from an untrusted source, or a file that is unverified or has been tampered with, it can trigger remote code execution (RCE), giving the attacker full control inside the container.
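The deserialization risk can be demonstrated in a few lines of standard-library Python. The attacker_callback function below is a harmless stand-in for a real payload such as a reverse shell or a downloader:

```python
import pickle

executed = []  # marker to show the payload ran; a real payload would run a shell command

def attacker_callback(msg):
    # Stand-in for malicious code. The victim never calls this directly.
    executed.append(msg)
    return msg

class MaliciousPayload:
    # __reduce__ lets a pickled object specify an arbitrary call to perform
    # at load time: here, attacker_callback("pwned").
    def __reduce__(self):
        return (attacker_callback, ("pwned",))

# The attacker ships this as a "model file".
tainted_model = pickle.dumps(MaliciousPayload())

# The victim merely deserializes the file -- the attacker's code runs.
result = pickle.loads(tainted_model)
```

No method on the loaded object needs to be invoked; the call happens inside pickle.loads itself, which is why scanning a pickle file's contents after loading it is already too late.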
Risk of computing resource abuse
AI applications are often deployed on high-performance GPU nodes, such as NVIDIA A100 and H100. Their high computing density and long idle periods make them attractive targets for mining. Malicious programs can use GPU acceleration for hashing operations. This leads to resource abuse and high costs. Because AI training itself has high GPU utilization, it is difficult to detect this abnormal behavior using standard business metrics.
Infrastructure and supply chain risks
The combination of Kubernetes and machine learning toolchains, such as data pipelines and model registries, expands the attack surface. In addition, AI applications rely on public base images from Docker Hub, pre-trained models from Hugging Face, or third-party libraries. These dependencies may contain unpatched vulnerabilities or malicious code, creating typical entry points for software supply chain attacks.
Typical attack paths
Understanding common attack methods helps you build a targeted defense system. The following are common attack paths in AI clusters.
API operation attacks: An attacker can call a public inference API to launch a prompt injection attack. This can trick the model into leaking training data snippets or system prompts. The attacker can also send malformed requests, such as oversized images or long texts, to exhaust GPU memory and host memory. This causes a denial of service (DoS) and impacts business availability.
Container image supply chain attacks: Attackers can publish malicious images to public image repositories, disguised as popular frameworks such as PyTorch or TensorFlow. They can also upload pre-trained models with back doors to model platforms such as Hugging Face. When a developer pulls and deploys these assets, the malicious code executes automatically at container startup. This establishes a persistent foothold for the attacker.
Kubernetes misconfigurations: Attackers exploit insecure container configurations to achieve a container escape. For example, running a container as the root user, mounting the containerd socket (such as /run/containerd/containerd.sock), or enabling privileged mode (privileged: true) allows an attacker to break out of the container, gain root access to the host, and take control of all nodes in the cluster.
Lateral movement in the cluster: After compromising a pod, an attacker can read the default mounted ServiceAccount token located at /var/run/secrets/kubernetes.io/serviceaccount/token. If this token has high privileges, such as cluster-admin, the attacker can call the Kubernetes API to enumerate Secrets, ConfigMaps, and pods, access internal services, and probe the cluster network topology, ultimately leading to a full cluster takeover.
Security hardening best practices
To address the risks and attack paths mentioned, follow the principle of defense in depth. Implement the following security practices in stages throughout the build, deployment, and runtime lifecycle of your AI applications.
Phase 1: Harden the software supply chain
Control the security of images and model files at the source to prevent malicious code from entering the production environment.
Comprehensive image scanning
Integrate ACR's Container Image Security Scan into your CI/CD pipeline. This feature supports the Trivy scan engine and the Security Center scan engine. It covers system vulnerabilities, application vulnerabilities, baseline checks, and malicious samples. Configure blocking policies to ensure that published images meet baseline security requirements.
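As an illustration, a pipeline gate built on the open source Trivy CLI (one of the supported scan engines) might look like the following sketch. The job name, registry address, and image name are placeholders, and the step syntax should be adapted to your CI system:

```yaml
# Hypothetical CI job: block the release if the image has HIGH or CRITICAL findings
scan-image:
  steps:
    - name: Build image
      run: docker build -t registry.example.com/ai/model-server:${CI_COMMIT_SHA} .
    - name: Scan with Trivy and fail on severe findings
      run: |
        trivy image \
          --severity HIGH,CRITICAL \
          --exit-code 1 \
          registry.example.com/ai/model-server:${CI_COMMIT_SHA}
    - name: Push only if the scan passed
      run: docker push registry.example.com/ai/model-server:${CI_COMMIT_SHA}
```

Because the scan step exits non-zero on severe findings, the push step never runs for a vulnerable image, which is the blocking behavior the baseline policy requires.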
Standardize model formats and signature verification
To eliminate the risk of arbitrary code execution from model files, standardize the model delivery format and establish a trusted distribution mechanism.
Secure model formats: In production environments, avoid using the pickle format. Instead, use formats that do not have code execution capabilities, such as safetensors and ONNX. This reduces the attack surface for deserialization exploits.
Artifact signature verification: Digitally sign delivered artifacts and enforce signature verification during deployment to ensure end-to-end integrity.
For container images, use ACR's container image signing to verify image integrity from build to runtime.
For general AI artifacts that follow the OCI specification, such as model files, use Notation and Ratify to sign and verify OCI artifacts. This process automatically blocks any unsigned or invalidly signed artifacts, effectively preventing man-in-the-middle tampering.
Use minimal base images
Use distroless base images. These images contain only the dependencies required for the application to run. They remove non-essential components such as shells and package managers to reduce the attack surface and limit an attacker's ability to execute commands or move laterally within the container.
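A minimal sketch of a multi-stage build targeting a distroless runtime is shown below. The application file name and dependency layout are placeholders; only the build stage contains pip and a shell, while the final image ships neither:

```dockerfile
# Build stage: full image with pip and build tools available
FROM python:3.11-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/app/deps -r requirements.txt
COPY . .

# Runtime stage: distroless image with no shell or package manager
FROM gcr.io/distroless/python3-debian12
WORKDIR /app
COPY --from=build /app /app
ENV PYTHONPATH=/app/deps
CMD ["inference_server.py"]
```

Even if an attacker achieves code execution in this container, there is no /bin/sh to spawn and no package manager to pull tooling with, which sharply limits post-exploitation options.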
Phase 2: Strengthen the Kubernetes runtime
Use the principle of least privilege and strong isolation mechanisms to limit what an attacker can do after a container is compromised. This helps block container escape and lateral movement paths.
Configure the pod security context (securityContext)
ACK supports enforcing built-in Kubernetes Pod Security Standards, such as the restricted policy. Configure the securityContext in the pod definition to systematically disable high-risk capabilities and make container escapes more difficult.

| Configuration item | Recommended configuration | Description |
| --- | --- | --- |
| Run user | runAsUser: 1001, runAsNonRoot: true | Prevents the container from running as root, reducing the risk of escape. |
| File system | readOnlyRootFilesystem: true | Sets the root file system to read-only to prevent attackers from planting malicious files or modifying configurations. |
| Kernel capabilities | capabilities.drop: ["ALL"] | Removes all unnecessary Linux capabilities. Grant required capabilities explicitly using add to reduce the privileged attack surface. |

Implement Role-Based Access Control (RBAC)
Identity authentication: Combine Kubernetes RBAC with Alibaba Cloud RAM and STS. This lets you bind temporary, fine-grained cloud resource access permissions to pods. Avoid using long-term AccessKeys to reduce the risk of losing control over cloud resources due to credential leaks.
Restrict ServiceAccounts: Create a dedicated ServiceAccount for each AI application and set automountServiceAccountToken: false. To call the Kubernetes API, explicitly mount the token using volumeMounts and limit its scope to a specific namespace. This reduces the credential exposure surface.
Policy administration: Deploy the Gatekeeper admission controller and combine it with the OPA policy library. This lets you intercept non-compliant configurations in real time when resources are created, such as privileged containers, HostPath mounts, and writable root file systems. This achieves automated and strongly consistent security policies.
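The securityContext and ServiceAccount recommendations above can be combined in a single manifest sketch. The namespace, names, and image below are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-sa
  namespace: ai-apps
automountServiceAccountToken: false   # no API token in the pod unless explicitly mounted
---
apiVersion: v1
kind: Pod
metadata:
  name: inference
  namespace: ai-apps
spec:
  serviceAccountName: inference-sa
  containers:
  - name: model-server
    image: registry.example.com/ai/model-server:v1
    securityContext:
      runAsUser: 1001
      runAsNonRoot: true
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

With this configuration, a compromised model-server process has no root privileges, no writable root file system, and no ServiceAccount token to steal from /var/run/secrets.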
Enable sandboxed containers and network isolation
Sandbox isolation: For tasks that run untrusted third-party models or high-risk code, use sandboxed containers. This solution provides an independent kernel and hardware-level isolation through a lightweight virtual machine. This blocks container escape paths and is suitable for scenarios that require a high degree of isolation.
Network policy: Configure a NetworkPolicy to implement a default-deny policy and explicitly allow only the necessary communication between pods for your application. This limits an attacker's ability to move laterally within the cluster.
Service mesh: Enable Service Mesh (ASM) and use its sidecar proxies to implement mTLS encrypted communication and authentication between services. This blocks lateral probing on the private network.
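A default-deny policy plus one explicit allow rule might look like the following sketch. The namespace, labels, and port are placeholders for your own application topology:

```yaml
# Deny all ingress and egress for every pod in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ai-apps
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# Explicitly allow only the gateway pods to reach the inference pods on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-inference
  namespace: ai-apps
spec:
  podSelector:
    matchLabels:
      app: inference
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: gateway
    ports:
    - protocol: TCP
      port: 8080
```

Because NetworkPolicies are additive, every further communication path must be granted by its own explicit rule, which keeps the lateral-movement surface enumerable and auditable.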
Phase 3: Implement monitoring and auditing
Establish end-to-end monitoring and auditing mechanisms to ensure that security events are discoverable and traceable.
Runtime behavior monitoring
ACK integrates with Security Center to provide real-time container protection. It automatically detects and generates alerts for abnormal behavior within containers, including the following:
Malicious process startup: Detects the execution of malicious processes or high-risk commands, such as reverse shells, webshells, mining programs, and ransomware.
Suspicious network connections: Detects container connections to miner pool ports or non-business-related public IP addresses.
Credential theft: Detects unexpected read access to sensitive credential files, such as ServiceAccount tokens in the /var/run/secrets/ folder.
End-to-end log auditing
ACK supports API Server auditing, which delivers all API operation records to Simple Log Service (SLS). You can aggregate and analyze these logs to trace all API requests. Pay close attention to the following high-risk operations:
Read requests for Secrets and ConfigMaps.
kubectl exec commands to enter a container.
Abnormal RBAC permission changes.
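On ACK, the API server audit configuration is managed for you; on a cluster where you control it, an audit policy covering these three high-risk operations might look like the following sketch (audit.k8s.io/v1 API):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record every access to Secrets and ConfigMaps.
# Metadata level avoids writing secret values into the audit log itself.
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
# Record exec and attach requests into containers in full detail.
- level: RequestResponse
  resources:
  - group: ""
    resources: ["pods/exec", "pods/attach"]
# Record all changes to RBAC objects.
- level: RequestResponse
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "rbac.authorization.k8s.io"
    resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
```

The same three rule groups translate directly into SLS queries: filter delivered audit events by resource (secrets, configmaps), by subresource (exec, attach), and by the RBAC API group to surface the operations listed above.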