Configure pod security and prevent processes from escaping container boundaries and gaining permissions - Container Service for Kubernetes

In Container Service for Kubernetes, you can prevent processes in containers from escaping their isolation boundaries and accessing the host. This involves restricting containers from running in privileged mode, limiting application processes from running as the root user, and disabling the automatic mounting of Service Account tokens. You can configure pod security policies to harden your cluster's security and prevent it from becoming a target for attackers.

Prevent processes from escaping container boundaries and gaining permissions

If you are a developer or an O&M engineer who uses Kubernetes, it is crucial to prevent processes in containers from escaping their isolation boundaries and accessing the host. This is important for two main reasons:

First, processes inside a container run in the context of the Linux root user by default. Although the operations of the root user in the container are partially restricted by the Linux capabilities that Docker assigns to the container, these default permissions might allow an attacker to escalate privileges or access sensitive information on the host. This includes sensitive resources such as Secrets and ConfigMap objects. The following is a list of the default capabilities assigned to Docker containers. For more information, see capabilities(7) — Linux manual page.
cap_chown, cap_dac_override, cap_fowner, cap_fsetid, cap_kill, cap_setgid, cap_setuid, cap_setpcap, cap_net_bind_service, cap_net_raw, cap_sys_chroot, cap_mknod, cap_audit_write, cap_setfcap.
Avoid running privileged pods (securityContext.privileged: true) whenever possible, because a privileged container holds all the Linux capabilities of the root user on the host.
Second, all Kubernetes worker nodes use an authorization mode called the node authorizer. The node authorizer authorizes API requests that originate from the Kubelet and allows a node to perform the following operations:
Read operations:
- Services
- Endpoints
- Nodes
- Pods
- Secrets, ConfigMaps, PersistentVolumes (PVs), and PersistentVolumeClaims (PVCs) related to pods that are bound to the Kubelet node
Write operations:
- Nodes and node status (Enable the NodeRestriction admission plugin to restrict a Kubelet to modifying only its own node object)
- Pods and pod status (Enable the NodeRestriction admission plugin to restrict a Kubelet to modifying only pods that are bound to itself)
- Events
Auth-related operations:
- Read/write access to the CertificateSigningRequest (CSR) API for TLS bootstrapping
- Ability to create TokenReview and SubjectAccessReview for delegated identity authentication and authorization checks

ACK clusters use the Node Restriction admission controller by default. This controller only allows a node to modify a limited set of node properties and pod objects that are bound to it. However, an attacker who manages to access the host can still retrieve sensitive information from the environment through the Kubernetes API. For more information, see Node Restriction admission controller.

Pod security configuration recommendations

Restrict containers from running in privileged mode
As mentioned earlier, containers that run with privileged status inherit all Linux capabilities assigned to the root user on the host. In most scenarios, containers do not need these permissions to function correctly. You can create a pod security policy to deny pods that are configured to run in privileged mode. A pod security policy is a set of security constraints that a pod must meet before it can be created. ACK provides container security policy capabilities based on Open Policy Agent (OPA) and Gatekeeper. These capabilities validate requests to create and update pods on the cluster against user-configured security rules. If a request to create or update a pod does not comply with the defined rules, the system rejects the request and returns an error. For example, you can deploy the ACKPSPPrivilegedContainer policy to block privileged containers from being deployed in specified namespaces of the cluster.
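The following is a minimal sketch of what such a policy instance could look like. It follows the general Gatekeeper constraint schema (spec.match.kinds and spec.match.namespaces). The instance name and namespace are hypothetical, and the sketch assumes that the ACKPSPPrivilegedContainer template needs no additional parameters. Check the policy template installed in your cluster before you rely on it.
```
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: ACKPSPPrivilegedContainer      # constraint kind named in this topic
metadata:
  name: deny-privileged-containers   # hypothetical instance name
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]                 # validate pod create and update requests
    namespaces:
    - example-namespace              # hypothetical namespace to protect
```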
Restrict application processes from running as the root user
By default, containers run as the root user. This can be a security problem if an attacker exploits a vulnerability in the application and obtains shell access to a running container. You can mitigate this risk in several ways. One way is to remove the shell from the container image. Another way is to add the USER instruction to your Dockerfile or to run the containers in the pod as a non-root user. The Kubernetes podSpec provides the runAsUser and runAsGroup fields under spec.securityContext (and under each container's securityContext). These fields specify the user and group that run the application. You can create an ACKPSPAllowedUsers policy to enforce this restriction.
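As an illustration, a pod that runs its containers as a non-root user might look like the following sketch. The pod name, image, UID, and GID are placeholders. The runAsNonRoot field is not mentioned above and is added here as an extra safeguard: with it set, a container that would start as UID 0 is refused.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-non-root                      # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true                    # assumption: extra guard, refuses containers that would run as UID 0
    runAsUser: 1000                       # placeholder UID
    runAsGroup: 3000                      # placeholder GID
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
```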
Forbid running containers in Docker-in-Docker mode or mounting Docker.sock in a container
Using nested containers or mounting Docker.sock makes it easy to build or run container images in a Docker container. However, this gives processes running inside the container control over the node. For more information about building container images on Kubernetes, see Build an image using an Enterprise Edition instance, Kaniko, and img.
Restrict the use of HostPath, and if you must use it, only mount directories with a specified prefix and configure the volume as read-only
You can use HostPath to directly mount a host directory into a container. This feature is required in only a few business scenarios. If your business requires this feature, you must understand the associated risks. By default, a pod running as root has write access to the file system exposed by HostPath. This access could allow an attacker to modify Kubelet settings, create symbolic links to directories or files not directly exposed through HostPath (such as /etc/shadow), install SSH keys, read secrets mounted on the host, and perform other malicious actions. To mitigate the risks of using HostPath, set readOnly: true on the corresponding entry in spec.containers[].volumeMounts. For example:
```
volumeMounts:
- name: hostpath-volume   # volume names must be lowercase DNS labels
  readOnly: true
  mountPath: /host-path
```
Similarly, you can deploy an ACKPSPHostFilesystem policy instance to limit the range of host directories that pods deployed in specified namespaces of the cluster are allowed to mount.
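For context, a read-only HostPath mount needs a matching entry under spec.volumes. The following sketch shows one complete pod. The pod name, image, and host directory are placeholders, not values from ACK.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-hostpath-readonly             # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    volumeMounts:
    - name: hostpath-volume
      mountPath: /host-path
      readOnly: true                      # the container cannot write to the host directory
  volumes:
  - name: hostpath-volume
    hostPath:
      path: /var/log/app                  # placeholder host directory
      type: Directory                     # the directory must already exist on the node
```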
Set requests and resource limits for each container to avoid resource contention or DoS attacks
A pod with no requests or resource limits can theoretically consume all available resources on a host. If additional pods are then scheduled to the same node, the node might run short of CPU or memory. This can cause the Kubelet to crash or evict other pods from the node. Although this cannot be completely avoided, setting requests and resource limits helps minimize resource contention. It also reduces the risk of excessive resource consumption from poorly written applications.
The podSpec lets you limit CPU and memory usage. You can enforce requests and limits by setting a Resource Quota or creating a Limit Range on a namespace. A resource quota lets you specify the total amount of resources, such as CPU and RAM, allocated to a namespace. When a quota for CPU or memory is applied to a namespace, it forces you to specify requests and limits for those resources on all containers deployed in that namespace. In contrast, a limit range provides more fine-grained control over resource allocation. With a limit range, you can set minimum and maximum CPU and memory resources for each pod or container within a namespace. You can also set default request and limit values that apply if no values are provided. For more information, see Managing Resources for Containers.
Similarly, you can deploy an ACKContainerLimits policy instance to require that application pods deployed in specified namespaces of the cluster must have resource limits configured.
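The three mechanisms described above can be sketched as follows: per-container requests and limits in the podSpec, a ResourceQuota that caps the namespace total, and a LimitRange that sets per-container bounds and defaults. All names, the namespace, the image, and the quantities are illustrative placeholders.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-limits                   # hypothetical name
  namespace: example-namespace            # hypothetical namespace
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    resources:
      requests:                           # amount the scheduler reserves for the container
        cpu: 250m
        memory: 128Mi
      limits:                             # hard ceiling enforced at runtime
        cpu: 500m
        memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: example-namespace
spec:
  hard:                                   # totals across all pods in the namespace
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limit-range
  namespace: example-namespace
spec:
  limits:
  - type: Container
    min:                                  # smallest values a container may set
      cpu: 100m
      memory: 64Mi
    max:                                  # largest values a container may set
      cpu: "1"
      memory: 1Gi
    defaultRequest:                       # applied when a container omits requests
      cpu: 250m
      memory: 128Mi
    default:                              # applied when a container omits limits
      cpu: 500m
      memory: 256Mi
```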
Disable privilege escalation configurations
Privilege escalation allows a process to change the security context in which it runs. Examples include sudo and binary files with the SUID or SGID bit set, which let a user execute a file with the permissions of another user or group. You can prevent a container from escalating privileges by using a pod security policy that sets allowPrivilegeEscalation to false, or by setting securityContext.allowPrivilegeEscalation to false for each container in the podSpec.
Similarly, you can deploy an ACKPSPAllowPrivilegeEscalationContainer policy instance to require that pods deployed in specified namespaces of the cluster have the allowPrivilegeEscalation parameter configured.
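A minimal sketch of the podSpec setting is shown below. The field is set per container. The pod name and image are placeholders.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-privilege-escalation       # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:
      allowPrivilegeEscalation: false     # processes cannot gain more privileges than their parent
```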
Disable automatic mounting of Service Account tokens
For pods that do not need to access the Kubernetes API, you can disable the automatic mounting of ServiceAccount tokens on the PodSpec. You can also disable it for all pods that use a specific ServiceAccount.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-automount
spec:
  automountServiceAccountToken: false
...
```
Disabling automatic mounting of the ServiceAccount token does not block the pod's network access to the Kubernetes API. To block that access, you can modify the access method of the ACK cluster Endpoint and use a network policy to block the pod's network access. For more information, see Use network policies in an ACK cluster.
To disable automatic mounting for all pods that use a specific ServiceAccount, set the field on the ServiceAccount object:
```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-no-automount
automountServiceAccountToken: false
...
```
Similarly, you can deploy an ACKBlockAutomountToken policy instance to require that the automountServiceAccountToken: false field is set in application pods to prevent the automatic mounting of a Service Account token.
Disable service discovery
For pods that do not need to discover or call cluster services, you can limit the information provided to the pod. You can set the pod's DNS policy to not use CoreDNS and prevent Services in the namespace from being exposed as environment variables in the pod. For more information, see Environment variables.
The default value for a pod's DNS policy is ClusterFirst, which uses the in-cluster DNS. The value Default is not the default setting and configures the pod to use the DNS resolution of the underlying node. For more information, see Kubernetes docs on Pod DNS policy.
Disabling service links and changing the pod's DNS policy does not block the pod's network access to the in-cluster DNS service. An attacker can still enumerate services in the cluster by accessing the in-cluster DNS service (for example: dig SRV *.*.svc.cluster.local @$CLUSTER_DNS_IP). To block in-cluster service discovery, see Use network policies in an ACK cluster.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-service-info
spec:
  dnsPolicy: Default # The value "Default" is not the actual default.
  enableServiceLinks: false
...
```
Configure containers to use a read-only root file system
You can configure your containers to use a read-only root file system to prevent an attacker from overwriting files that your application uses. If your application must write to the file system, it can write to a temporary directory or to an additional mounted volume. You can enforce this by setting the container's securityContext as follows:
```
...
securityContext:
  readOnlyRootFilesystem: true
...
```
Similarly, you can deploy an ACKPSPReadOnlyRootFilesystem policy instance to restrict pods deployed in specified namespaces of the cluster to use only a read-only root file system.
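The following sketch combines a read-only root file system with a writable temporary directory, as described above. It assumes that the application only needs to write under /tmp. The pod name, image, and volume name are placeholders.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-readonly-rootfs               # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:
      readOnlyRootFilesystem: true        # the container's root file system is mounted read-only
    volumeMounts:
    - name: tmp
      mountPath: /tmp                     # assumption: the only path the application writes to
  volumes:
  - name: tmp
    emptyDir: {}                          # scratch space that is removed together with the pod
```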