This topic describes how to enforce pod security policies to protect your Container Service for Kubernetes (ACK) clusters.

Prevent container escapes that allow attackers to escalate privileges

Kubernetes developers and O&M administrators must prevent container escapes, which allow attackers to escalate privileges and take control of the host. By default, processes within a container run as the Linux root user. The Linux capabilities that Docker assigns to the container limit the operations that this root user can perform. However, an attacker can still exploit the default capabilities to escalate privileges or access sensitive information on the host, such as Secrets and ConfigMaps. The following list shows the default capabilities that are assigned to a Docker container. For more information, see capabilities(7) - Linux manual page.

cap_chown, cap_dac_override, cap_fowner, cap_fsetid, cap_kill, cap_setgid, cap_setuid, cap_setpcap, cap_net_bind_service, cap_net_raw, cap_sys_chroot, cap_mknod, cap_audit_write, cap_setfcap.

To prevent container escapes, you must avoid running Docker containers with the privileged flag because a privileged container is assigned all Linux capabilities of the root user.
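
Beyond avoiding the privileged flag, the default capability set can be narrowed for each container. The following manifest is a minimal sketch, not a definitive configuration: the pod name, container name, and image are hypothetical placeholders, and NET_BIND_SERVICE is added back only to illustrate how a single required capability can be retained.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: drop-capabilities-demo              # hypothetical name
spec:
  containers:
  - name: app                               # hypothetical container name
    image: registry.example.com/app:1.0     # placeholder image
    securityContext:
      privileged: false                     # do not grant all capabilities
      capabilities:
        drop:
        - ALL                               # remove every default capability
        add:
        - NET_BIND_SERVICE                  # assumption: the app binds to a port below 1024
```

If the application does not need any capability, the add list can be omitted so that the container runs with an empty capability set.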

All Kubernetes worker nodes use the node authorizer, which is a special-purpose authorization mode. The node authorizer is used to authorize all API requests that are sent by a kubelet. The node authorizer also allows a node to perform the following operations:

Read operations:
  • Services
  • Endpoints
  • Nodes
  • Pods
  • Secrets, ConfigMaps, persistent volumes (PVs), and persistent volume claims (PVCs) of pods that are deployed on the node where the kubelet runs.
Write operations:
  • Nodes and node status. Enable the NodeRestriction admission plug-in to restrict the kubelet to modifying only the node where it runs.
  • Pods and pod status. Enable the NodeRestriction admission plug-in to restrict the kubelet to modifying only the pods that are bound to the node where it runs.
  • Events
Authentication-related operations:
  • Read and write permissions on the CertificateSigningRequest (CSR) API for Transport Layer Security (TLS) bootstrapping.
  • Permissions to create TokenReview and SubjectAccessReview objects for delegated authentication and authorization checks.

By default, ACK clusters use the NodeRestriction admission controller. The NodeRestriction admission controller allows a kubelet to modify only the Node API object and Pod API objects that are bound to the node. However, the admission controller cannot prevent attackers from collecting sensitive information from the Kubernetes API. For more information, see NodeRestriction.

PodSecurityPolicy (PSP) is deprecated in Kubernetes 1.21 and is removed in Kubernetes 1.25. If you use the PSP feature, we recommend that you find an alternative before PSP is removed. The Kubernetes community is developing a built-in admission controller to replace PSP. In later versions, ACK will provide policy governance solutions that use the Open Policy Agent (OPA) to replace the PSP feature.

Suggestions on pod security

  • Limit the containers that can run in privileged mode

    Privileged containers inherit all Linux capabilities of the root user on the host. In most scenarios, containers do not need these capabilities to handle workloads. You can create a pod security policy to forbid pods from running in privileged mode. A pod security policy is a group of constraints that a pod must meet before the pod can be created. The PodSecurityPolicy admission controller of Kubernetes validates requests to create and update pods in your cluster against the rules that you configure. If a request to create or update a pod does not meet the rules, the request is rejected and an error is returned.

    By default, the PodSecurityPolicy admission controller is enabled for ACK clusters and a pod security policy named ack.privileged is created. The default pod security policy allows all types of pods. We recommend that you manage pod security policies by namespace based on the principle of least privilege. For example, privileged pods cannot be provisioned in a specified namespace, only read-only file systems can be used, or only host paths within a specified range can be mounted. The following code block shows an example of a pod security policy:

    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
        name: restricted
        annotations:
            seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
            apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
            seccomp.security.alpha.kubernetes.io/defaultProfileName:  'runtime/default'
            apparmor.security.beta.kubernetes.io/defaultProfileName:  'runtime/default'
    spec:
        privileged: false
        # Prevent escalations to root. 
        allowPrivilegeEscalation: false
        requiredDropCapabilities:
        - ALL
        # Allow core volume types. 
        volumes:
        - 'configMap'
        - 'emptyDir'
        - 'projected'
        - 'secret'
        - 'downwardAPI'
        # Assume that the PVs created by the cluster administrator are safe to use. 
        - 'persistentVolumeClaim'
        hostNetwork: false
        hostIPC: false
        hostPID: false
        runAsUser:
            # Require pods to run without root privileges. 
            rule: 'MustRunAsNonRoot'
        seLinux:
            # Assume that all nodes use AppArmor instead of SELinux. 
            rule: 'RunAsAny'
        supplementalGroups:
            rule: 'MustRunAs'
            ranges:
            # Forbid users to add the root group. 
            - min: 1
              max: 65535
        fsGroup:
            rule: 'MustRunAs'
            ranges:
            # Forbid users to add the root group. 
            - min: 1
              max: 65535
        readOnlyRootFilesystem: false

    The preceding policy prevents privileged pods and escalations to root. It also limits the types of volumes that can be mounted and forbids users from adding the root group. For more information about how to enhance pod security by using pod security policies, see Pod Security Policies.
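
    A pod security policy takes effect for a pod only if the user or ServiceAccount that creates the pod is authorized to use the policy. The following RBAC manifests are a minimal sketch of managing the restricted policy by namespace. The namespace team-a and the Role and RoleBinding names are hypothetical.

    ```yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: psp-restricted-user        # hypothetical name
      namespace: team-a                # hypothetical namespace
    rules:
    - apiGroups: ['policy']
      resources: ['podsecuritypolicies']
      verbs: ['use']
      resourceNames: ['restricted']    # the policy defined in the preceding example
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: psp-restricted-binding     # hypothetical name
      namespace: team-a
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: psp-restricted-user
    subjects:
    # All ServiceAccounts in the team-a namespace.
    - kind: Group
      apiGroup: rbac.authorization.k8s.io
      name: system:serviceaccounts:team-a
    ```

    If the same subjects are also authorized to use a more permissive policy, that policy can still admit pods that the restricted policy would reject. Review the existing bindings before you rely on this sketch.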

  • Run pods as a non-root user

    By default, containers in a pod run as the root user. Attackers can exploit vulnerabilities in applications and then gain access to the shell of a pod, which poses risks to pod security. You can mitigate the risks in multiple ways. You can remove the shell from the container image. You can also add the USER instruction to the Dockerfile or run the containers as a non-root user. The spec.securityContext attribute in the podSpec contains the runAsUser and runAsGroup fields. The two fields specify the user and group under which the containers run. You can create a pod security policy to require that these fields are set. For more information, see Users and groups.
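
    The runAsUser and runAsGroup fields described above can be set as in the following sketch. The pod name, image, and the IDs 1000 and 3000 are illustrative assumptions. The runAsNonRoot field is an additional safeguard that is not mentioned above: it makes the kubelet refuse to start a container that would run as UID 0.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: non-root-demo                     # hypothetical name
    spec:
      securityContext:
        runAsNonRoot: true                    # reject containers that would run as root
        runAsUser: 1000                       # illustrative non-root UID
        runAsGroup: 3000                      # illustrative non-root GID
      containers:
      - name: app                             # hypothetical container name
        image: registry.example.com/app:1.0   # placeholder image
    ```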

  • Forbid users from running Docker-in-Docker containers or mounting docker.sock to containers

    You can build or deploy container images inside a Docker container by using the Docker-in-Docker method or by mounting docker.sock. However, both methods give the process that runs inside the container access to the node. For more information about building container images on Kubernetes, see Use Container Registry Enterprise Edition to build images, Kaniko, and img.

  • Limit the use of hostPath volumes, or allow only hostPath volumes that are set to read-only and mount paths that start with specified prefixes
    A hostPath volume mounts a path from the host to a pod. In most cases, pods do not require hostPath volumes. Make sure that you understand the risks if you want to use hostPath volumes. By default, pods that run as root have write access to the file systems that are exposed by using hostPath volumes. Attackers can use this access to modify the kubelet settings or create symbolic links to paths or files that are not directly exposed by hostPath volumes. For example, attackers can access /etc/shadow, install SSH keys, read Secrets that are mounted to the host, and perform other malicious activities. To mitigate the risks that arise from hostPath volumes, set readOnly to true in spec.containers.volumeMounts. Example:
    volumeMounts:
    # Volume names must be lowercase DNS labels.
    - name: hostpath-volume
      readOnly: true
      mountPath: /host-path
    You can also use a pod security policy to limit the paths that can be mounted by using hostPath volumes. For example, the following pod security policy specifies that only paths that start with /foo on the host can be mounted.
    allowedHostPaths:
    # This allows "/foo", "/foo/", "/foo/bar" etc., but
    # disallows "/fool", "/etc/foo" etc.
    # "/foo/../" is never valid.
    - pathPrefix: "/foo"
      readOnly: true # only allow read-only mounts
  • Set resource requests and limits for each container to avoid resource competition or prevent DoS attacks

    A pod without resource requests or limits can consume all of the resources on a host. If additional pods are scheduled to the node, the CPU or memory resources of the node may be exhausted. As a result, the kubelet may crash or pods may be evicted from the node. You cannot completely prevent this issue. However, you can set resource requests and limits to minimize resource competition and reduce the risks from poorly written applications that consume excessive resources.

    You can specify requests and limits for CPU and memory resources in the podSpec. You can set a resource quota or limit range on a namespace to force the use of requests and limits. A resource quota specifies the total amount of resources, such as CPU and memory resources, that are allocated to a namespace. After you apply a resource quota to a namespace, you must specify requests and limits for all containers that are deployed in the namespace. A limit range provides finer-grained control over the resources that are allocated. You can set limit ranges to specify the maximum and minimum amounts of CPU and memory resources that each pod or container in a namespace can use. You can also use limit ranges to set default request values or limit values for containers that do not specify them. For more information, see Managing Resources for Containers.
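
    The resource quota and limit range described above can be sketched as follows. The namespace team-a, the object names, and all resource amounts are illustrative assumptions that you would adjust to your workloads.

    ```yaml
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota          # hypothetical name
      namespace: team-a            # hypothetical namespace
    spec:
      hard:
        # Totals for all pods in the namespace.
        requests.cpu: "4"
        requests.memory: 8Gi
        limits.cpu: "8"
        limits.memory: 16Gi
    ---
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: container-limits       # hypothetical name
      namespace: team-a
    spec:
      limits:
      - type: Container
        # Applied when a container does not set its own limits.
        default:
          cpu: 500m
          memory: 512Mi
        # Applied when a container does not set its own requests.
        defaultRequest:
          cpu: 250m
          memory: 256Mi
        # Upper and lower bounds for each container.
        max:
          cpu: "1"
          memory: 1Gi
        min:
          cpu: 100m
          memory: 128Mi
    ```

    With both objects in place, containers that omit requests and limits receive the defaults from the limit range, so they still satisfy the resource quota.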

  • Forbid privilege escalation

    Privilege escalation allows a process to change the security context under which it runs. For example, sudo is a binary file with the SUID bit set. Such binaries allow a user to execute a file with the permissions of another user or user group. To prevent privilege escalation, you can use a pod security policy that has allowPrivilegeEscalation set to false, or set securityContext.allowPrivilegeEscalation to false in the podSpec.
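
    The podSpec setting can be sketched as follows. The allowPrivilegeEscalation field is set in the securityContext of each container. The pod name, container name, and image are hypothetical.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: no-privilege-escalation-demo      # hypothetical name
    spec:
      containers:
      - name: app                             # hypothetical container name
        image: registry.example.com/app:1.0   # placeholder image
        securityContext:
          # Processes cannot gain more privileges than their parent,
          # for example through SUID or SGID binaries.
          allowPrivilegeEscalation: false
    ```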

  • Disable automatic ServiceAccount token mounting
    For pods that do not need to access the Kubernetes API, you can disable automatic ServiceAccount token mounting in the PodSpec of specific pods, or disable this feature for all pods that use a specific ServiceAccount. The following example disables the feature for a single pod:
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-no-automount
    spec:
      automountServiceAccountToken: false
    After you disable automatic ServiceAccount token mounting for a pod, the pod can still reach the Kubernetes API over the network. To prevent a pod from accessing the Kubernetes API, you must regulate access control on the endpoint of the ACK cluster and configure network policies to block the pod. For more information, see Use network policies. The following example disables automatic token mounting for all pods that use a specific ServiceAccount:
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sa-no-automount
    automountServiceAccountToken: false
  • Disable service discovery

    You can reduce the amount of information provided to a pod if the pod does not need to look up and call cluster services. You can set the DNS policy of the pod to not use CoreDNS and disable the exposure of Services in the namespace of the pod as environment variables. For more information, see Environment variables.

    By default, the DNS policy of a pod is set to ClusterFirst, which makes the pod use the in-cluster DNS service. If the DNS policy is set to Default, the pod inherits the DNS resolution configuration from the underlying node. For more information, see Pod DNS policy.

    After you disable service links and change the DNS policy of a pod, the pod can still access the in-cluster DNS service. Attackers can enumerate Services in an ACK cluster by accessing the in-cluster DNS service. Example: dig SRV *.*.svc.cluster.local @$CLUSTER_DNS_IP. For more information about how to prevent service discovery within a cluster, see Use network policies.

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-no-service-info
    spec:
      dnsPolicy: Default # The value Default is not the default setting of a DNS policy.
      enableServiceLinks: false
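
    Because a pod with these settings can still reach the in-cluster DNS service, a network policy is needed to block that traffic. The following manifest is a minimal sketch that denies all outbound traffic from selected pods, which also blocks access to the in-cluster DNS service and the Kubernetes API. The namespace, policy name, and the access: isolated label are hypothetical, and the sketch assumes that the network plug-in of the cluster enforces network policies.

    ```yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: deny-egress-isolated   # hypothetical name
      namespace: team-a            # hypothetical namespace
    spec:
      # Applies to pods that carry this hypothetical label.
      podSelector:
        matchLabels:
          access: isolated
      # Egress is listed without any egress rules, so all outbound traffic is denied.
      policyTypes:
      - Egress
    ```

    This sketch is deliberately strict. If the selected pods need to reach specific destinations, add egress rules that allow only those destinations.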
  • Configure container images to use a read-only file system
    You can configure container images to use a read-only file system to prevent attackers from overwriting files in the file system that is used by your application. If your application must write data to the file system, you can configure the application to write to a temporary directory or mount a writable volume to the container. You can configure a read-only root file system by setting the following field in the securityContext of the container:
    ...
    securityContext:
      readOnlyRootFilesystem: true
    ...
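
    For an application that still needs a writable temporary directory, the fragment above can be combined with an emptyDir volume. The following manifest is a minimal sketch. The pod name, container name, image, volume name, and the /tmp mount path are assumptions.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: read-only-rootfs-demo             # hypothetical name
    spec:
      containers:
      - name: app                             # hypothetical container name
        image: registry.example.com/app:1.0   # placeholder image
        securityContext:
          readOnlyRootFilesystem: true        # the root file system cannot be modified
        volumeMounts:
        - name: tmp
          mountPath: /tmp                     # assumption: the app writes only to /tmp
      volumes:
      - name: tmp
        emptyDir: {}                          # writable scratch space, deleted with the pod
    ```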