This topic describes how to enforce pod security policies to protect your Container Service for Kubernetes (ACK) clusters.
Prevent container escapes that allow attackers to escalate privileges
Kubernetes developers and O&M administrators must focus on preventing container escapes that allow attackers to escalate privileges and take control of the host. Preventing container escapes is important for the following reasons. By default, processes within a container run in the context of the Linux root user. The operations that the root user can perform inside a container are limited by the Linux capabilities that Docker assigns to the container. However, an attacker can exploit the default capabilities to escalate privileges or access sensitive information on the host, such as ConfigMaps. The following code block shows the default capabilities that are assigned to a Docker container. For more information, see capabilities(7) - Linux manual page.
cap_chown, cap_dac_override, cap_fowner, cap_fsetid, cap_kill, cap_setgid, cap_setuid,
cap_setpcap, cap_net_bind_service, cap_net_raw, cap_sys_chroot, cap_mknod, cap_audit_write,
cap_setfcap
To prevent container escapes, you must avoid running Docker containers with the
privileged flag because a privileged container is assigned all Linux
capabilities of the root user.
All Kubernetes worker nodes use the node authorizer, which is a special-purpose authorization mode. The node authorizer is used to authorize all API requests that are sent by a kubelet. The node authorizer also allows a node to perform the following operations:
- Read operations on Secrets, ConfigMaps, persistent volumes (PVs), and persistent volume claims (PVCs) of pods that are deployed on the node where the kubelet runs.
- Write operations on nodes and node status. Enable the NodeRestriction admission plug-in to allow the kubelet to modify only the node where the kubelet runs.
- Write operations on pods and pod status. Enable the NodeRestriction admission plug-in to allow the kubelet to modify only the pods that are deployed on the node where the kubelet runs.
- Read and write permissions on the CertificateSigningRequest (CSR) API for Transport Layer Security (TLS) bootstrapping.
- The ability to create SubjectAccessReview objects for delegated authentication and authorization checks.
By default, ACK clusters use the NodeRestriction admission controller. The NodeRestriction admission controller allows a kubelet to modify only the Node API object and Pod API objects that are bound to the node. However, the admission controller cannot prevent attackers from collecting sensitive information from the Kubernetes API. For more information, see NodeRestriction.
PodSecurityPolicy (PSP) is deprecated in Kubernetes 1.21 and is scheduled for removal in Kubernetes 1.25. If you use PSP, we recommend that you find an alternative before the feature is removed. The Kubernetes community is developing a built-in admission controller to replace PSP. In later versions, ACK will provide policy governance solutions that use Open Policy Agent (OPA) to replace PSP.
Suggestions on pod security
- Limit the containers that can run in privileged mode
Privileged containers inherit all Linux capabilities of the root user on the same host. In most scenarios, containers do not need these capabilities to handle workloads. You can create a pod security policy to forbid pods from running in privileged mode. A pod security policy is a group of constraints that a pod must meet before the pod can be created. The PodSecurityPolicy admission controller of Kubernetes validates requests for creating and updating pods in your cluster based on the rules that you configured. If a request for creating or updating a pod does not meet the rules, the request is rejected and an error is returned.
By default, the PodSecurityPolicy admission controller is enabled for ACK clusters and a pod security policy named ack.privileged is created. The default pod security policy allows all types of pods. We recommend that you manage pod security policies by namespace based on the principle of least privilege. For example, you can forbid privileged pods in a specified namespace, require read-only file systems, or allow only host paths within a specified range to be mounted. The following code block shows an example of a pod security policy:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
  privileged: false
  # Prevent escalations to root.
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  # Allow core volume types.
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    # Assume that the PVs created by the cluster administrator are safe to use.
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    # Require pods to run without root privileges.
    rule: 'MustRunAsNonRoot'
  seLinux:
    # Assume that all nodes use AppArmor instead of SELinux.
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
      # Forbid users to add the root group.
      - min: 1
        max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      # Forbid users to add the root group.
      - min: 1
        max: 65535
  readOnlyRootFilesystem: false
The preceding policy can be used to prevent privileged pods or escalations to root. The policy can also be used to limit the types of volumes that can be mounted and forbid users to add the root group. For more information about how to enhance pod security by using pod security policies, see Pod Security Policies.
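A pod security policy applies to a pod only if the user or ServiceAccount that creates the pod is authorized to use the policy. The following sketch shows one way to scope the restricted policy to a single namespace by using RBAC. The names psp-restricted-user and psp-restricted-binding and the namespace demo-ns are hypothetical and are used only for illustration.

```yaml
# ClusterRole that grants permission to use the "restricted" policy.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted-user        # hypothetical name
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs: ['use']
    resourceNames: ['restricted']
---
# RoleBinding that limits the grant to one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-restricted-binding     # hypothetical name
  namespace: demo-ns               # hypothetical namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted-user
subjects:
  # All ServiceAccounts in the demo-ns namespace.
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:serviceaccounts:demo-ns
```

A binding like this restricts pods only if the same subjects are not also authorized to use a more permissive policy, because a pod is admitted when any policy that the subject can use allows it.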
- Run pods as a non-root user
By default, all pods run with root privileges. Attackers can exploit vulnerabilities in applications and then gain access to the shell of a pod. This poses risks to pod security. You can use multiple methods to mitigate the risks. You can delete the shell from the container image. You can also add the USER instruction to the Dockerfile or run the containers as a non-root user. The spec.securityContext attribute in the pod specification contains the runAsUser and runAsGroup fields. The two fields specify the users and groups for the containers that you want to run. You can create a pod security policy to enforce the use of these fields. For more information, see Users and groups.
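The following sketch shows what a pod that runs as a non-root user might look like. The pod name, the UID and GID values, and the image registry.example.com/demo/app:1.0 are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-non-root               # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true             # containers that would run as UID 0 are not started
    runAsUser: 1000                # hypothetical non-root UID
    runAsGroup: 3000               # hypothetical non-root GID
  containers:
    - name: app
      image: registry.example.com/demo/app:1.0   # hypothetical image
```

Fields set in spec.securityContext apply to all containers in the pod. A container can override them in its own securityContext.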
- Forbid users to run Docker-in-Docker containers or mount Docker.sock to containers
You can build or deploy container images inside a Docker container by using the Docker-in-Docker method or mounting Docker.sock. However, the process that runs inside the container gains access to the node. For more information about building container images on Kubernetes, see Use Container Registry Enterprise Edition to build images, Kaniko, and img.
- Limit the use of hostPath volumes, or allow only hostPath volumes that are set to read-only and mount paths that start with specified prefixes
A hostPath volume mounts a path from the host to a pod. In most cases, pods do not require hostPath volumes. Make sure that you understand the risks if you want to use hostPath volumes. By default, pods that run with root privileges have write permissions on file systems that are exposed by using hostPath volumes. Attackers can modify the kubelet settings and create symbolic links to paths or files that are not directly exposed by hostPath volumes. For example, attackers can access /etc/shadow, install SSH keys, read Secrets that are mounted to the host, and perform other malicious activities. To mitigate the risks that arise from hostPath volumes, set spec.containers.volumeMounts to read-only. Example:
volumeMounts:
  - name: hostpath-volume
    readOnly: true
    mountPath: /host-path
You can also use a pod security policy to limit the paths that can be mounted by using hostPath volumes. For example, the following pod security policy specifies that only paths that start with /foo on the host can be mounted.
allowedHostPaths:
  # This allows "/foo", "/foo/", "/foo/bar" etc., but
  # disallows "/fool", "/etc/foo" etc.
  # "/foo/../" is never valid.
  - pathPrefix: "/foo"
    readOnly: true # only allow read-only mounts
- Set resource requests and limits for each container to avoid resource competition or prevent DoS attacks
A pod without resource requests or limits can consume all of the resources on a host. If additional pods are scheduled to the node, the CPU or memory resources of the node may be exhausted. As a result, the kubelet may crash or pods may be evicted from the node. You cannot completely prevent this issue. However, you can set resource requests and limits to minimize resource competition and reduce the risks from improperly programmed applications that consume excessive resources.
You can specify requests and limits for CPU and memory resources in the podSpec. You can set a resource quota or limit range on a namespace to force the use of requests and limits. A resource quota specifies the total amount of resources that are allocated to a namespace, such as CPU and memory resources. After you apply a resource quota to a namespace, the resource quota forces you to specify requests and limits for all containers deployed in the namespace. A limit range can be used to enforce fine-grained control on the resources that are allocated. You can set limit ranges to specify the maximum and minimum amounts of CPU and memory resources that each pod or container in a namespace can use. You can also use limit ranges to set default request values or limit values for containers that do not specify them. For more information, see Managing Resources for Containers.
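As an illustration, the following sketch combines a resource quota and a limit range for one namespace. The object names, the namespace demo-ns, and all CPU and memory values are hypothetical. Choose values that match your workloads.

```yaml
# Total CPU and memory that all pods in the namespace can request or use.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota              # hypothetical name
  namespace: demo-ns               # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
# Per-container bounds and defaults in the same namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits           # hypothetical name
  namespace: demo-ns
spec:
  limits:
    - type: Container
      min:                         # smallest values that a container can specify
        cpu: 50m
        memory: 64Mi
      max:                         # largest values that a container can specify
        cpu: "2"
        memory: 2Gi
      defaultRequest:              # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container sets no limit
        cpu: 500m
        memory: 512Mi
```

With both objects in place, containers that omit requests or limits receive the defaults from the limit range and therefore still satisfy the quota.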
- Forbid privilege escalation
Privilege escalation allows a process to change the security context under which it runs. For example, sudo and binary files that have the SUID or SGID bit set can be used for privilege escalation. Privilege escalation is a method that a user can use to execute a file with the permissions of another user or user group. To prevent privilege escalation, you can use a pod security policy that sets allowPrivilegeEscalation to false.
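Privilege escalation can also be disabled for an individual container through its securityContext. The following is a minimal sketch. The pod name and the image registry.example.com/demo/app:1.0 are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-escalation          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/demo/app:1.0   # hypothetical image
      securityContext:
        # Processes in this container cannot gain more privileges than
        # their parent process, for example through SUID or SGID binaries.
        allowPrivilegeEscalation: false
```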
- Disable automatic ServiceAccount token mounting
For pods that do not need to access the Kubernetes API, you can disable automatic ServiceAccount token mounting in the PodSpec of specific pods, or disable this feature for all pods that use a specific ServiceAccount.
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-automount
spec:
  automountServiceAccountToken: false
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-no-automount
automountServiceAccountToken: false
After you disable automatic ServiceAccount token mounting for a pod, the pod can still reach the Kubernetes API over the network. To prevent a pod from accessing the Kubernetes API, you must regulate access control on the endpoint of the ACK cluster and configure network policies to block the pod. For more information, see Use network policies.
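As a rough sketch of the network policy approach, the following policy allows selected pods to send traffic to any destination except one address. It assumes that the network plug-in of the cluster enforces NetworkPolicy objects. The policy name, the namespace demo-ns, the label app: no-api-access, and the address 192.0.2.10 are hypothetical. Replace the address with the actual address that pods use to reach the API server of your cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-apiserver-egress      # hypothetical name
  namespace: demo-ns               # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: no-api-access           # hypothetical label on the pods to restrict
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 192.0.2.10/32      # hypothetical API server address
```

Depending on the network plug-in, the policy may be evaluated after the Service IP address of the API server is translated to an endpoint address, so verify which address must be excluded in your environment.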
- Disable service discovery
You can reduce the amount of information provided to a pod if the pod does not need to look up and call cluster services. You can set the DNS policy of the pod so that the pod does not use CoreDNS, and disable service links so that Services in the namespace of the pod are not exposed as environment variables. For more information, see Environment variables.
By default, the DNS policy of a pod is set to ClusterFirst, which requires the pod to use the in-cluster DNS service. If the DNS policy is set to Default, the pod is required to use the DNS resolution configurations from the underlying node. For more information, see Pod DNS policy.
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-service-info
spec:
  dnsPolicy: Default # The value Default is not the default setting of a DNS policy.
  enableServiceLinks: false
After you disable service links and change the DNS policy of a pod, the pod can still reach the in-cluster DNS service over the network. Attackers can enumerate Services in an ACK cluster by accessing the in-cluster DNS service. Example: dig SRV *.*.svc.cluster.local @$CLUSTER_DNS_IP. For more information about how to prevent service discovery within a cluster, see Use network policies.
- Configure container images to use a read-only file system
You can configure container images to use a read-only file system to prevent attackers from overwriting files in the file system that is used by your application. If your application must write data to the file system, you can set the application to write to a temporary directory or mount a volume to the application. You can configure container images to use a read-only file system by using the following securityContext setting in the pod specification:
...
securityContext:
  readOnlyRootFilesystem: true
...
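For context, the following sketch shows where this setting might sit in a complete pod, together with a writable temporary directory for an application that must write data. The pod name, the image registry.example.com/demo/app:1.0, and the volume name tmp are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-readonly-rootfs        # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/demo/app:1.0   # hypothetical image
      securityContext:
        readOnlyRootFilesystem: true   # the root file system of the container is mounted read-only
      volumeMounts:
        - name: tmp
          mountPath: /tmp              # writable scratch space for the application
  volumes:
    - name: tmp
      emptyDir: {}                     # removed when the pod is deleted
```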