This topic describes how to enforce pod security policies to protect your Container Service for Kubernetes (ACK) clusters.
Prevent container escapes that allow attackers to escalate privileges
Kubernetes developers and O&M administrators must prevent container escapes that allow attackers to escalate privileges and take control of the host. By default, processes within a container run in the context of the Linux root user. The operations that the root user can perform are limited by the Linux capabilities that Docker assigns to the container. However, an attacker can exploit the default capabilities to escalate privileges or access sensitive information on the host, such as Secrets and ConfigMaps. Docker assigns the following default capabilities to a container. For more information, see capabilities(7) - Linux manual page.
    cap_chown, cap_dac_override, cap_fowner, cap_fsetid, cap_kill, cap_setgid, cap_setuid, cap_setpcap, cap_net_bind_service, cap_net_raw, cap_sys_chroot, cap_mknod, cap_audit_write, cap_setfcap
To prevent container escapes, avoid running Docker containers with the privileged flag, because a privileged container is assigned all Linux capabilities of the root user.
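As an illustration of restricting capabilities at the workload level, the following sketch shows a Pod whose container does not request privileged mode, drops all capabilities, and adds back a single one. The pod name, the image, and the choice of NET_BIND_SERVICE are assumptions made for the example and do not come from ACK defaults.

```yaml
# Minimal sketch. The names and the image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: app-restricted-caps          # hypothetical pod name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        privileged: false            # do not request privileged mode
        capabilities:
          drop: ["ALL"]              # start from an empty capability set
          add: ["NET_BIND_SERVICE"]  # add back only what the workload needs (assumed here)
```

Kubernetes capability names omit the cap_ prefix that appears in the Linux list above.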
All Kubernetes worker nodes use the node authorizer, a special-purpose authorization mode that authorizes the API requests sent by kubelets. The node authorizer allows a kubelet to perform the following operations:
- Read operations on:
  - Services
  - Endpoints
  - Nodes
  - Pods
  - Secrets, ConfigMaps, persistent volumes (PVs), and persistent volume claims (PVCs) of pods that are deployed on the node where the kubelet runs
- Write operations on:
  - Nodes and node status. Enable the NodeRestriction admission plug-in to allow the kubelet to modify only the node where the kubelet runs.
  - Pods and pod status. Enable the NodeRestriction admission plug-in to allow the kubelet to modify only the pods that are deployed on the node where the kubelet runs.
  - Events
- Authentication-related operations:
  - Read and write permissions on the CertificateSigningRequest (CSR) API for Transport Layer Security (TLS) bootstrapping
  - Create TokenReview and SubjectAccessReview objects for delegated authentication and authorization checks
By default, ACK clusters use the NodeRestriction admission controller. The NodeRestriction admission controller allows a kubelet to modify only the Node API object and Pod API objects that are bound to the node. However, the admission controller cannot prevent attackers from collecting sensitive information from the Kubernetes API. For more information, see NodeRestriction.
PodSecurityPolicy (PSP) is deprecated as of Kubernetes 1.21 and is removed in Kubernetes 1.25. If you use the PSP feature, we recommend that you find an alternative before the feature is removed. The Kubernetes community is developing a built-in admission controller to replace PSP. ACK will provide policy governance solutions that use the Open Policy Agent (OPA) to replace the PSP feature in later versions.
Suggestions on pod security
- Limit the containers that can run in privileged mode
Privileged containers inherit all Linux capabilities of the root user on the same host. In most scenarios, containers do not need these capabilities to handle workloads. You can create a pod security policy to prevent pods from running in privileged mode. A pod security policy is a set of constraints that a pod must meet before the pod can be created. The PodSecurityPolicy admission controller of Kubernetes validates requests to create and update pods in your cluster against the rules that you configured. If a request does not meet the rules, the request is rejected and an error is returned.
By default, the PodSecurityPolicy admission controller is enabled for ACK clusters and a pod security policy named ack.privileged is created. The default pod security policy allows all types of pods. We recommend that you manage pod security policies by namespace based on the principle of least privilege. For example, you can forbid privileged pods in a specified namespace, require read-only file systems, or allow only host paths within a specified range to be mounted. The following code block shows an example of a pod security policy:
    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      name: restricted
      annotations:
        seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
        apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
        seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
        apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
    spec:
      privileged: false
      # Prevent escalations to root.
      allowPrivilegeEscalation: false
      requiredDropCapabilities:
        - ALL
      # Allow core volume types.
      volumes:
        - 'configMap'
        - 'emptyDir'
        - 'projected'
        - 'secret'
        - 'downwardAPI'
        # Assume that the PVs created by the cluster administrator are safe to use.
        - 'persistentVolumeClaim'
      hostNetwork: false
      hostIPC: false
      hostPID: false
      runAsUser:
        # Require pods to run without root privileges.
        rule: 'MustRunAsNonRoot'
      seLinux:
        # Assume that all nodes use AppArmor instead of SELinux.
        rule: 'RunAsAny'
      supplementalGroups:
        rule: 'MustRunAs'
        ranges:
          # Forbid users to add the root group.
          - min: 1
            max: 65535
      fsGroup:
        rule: 'MustRunAs'
        ranges:
          # Forbid users to add the root group.
          - min: 1
            max: 65535
      readOnlyRootFilesystem: false
The preceding policy can be used to prevent privileged pods or escalations to root. The policy can also be used to limit the types of volumes that can be mounted and forbid users to add the root group. For more information about how to enhance pod security by using pod security policies, see Pod Security Policies.
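A pod security policy applies only to pods whose creating user or service account is authorized to use it. As a sketch of managing policies by namespace, the following RBAC objects authorize the service accounts of one namespace to use the restricted policy. The namespace team-a and the object names are hypothetical. If a more permissive policy, such as ack.privileged, remains authorized for the same subjects, pods may still be admitted under that policy.

```yaml
# Sketch only. "team-a" and the role and binding names are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted-user
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    verbs: ["use"]
    resourceNames: ["restricted"]    # the policy defined above
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-restricted-binding
  namespace: team-a                  # the binding is valid only in this namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted-user
subjects:
  # All service accounts in the team-a namespace.
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:serviceaccounts:team-a
```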
- Run pods as a non-root user
By default, all pods run with root privileges. Attackers can exploit vulnerabilities in applications to gain access to the shell of a pod, which poses risks to pod security. You can mitigate these risks in several ways: delete the shell from the container image, add the USER instruction to the Dockerfile, or run the containers as a non-root user. The spec.securityContext attribute in the podSpec contains the runAsUser and runAsGroup fields, which specify the user and group that the containers run as. You can create a pod security policy to enforce the use of these fields. For more information, see Users and groups.
- Forbid users to run Docker-in-Docker containers or mount Docker.sock to containers
You can build or deploy container images inside a Docker container by using the Docker-in-Docker method or by mounting Docker.sock. However, both methods give the process that runs inside the container access to the node. For more information about building container images on Kubernetes, see Use Container Registry Enterprise Edition instances to build images, Kaniko, and img.
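Returning to the recommendation to run pods as a non-root user earlier in this list, the following sketch shows how the runAsUser and runAsGroup fields might be set at the pod level. The user ID, the group ID, the pod name, and the image are illustrative assumptions. The image must be able to run under the chosen IDs.

```yaml
# Sketch only. The IDs, names, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: pod-non-root                 # hypothetical pod name
spec:
  securityContext:
    runAsNonRoot: true               # reject containers that would run as UID 0
    runAsUser: 1000                  # assumed non-root user ID
    runAsGroup: 3000                 # assumed primary group ID
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
```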
- Limit the use of hostPath volumes, or allow only hostPath volumes that are set to read-only and mount paths that start with specified prefixes
A hostPath volume mounts a path from the host to a pod. In most cases, pods do not require hostPath volumes. Make sure that you understand the risks before you use hostPath volumes. By default, pods that run with root privileges have read and write permissions on the file systems that are exposed by using hostPath volumes. Attackers can modify the kubelet settings and create symbolic links to paths or files that are not directly exposed by hostPath volumes. For example, attackers can access /etc/shadow, install SSH keys, read Secrets that are mounted to the host, and perform other malicious activities. To mitigate the risks that arise from hostPath volumes, set readOnly to true in spec.containers.volumeMounts. Example:
    volumeMounts:
      - name: hostpath-volume
        readOnly: true
        mountPath: /host-path
You can also use a pod security policy to limit the paths that can be mounted by using hostPath volumes. For example, the following pod security policy specifies that only paths that start with /foo on the host can be mounted.
    allowedHostPaths:
      # This allows "/foo", "/foo/", "/foo/bar" etc., but
      # disallows "/fool", "/etc/foo" etc.
      # "/foo/../" is never valid.
      - pathPrefix: "/foo"
        readOnly: true # only allow read-only mounts
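To show how the two fragments above fit together in a workload, the following sketch declares a hostPath volume under the /foo prefix and mounts it read-only. The pod name, the volume name, the image, and the /foo/logs path are assumptions made for the example.

```yaml
# Sketch only. Names, image, and host path are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: pod-hostpath-readonly        # hypothetical pod name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      volumeMounts:
        - name: host-logs
          mountPath: /host-path
          readOnly: true             # the container cannot write through this mount
  volumes:
    - name: host-logs
      hostPath:
        path: /foo/logs              # assumed path under the allowed /foo prefix
        type: Directory              # the directory must already exist on the host
```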
- Set resource requests and limits for each container to avoid resource competition or prevent DoS attacks
A pod without resource requests or limits can consume all of the resources on a host. If additional pods are scheduled to the node, the CPU or memory resources of the node may be exhausted. As a result, the kubelet may crash or pods may be evicted from the node. You cannot eliminate this risk entirely. However, you can set resource requests and limits to minimize resource competition and reduce the risks from improperly programmed applications that consume excessive resources.
You can specify requests and limits for CPU and memory resources in the podSpec. You can set a resource quota or limit range on a namespace to enforce the use of requests and limits. A resource quota specifies the total amount of resources, such as CPU and memory resources, that are allocated to a namespace. After you apply a resource quota to a namespace, you must specify requests and limits for all containers deployed in the namespace. A limit range provides finer-grained control over the resources that are allocated. You can use limit ranges to specify the maximum and minimum amounts of CPU and memory resources that each pod or container in a namespace can use, and to set default request or limit values for containers that do not specify them. For more information, see Managing Resources for Containers.
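The following sketch shows what a resource quota and a limit range on a namespace might look like. The namespace team-a, the object names, and all quantities are illustrative assumptions. Choose values that match your node sizes and workloads.

```yaml
# Sketch only. The namespace, names, and quantities are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"        # total CPU that all pods in the namespace may request
    requests.memory: 8Gi
    limits.cpu: "8"          # total CPU limit across all pods in the namespace
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container sets no limit
        cpu: 500m
        memory: 256Mi
      min:                   # smallest values a container may specify
        cpu: 50m
        memory: 64Mi
      max:                   # largest values a container may specify
        cpu: "1"
        memory: 1Gi
```

Because the limit range supplies defaults, containers that omit requests and limits can still be admitted under the quota.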
- Forbid privilege escalation
Privilege escalation allows a process to change the security context under which it runs. For example, sudo is a binary file that has the SUID or SGID bit set. Privilege escalation is a method that a user can use to execute a file with the permissions of another user or user group. To prevent privilege escalation, you can use a pod security policy that has allowPrivilegeEscalation set to false, or set securityContext.allowPrivilegeEscalation to false in the podSpec.
- Disable automatic ServiceAccount token mounting
For pods that do not need to access the Kubernetes API, you can disable automatic ServiceAccount token mounting in the PodSpec of specific pods, or disable this feature for all pods that use a specific ServiceAccount.
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-no-automount
    spec:
      automountServiceAccountToken: false

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sa-no-automount
    automountServiceAccountToken: false
After you disable automatic ServiceAccount token mounting for a pod, the pod still has network access to the Kubernetes API. To prevent a pod from accessing the Kubernetes API, you must regulate access control on the endpoint of the ACK cluster and configure network policies to block the pod. For more information, see Use network policies.
- Disable service discovery
You can reduce the amount of information provided to a pod if the pod does not need to look up and call cluster services. To do this, change the DNS policy of the pod so that the pod does not use CoreDNS, and disable the exposure of Services as environment variables in the namespace of the pod. For more information, see Environment variables.
By default, the DNS policy of a pod is set to ClusterFirst, which requires the pod to use the in-cluster DNS service. If the DNS policy is set to Default, the pod uses the DNS resolution configurations of the underlying node. For more information, see Pod DNS policy.
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-no-service-info
    spec:
      dnsPolicy: Default # The value Default is not the default setting of a DNS policy.
      enableServiceLinks: false
After you disable service links and change the DNS policy of a pod, the pod can still reach the in-cluster DNS service over the network. Attackers can enumerate Services in an ACK cluster by querying the in-cluster DNS service. Example:
    dig SRV *.*.svc.cluster.local @$CLUSTER_DNS_IP
For more information about how to prevent service discovery within a cluster, see Use network policies.
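As a rough sketch of the network policy approach, the following policy denies all egress traffic from pods that carry a given label, which also blocks queries to the in-cluster DNS service and requests to the Kubernetes API. The namespace, the policy name, and the label are hypothetical. The sketch assumes that the network plug-in of the cluster enforces NetworkPolicy objects. Most workloads need additional egress rules that allow the destinations they legitimately use.

```yaml
# Sketch only. The namespace, name, and label are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-egress-isolated
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      isolation: no-egress   # hypothetical label on the pods to isolate
  policyTypes:
    - Egress
  egress: []                 # no egress rules: all outbound traffic is denied
```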
- Configure container images to use a read-only file system
You can configure container images to use a read-only file system to prevent attackers from overwriting files in the file system that is used by your application. If your application must write data to the file system, you can set the application to write to a temporary directory or mount a volume to the application. To use a read-only root file system, set the following field in the securityContext of the container:
    ...
    securityContext:
      readOnlyRootFilesystem: true
    ...
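The following sketch places the setting in a complete Pod and pairs it with a writable emptyDir volume for temporary files, as suggested above. It also sets allowPrivilegeEscalation to false, in line with the privilege escalation item earlier in this list. The pod name, the image, and the /tmp mount path are assumptions. The application must be configured to write only to the mounted path.

```yaml
# Sketch only. Names, image, and mount path are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: pod-readonly-rootfs          # hypothetical pod name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        readOnlyRootFilesystem: true     # the root file system is mounted read-only
        allowPrivilegeEscalation: false  # the process cannot gain more privileges than its parent
      volumeMounts:
        - name: tmp
          mountPath: /tmp                # writable scratch space for the application
  volumes:
    - name: tmp
      emptyDir: {}
```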