By Alwyn Botha, Alibaba Cloud Community Blog author.
Taints and tolerations are best understood through hands-on exercises, which this tutorial provides.
Taints mark a node as undesirable for Pods. Pods specify tolerations - meaning they will tolerate a node with certain taints.
You can use taints and tolerations to deliberately prevent certain Pods from running on a node, or to deliberately let only certain Pods run on a node (for example, Pods that need SSDs or GPUs).
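For example, to dedicate a node to GPU workloads, you could taint it and give only your GPU Pods the matching toleration. This is a sketch: the node name gpu-node-1 and the key/value hardware=gpu are made-up illustrations, not names used later in this tutorial.

```yaml
# Taint the (hypothetical) GPU node so ordinary Pods avoid it:
#   kubectl taint nodes gpu-node-1 hardware=gpu:NoSchedule
# Only Pods carrying this toleration may then be scheduled there:
tolerations:
- key: "hardware"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
```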
This tutorial contains several examples of taints and tolerations to help you get practical experience of this abstract concept.
This tutorial will work best if run on a Kubernetes cluster with only one node.
One node gets tainted and Pods are run to determine if they tolerate the taints on that one node.
If you have a vast cluster of nodes, your Pod will automatically run on any of the untainted nodes.
To learn taints and tolerations fastest it is best to have access to only one node.
If you must run this tutorial on a multi-node cluster, you can simulate a single node: for all the Pod specs below, also add a nodeName field. Set nodeName equal to the one node you have control over.
Nodename reference information : https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodename
Summary: simply add nodeName: my-node-name to your YAML spec files.
This way the Kubernetes scheduler will only attempt to run your Pod on that ONE node. You can then learn in a tiny environment that you completely control.
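A minimal sketch of what that looks like in a Pod spec (my-node-name is a placeholder; substitute the name shown by kubectl get nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  nodeName: my-node-name   # bind this Pod to one specific node
  containers:
  - name: mycontainer
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
```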
Unfortunately, this part of the tutorial will make the most sense after you have done all the exercises below.
So for the moment just take my word for it: run on a single-node cluster or use nodeName.
After you have followed this tutorial, redo the Pods that failed in a cluster with more than one node. You will see that those Pods - unschedulable on a cluster with one tainted node - get scheduled on the other (untainted) nodes. ( These previous 2 sentences will also make sense after you have done the complete tutorial. )
We add taints to a node using this syntax:

kubectl taint nodes node-name key=value:effect

You have to supply your node name, your key, and your value.

Our first taint:

kubectl taint nodes minikube dedicated-app=my-dedi-app-a:NoSchedule

We taint our node called minikube so that Pods that do not tolerate the key dedicated-app with the value my-dedi-app-a cannot be scheduled on this node ( NoSchedule ).

Note the label below: app: my-dedi-app-a
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 3600']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
Our Pod uses busybox to sleep for 3600 seconds.
Note there are no tolerations in our Pod spec.
Our node is tainted, but our Pod does not have a toleration for that taint.
Theory suggests that this Pod will not be able to run on this node. Let's investigate:
Create Pod
kubectl create -f mybusybox.yaml
pod/mybusypod created
Get list of Pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Pending 0 5s
Only relevant output from describe command:
kubectl describe pod/mybusypod
Name: mybusypod
Labels: app=my-dedi-app-a
Status: Pending
Containers:
my-dedi-container-a:
Image: busybox
Conditions:
Type Status
PodScheduled False
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 25s default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Last line explains what happened: 1 node(s) had taints that the pod didn't tolerate.
PodScheduled False ... Pod cannot be scheduled onto this node.
Tainting works. It prevents Pods that do not have a toleration for that taint from running on that node.
( I only have one Kubernetes node. In a setup where you have several nodes, Kubernetes will automatically search all the nodes until it finds a node where this Pod can run ... or it will state something like ... 0/397 nodes are available: 397 node(s) had taints that the pod didn't tolerate. )
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
Let's now add a toleration for that taint to our Pod:
Note that the last 5 lines add a toleration.
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 3600']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"
Create:
kubectl create -f mybusybox.yaml
pod/mybusypod created
List Pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 4s
Success: the Pod is now running. It tolerates the taint on the node.
Note the important Tolerations: dedicated-app=my-dedi-app-a:NoSchedule line below.
The other 2 tolerations are automatically added by Kubernetes to all Pods.
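If you wrote those two automatically added tolerations out by hand, they would look like this. Kubernetes adds them to every Pod so that Pods are evicted 300 seconds after a node becomes not-ready or unreachable:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```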
kubectl describe pod/mybusypod
Name: mybusypod
Node: minikube/10.0.2.15
Start Time: Mon, 11 Feb 2019 07:53:59 +0200
Labels: app=my-dedi-app-a
Status: Running
Containers:
my-dedi-container-a:
Command:
State: Running
Ready: True
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Tolerations: dedicated-app=my-dedi-app-a:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned default/mybusypod to minikube
Normal Pulled 10s kubelet, minikube Container image "busybox" already present on machine
Normal Created 10s kubelet, minikube Created container
Normal Started 10s kubelet, minikube Started container
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
We keep our existing taint and toleration in place. ( They work. )
We add another taint: Pods that do not tolerate the key dedicated-app-exec with the value my-dedi-app-a are not allowed to run on our node, and are evicted if already running ( NoExecute ).
kubectl taint nodes minikube dedicated-app-exec=my-dedi-app-a:NoExecute
node/minikube tainted
Investigate the first few lines of our node description:
Note that we now have 2 taints ( at the bottom ).
kubectl describe node | head -n13
Name: minikube
Roles: master
Taints: dedicated-app-exec=my-dedi-app-a:NoExecute
dedicated-app=my-dedi-app-a:NoSchedule
Our Pod spec is as before: no toleration for this second taint.
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 3600']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"
We expect our Pod to be unable to run on this tainted node.
kubectl create -f mybusybox.yaml
pod/mybusypod created
List Pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Pending 0 3s
As expected, the Pod is Pending.
Investigate why: look at the final line in the output.
kubectl describe pod/mybusypod
Name: mybusypod
Labels: app=my-dedi-app-a
Status: Pending
IP:
Containers:
my-dedi-container-a:
Image: busybox
Conditions:
Type Status
PodScheduled False
Tolerations: dedicated-app=my-dedi-app-a:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 17s (x2 over 17s) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Behavior as expected.
It is disappointing to see that the warning does not state WHICH taint the Pod did not tolerate. ( In production you will have many nodes, each with many taints, and lists of tolerations on your Pods. So you have to go through those lists manually to see which taint is not tolerated. )
Now we specify a toleration for the taint so that our Pod can run on the node. ( Note the last 5 lines. )
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 3600']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"
  - key: "dedicated-app-exec"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoExecute"
    tolerationSeconds: 60
The tolerationSeconds: 60 specifies that our Pod tolerates the dedicated-app-exec taint for only 60 seconds. After 60 seconds the Pod is evicted from the node.
Create Pod
kubectl create -f mybusybox.yaml
pod/mybusypod created
kubectl get pods

NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 2s
A minute later ...
kubectl get pods
No resources found.
The Pod was automatically deleted from our node.
A node may have any number of taints.
Pods may have any number of tolerations.
We add more taints, but first we add another label to our Pod.
Below we add which-end: frontend label.
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 10']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"
  - key: "dedicated-app-exec"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoExecute"
    tolerationSeconds: 60
We taint our node with a new taint using the which-end key.
kubectl taint nodes minikube which-end=frontend:NoSchedule
node/minikube tainted
kubectl describe node | head -n13
Name: minikube
Taints: dedicated-app-exec=my-dedi-app-a:NoExecute
dedicated-app=my-dedi-app-a:NoSchedule
which-end=frontend:NoSchedule
Our Pod has no toleration for this taint, so scheduling will fail. The Pod's tolerations are:
Tolerations: dedicated-app=my-dedi-app-a:NoSchedule
dedicated-app-exec=my-dedi-app-a:NoExecute for 60s
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
kubectl create -f mybusybox.yaml
pod/mybusypod created
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Pending 0 2s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12s (x2 over 12s) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
All this should be old news to you at this point.
In the next section you will learn an alternative way to tolerate taints.
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
Note the last 2 lines of our Pod spec:
- key: "which-end"
operator: "Exists"
This special syntax specifies that our Pod tolerates ALL which-end key values.
Note our Pod spec labels below: which-end: frontend ... our Pod provides front-end functionality.
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 10']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"
  - key: "dedicated-app-exec"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoExecute"
    tolerationSeconds: 60
  - key: "which-end"
    operator: "Exists"
A note on the command 'sleep 10': we only let the Pod run for 10 seconds. It has tolerationSeconds: 60, so it will run to completion ( within 10 seconds ) before it can be evicted.
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 3s
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Completed 0 18s
List of tolerations for our Pod - note which-end at bottom. Pod tolerates ALL which-end taints.
Tolerations: dedicated-app=my-dedi-app-a:NoSchedule
dedicated-app-exec=my-dedi-app-a:NoExecute for 60s
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
which-end
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
Note the special syntax below: tolerate ALL taints.
tolerations:
- operator: "Exists"
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 10']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  tolerations:
  - operator: "Exists"
Create Pod
kubectl create -f mybusybox.yaml
pod/mybusypod created
We see below that our Pod runs to completion with no problems.
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 2s
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 7s
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 12s
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Completed 0 16s
Investigate the kubectl describe pod/mybusypod output:
Tolerations:
Amazingly, nothing there indicates that our Pod tolerates all taints - the Tolerations field is simply blank.
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
The NoSchedule taint prevents scheduling Pods on a node.
The PreferNoSchedule taint also discourages scheduling Pods on a node, BUT if no suitable untainted node can be found, the scheduler WILL place the Pod on that node.
From https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
This is a "preference" or "soft" version of NoSchedule – the system will try to avoid placing a pod that does not tolerate the taint on the node, but it is not required.
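A toleration for a PreferNoSchedule taint uses the same syntax as before; only the effect differs. This sketch is not needed for the exercise below ( the point of PreferNoSchedule is that Pods without a toleration may still land on the node ), but it shows the shape:

```yaml
tolerations:
- key: "dedicated-app"
  operator: "Equal"
  value: "my-dedi-app-a"
  effect: "PreferNoSchedule"
```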
Current taints on our node:
Taints: dedicated-app-exec=my-dedi-app-a:NoExecute
dedicated-app=my-dedi-app-a:NoSchedule
which-end=frontend:NoSchedule
Let's remove all these taints using the syntax kubectl taint nodes node-name key:effect- ( note the hyphen at the end ... read it as: subtract this taint from the node ).
kubectl taint nodes minikube dedicated-app-exec:NoExecute-
kubectl taint nodes minikube dedicated-app:NoSchedule-
kubectl taint nodes minikube which-end:NoSchedule-
node/minikube untainted
node/minikube untainted
node/minikube untainted
Our node is now totally untainted:
Taints: <none>
Let's add only a PreferNoSchedule taint - so that we can see how it works.
kubectl taint nodes minikube dedicated-app=my-dedi-app-a:PreferNoSchedule
node/minikube tainted
Our Pod spec specifies NO tolerations.
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 10']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
Create Pod
kubectl create -f mybusybox.yaml
pod/mybusypod created
The scheduler could not find any untainted node, so the PreferNoSchedule taint allows this Pod on this node.
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 ContainerCreating 0 2s
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 5s
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 8s
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Completed 0 13s
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
An extract of our node details:
kubectl describe node | head -n13
Name: minikube
Taints: dedicated-app=my-dedi-app-a:PreferNoSchedule
Unschedulable: false
Unschedulable: false means Schedulable: true.
( Why Kubernetes uses the double negative Unschedulable, I do not know. )
We can set the node to Unschedulable = true by cordoning the node.
kubectl cordon minikube
node/minikube cordoned
The node will not allow any new Pods to start running:
Create Pod
kubectl create -f mybusybox.yaml
pod/mybusypod created
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Pending 0 3s
Pending, as expected.
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7s default-scheduler 0/1 nodes are available: 1 node(s) were unschedulable.
We use uncordon to allow Pods to run on the node again.
kubectl uncordon minikube
node/minikube uncordoned
Our Pod starts running automatically.
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 45s
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 52s
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Completed 0 60s
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
Note the last 3 lines of our Pod spec: this is how we tolerate the unschedulable condition.
nano mybusybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'sleep 10']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Exists"
Mark the node unschedulable:
kubectl cordon minikube
node/minikube cordoned
Attempt to run our Pod:
kubectl create -f mybusybox.yaml
pod/mybusypod created
Success below: our Pod tolerates the unschedulable status.
kubectl get pods
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 2s
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 7s
NAME READY STATUS RESTARTS AGE
mybusypod 1/1 Running 0 11s
NAME READY STATUS RESTARTS AGE
mybusypod 0/1 Completed 0 15s
Using the same syntax as above you can ( during emergencies ) run Pods on nodes with conditions such as node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/memory-pressure, node.kubernetes.io/disk-pressure, node.kubernetes.io/pid-pressure and node.kubernetes.io/network-unavailable.
These are not tolerations you should add to your Pods in general. This trick is for emergency use only.
Most critical Kubernetes system daemons tolerate those taints. ( These daemons must continue to run during such conditions to keep the Kubernetes system running. )
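As a sketch, an emergency Pod that must run even on a node under memory pressure would carry a toleration like this ( use sparingly, for the reasons above ):

```yaml
tolerations:
- key: "node.kubernetes.io/memory-pressure"
  operator: "Exists"
  effect: "NoSchedule"
```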
Make the node available again.
kubectl uncordon minikube
node/minikube uncordoned
kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted
Determine list of taints on this node:
kubectl describe node | head -n13
Name: minikube
Taints: dedicated-app=my-dedi-app-a:PreferNoSchedule
Unschedulable: false
Conditions:
Let's remove the taint.
kubectl taint nodes minikube dedicated-app:PreferNoSchedule-
node/minikube untainted