Taints and Affinity
So far, when we deployed any Pod in the Kubernetes cluster, it ran on any node that met the requirements (i.e. memory requirements, CPU requirements, …).
However, in Kubernetes there are two concepts that allow you to further configure the scheduler, so that Pods are assigned to Nodes following some business criteria.
Preparation
Minikube Multinode
If you are running this tutorial in Minikube, you need to add more nodes to run this part of the tutorial. Check the number of nodes you have deployed by running:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube Ready master 54m v1.17.3
If only one node is present, you need to create a new node by following the next steps:
With minikube installed and in your PATH, run:
minikube node add -p devnation
Or, if you do not have enough resources to add a new node with the same resources as the master node, you can create a new cluster with minimal requirements.
minikube start --nodes 2 -p multinode --kubernetes-version='v1.26.1' --vm-driver='virtualbox' --memory=2048
kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube Ready master 54m v1.17.3
kube-m02 Ready <none> 2m50s v1.17.3
Multi-node clusters are currently experimental and might exhibit unintended behavior. |
Watch Nodes
To be able to observe what’s going on, let’s open another terminal (Terminal 2) and watch
what happens to the pods as we change taints on the nodes.
watch -n 1 "kubectl get pods -o wide \(1)
| awk '{print \$1 \" \" \$2 \" \" \$3 \" \" \$5 \" \" \$7}' | column -t" (2)
1 | the -o wide option allows us to see the node that the pod is scheduled to |
2 | to keep the line from getting too long we’ll use awk and column to get and format only the columns we want |
Taints
A Taint is applied to a Kubernetes Node and signals the scheduler to avoid scheduling certain Pods on it.
A Toleration is applied to a Pod definition and provides an exception to the taint.
Let’s describe the current nodes. In this case an OpenShift cluster is used, so you can see several nodes:
kubectl describe nodes | egrep "Name:|Taints:"
Name: ip-10-0-136-107.eu-central-1.compute.internal
Taints: node-role.kubernetes.io/master:NoSchedule
Name: ip-10-0-140-186.eu-central-1.compute.internal
Taints: <none>
Name: ip-10-0-141-128.eu-central-1.compute.internal
Taints: <none>
Name: ip-10-0-146-109.eu-central-1.compute.internal
Taints: <none>
Name: ip-10-0-150-226.eu-central-1.compute.internal
Taints: <none>
Notice that in this case, the master node already carries a taint (node-role.kubernetes.io/master:NoSchedule), which keeps regular workloads off it.
Let’s add a taint to all nodes:
kubectl taint nodes --all=true color=blue:NoSchedule
node/ip-10-0-136-107.eu-central-1.compute.internal tainted
node/ip-10-0-140-186.eu-central-1.compute.internal tainted
node/ip-10-0-141-128.eu-central-1.compute.internal tainted
node/ip-10-0-146-109.eu-central-1.compute.internal tainted
node/ip-10-0-150-226.eu-central-1.compute.internal tainted
node/ip-10-0-155-122.eu-central-1.compute.internal tainted
node/ip-10-0-162-206.eu-central-1.compute.internal tainted
node/ip-10-0-168-102.eu-central-1.compute.internal tainted
node/ip-10-0-175-64.eu-central-1.compute.internal tainted
The color=blue part is simply a key=value pair that identifies the taint, and NoSchedule is the effect applied to Pods that can't "tolerate" the taint. In other words, if a Pod does not tolerate color=blue, the effect is NoSchedule: the Pod will not be scheduled onto that node.
So let’s try this out. From the main terminal, we’ll deploy a new pod that doesn’t have any particular tolerations:
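For example, assuming the plain myboot Deployment manifest ships alongside the other tutorial files (the path below is an assumption; adjust it to your checkout):
# path is assumed; use the plain myboot Deployment manifest from your tutorial files
kubectl apply -f apps/kubefiles/myboot-deployment.yml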
You’ll see the output in the other terminal change.
The pod will remain in Pending status as it has no schedulable Node available.
We can get more insight into this by entering the following
kubectl describe pod (1)
1 | There is only one pod in this case. If we wanted to be specific, we could add the name of the pod (e.g. myboot-7f889dd6d-n5z55 ) |
Name: myboot-7cbfbd9b89-bzhxw
Namespace: myspace
Priority: 0
Node: <none>
Labels: app=myboot
pod-template-hash=7cbfbd9b89
Annotations: <none>
Status: Pending
...
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 13s (x2 over 14s) default-scheduler 0/2 nodes are available: 2 node(s) had taint {color: blue}, that the pod didn't tolerate.
Let’s get the list of nodes in our cluster
kubectl get nodes
NAME STATUS ROLES AGE VERSION
devnation Ready control-plane,master 2d22h v1.21.2
devnation-m02 Ready <none> 40h v1.21.2
And pick one node that we will remove the taint from:
kubectl taint node devnation-m02 color:NoSchedule- (1)
1 | adding the - here means to remove the taint in question (the color with the action NoSchedule ) |
node/devnation-m02 untainted
kubectl describe pod (1)
1 | There is only one pod in this case. If we wanted to be specific, we could add the name of the pod (e.g. myboot-7f889dd6d-n5z55 ) |
Name: myboot-7f889dd6d-n5z55
Namespace: kubetut
Priority: 0
Node: <none>
Labels: app=myboot
pod-template-hash=7f889dd6d
Annotations: openshift.io/scc: restricted
Status: Pending
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/9 nodes are available: 9 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/9 nodes are available: 9 node(s) had taints that the pod didn't tolerate.
Let’s get the list of nodes in our cluster
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-136-107.eu-central-1.compute.internal Ready master 20h v1.16.2
ip-10-0-140-186.eu-central-1.compute.internal Ready worker 20h v1.16.2
ip-10-0-141-128.eu-central-1.compute.internal Ready worker 18h v1.16.2
ip-10-0-146-109.eu-central-1.compute.internal Ready worker 18h v1.16.2
ip-10-0-150-226.eu-central-1.compute.internal Ready worker 20h v1.16.2
ip-10-0-155-122.eu-central-1.compute.internal Ready master 20h v1.16.2
ip-10-0-162-206.eu-central-1.compute.internal Ready worker 20h v1.16.2
ip-10-0-168-102.eu-central-1.compute.internal Ready master 20h v1.16.2
ip-10-0-175-64.eu-central-1.compute.internal Ready worker 18h v1.16.2
And pick one node that we will remove the taint from:
kubectl taint node ip-10-0-140-186.eu-central-1.compute.internal color:NoSchedule- (1)
1 | adding the - here means to remove the taint in question (the color with the action NoSchedule ) |
node/ip-10-0-140-186.eu-central-1.compute.internal untainted
Now in Terminal 2 you should see the Pending pod scheduled to the newly untainted node.
NAME READY STATUS AGE NODE
myboot-7cbfbd9b89-hqx6h 0/1 ContainerCreating 20m devnation-m02
Finally, let’s take a quick look at the taint status on all the nodes.
kubectl describe nodes | egrep "Name:|Taints:"
Name: ip-10-0-136-107.eu-central-1.compute.internal
Taints: node-role.kubernetes.io/master:NoSchedule
Name: ip-10-0-140-186.eu-central-1.compute.internal
Taints: <none>
Name: ip-10-0-141-128.eu-central-1.compute.internal
Taints: color=blue:NoSchedule
Name: ip-10-0-146-109.eu-central-1.compute.internal
Taints: color=blue:NoSchedule
Restore Taint
Add the taint back to the node (or in this case all nodes):
kubectl taint nodes --all=true color=blue:NoSchedule --overwrite
Setting the taint on all nodes is a bit sloppy. If you'd like, you can get the same effect a bit more elegantly by setting the taint only on the node from which it was removed. For example: kubectl taint node ip-10-0-140-186.eu-central-1.compute.internal color=blue:NoSchedule |
Take a look and notice that the pod is still running despite the change in taint (this is because scheduling is a one-time activity in the lifecycle of a pod).
Tolerations
Let’s create a Pod but containing a toleration, so it can be scheduled to a tainted node.
spec:
tolerations:
- key: "color"
operator: "Equal"
value: "blue"
effect: "NoSchedule"
containers:
- name: myboot
image: quay.io/rhdevelopers/myboot:v1
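Apply it (the manifest path below is an assumption; use whichever file you saved the above spec in):
# file path is an assumption
kubectl apply -f apps/kubefiles/myboot-toleration.yml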
And before too long we should see, in our watch window, our pod get scheduled and advance to the Running state.
Now, although all nodes contain a taint, the Pod is scheduled and runs because we defined a toleration for the color=blue taint.
NoExecute Taint
So far, you’ve seen the NoSchedule taint effect, which means that newly created Pods will not be scheduled on a tainted node unless they have an overriding toleration.
But notice that if we add this taint to a node that already has running/scheduled Pods, this taint will not terminate them.
Let’s change that by using the NoExecute effect.
First of all, let’s remove all previous taints.
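For example, to remove the color taint from every node, mirroring the taint syntax used earlier:
kubectl taint nodes --all=true color:NoSchedule-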
Then deploy another instance of myboot (with no Tolerations):
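Again assuming the plain myboot Deployment manifest (the path is an assumption):
# path is assumed; reuse the myboot Deployment without any tolerations
kubectl apply -f apps/kubefiles/myboot-deployment.yml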
We should see the new pod appear in the watch window.
Now let’s find the node the pod is running on, so we can taint it:
NODE=$(kubectl get pod -o jsonpath='{.items[0].spec.nodeName}') (1)
echo ${NODE}
1 | the .items[0] is because we’re asking for all pods, but we know our list will contain only one element |
"ip-10-0-146-109.eu-central-1.compute.internal"
As soon as we do this, we should be able to watch this "rescheduling" occur in the Terminal 2 watch
NAME READY STATUS AGE NODE
myboot-7cbfbd9b89-5t24z 0/1 ContainerCreating 16s devnation
myboot-7cbfbd9b89-wpddg 1/1 Terminating 65m devnation-m02
If you have more nodes available, the Pod is terminated and deployed onto another node; if that is not the case, the Pod will remain in Pending status. |
Affinity & Anti-Affinity
There is another way of changing where Pods are scheduled, using Node/Pod Affinity and Anti-affinity. You can create rules that not only restrict where Pods can run but also favor where they should run.
In addition to creating affinities between Pods and Nodes, you can also create affinities between Pods. You can decide that a group of Pods should always be deployed together on the same node(s), for example because there is significant network communication between the Pods and you want to avoid external network calls, or because they share storage devices.
Node Affinity
Let’s deploy a new pod with a node affinity. Take a look at myboot-node-affinity.yml (relevant section shown below).
If you’re running this from within VSCode you can use CTRL+p (or CMD+p on Mac OSX) to quickly open the file. |
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution: (1)
nodeSelectorTerms:
- matchExpressions:
- key: color
operator: In
values:
- blue (2)
containers:
- name: myboot
image: quay.io/rhdevelopers/myboot:v1
1 | This key indicates that what follows must be satisfied during scheduling but is not a factor once a pod is executing |
2 | The matchExpressions says this pod has affinity for any node with a color label whose value is in the set blue |
Now let’s deploy this
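The path below assumes the same apps/kubefiles layout used elsewhere in this tutorial:
# path assumed; the file name matches the one referenced above
kubectl apply -f apps/kubefiles/myboot-node-affinity.yml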
And we’ll see in our watch window the pod in a pending state
Let’s create a label on a node matching the affinity expression:
Get a list of nodes:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
devnation Ready control-plane,master 3d v1.21.2
devnation-m02 Ready <none> 42h v1.21.2
Then pick a node in the list to label (such as the one highlighted)
kubectl label nodes devnation-m02 color=blue (1)
1 | Notice that this matches the affinity in the pod |
node/devnation-m02 labeled
Get a list of nodes:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-136-107.eu-central-1.compute.internal Ready master 26h v1.16.2
ip-10-0-140-186.eu-central-1.compute.internal Ready worker 26h v1.16.2
ip-10-0-141-128.eu-central-1.compute.internal Ready worker 25h v1.16.2
ip-10-0-146-109.eu-central-1.compute.internal Ready worker 25h v1.16.2
ip-10-0-150-226.eu-central-1.compute.internal Ready worker 26h v1.16.2
ip-10-0-155-122.eu-central-1.compute.internal Ready master 26h v1.16.2
ip-10-0-162-206.eu-central-1.compute.internal Ready worker 26h v1.16.2
ip-10-0-168-102.eu-central-1.compute.internal Ready master 26h v1.16.2
ip-10-0-175-64.eu-central-1.compute.internal Ready worker 25h v1.16.2
Then pick a node in the list to label (such as the one highlighted)
kubectl label nodes ip-10-0-175-64.eu-central-1.compute.internal color=blue (1)
1 | Notice that this matches the affinity in the pod |
node/ip-10-0-175-64.eu-central-1.compute.internal labeled
And then in the watch window the output should change to:
NAME READY STATUS AGE NODE
myboot-546d4d9b45-7vgfc 0/1 ContainerCreating 15m devnation-m02
Let’s delete the label from the node that the pod is running on
First find the node the pod is running on
NODE=$(kubectl get pod -o jsonpath='{.items[0].spec.nodeName}') (1)
echo ${NODE}
1 | the .items[0] is because we’re asking for all pods, but we know our list will contain only one element |
and then remove the color label from it
kubectl label nodes ${NODE} color-
And notice that the watch output is unchanged: if the pod is running, it will continue to run.
Since we used requiredDuringSchedulingIgnoredDuringExecution in the deployment spec for our pod, our affinity behaves like the taints in the previous section did: the rule applies during the scheduling phase but is ignored after that (i.e. once the pod is executing). Therefore the Pod is not removed in our case.
This is an example of a hard rule. There is also a way to create a soft rule; consider the example below:
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution: (1)
- weight: 1
preference:
matchExpressions:
- key: color
operator: In
values:
- blue
1 | You can see the use of the word preferred vs required. |
Pod Affinity/Anti-Affinity
Let’s deploy a new pod with a Pod Affinity. See this relevant part of myboot-pod-affinity.yml.
If you’re running this from within VSCode you can use CTRL+p (or CMD+p on Mac OSX) to quickly open the file. |
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: kubernetes.io/hostname (1)
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- myboot (2)
containers:
1 | The node label key. If two nodes are labeled with this key and have identical values, the scheduler treats both nodes as being in the same topology. In this case, hostname is a label that is different for each node. |
2 | The affinity is with Pods labeled with app=myboot . |
kubectl apply -f apps/kubefiles/myboot-pod-affinity.yml
NAME READY STATUS AGE NODE
myboot2-7c5f46cbc9-hwm2v 0/1 Pending 5h38m <none>
The myboot2 Pod is pending as the scheduler couldn’t find any Pod matching the affinity rule.
To address this, let’s deploy a myboot application labeled with app=myboot.
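For example (the manifest path is an assumption; any Deployment that labels its Pods with app=myboot satisfies the rule):
# path assumed
kubectl apply -f apps/kubefiles/myboot-deployment.yml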
And we’ll see that both start up, and run on the same node
NAME READY STATUS AGE NODE
myboot-7cbfbd9b89-267k6 0/1 ContainerCreating 5s devnation-m02
myboot2-7c5f46cbc9-hwm2v 0/1 ContainerCreating 5h45m devnation-m02
What you’ve just seen is a hard rule; you can use "soft" rules in Pod affinity as well, as sketched below.
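A minimal sketch of a soft Pod affinity rule (not part of the tutorial files; names are illustrative) uses preferredDuringSchedulingIgnoredDuringExecution with a weighted term:
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      # the scheduler prefers, but does not require, co-location with app=myboot Pods
      - weight: 1
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - myboot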
Anti-affinity is used to ensure that two Pods do NOT run together on the same node.
Let’s add another pod. Open myboot-pod-antiaffinity.yaml and focus on the following part.
If you’re running this from within VSCode you can use CTRL+p (or CMD+p on Mac OSX) to quickly open the file. |
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: kubernetes.io/hostname
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- myboot
This basically says that this pod should not be scheduled on any individual node (topologyKey: kubernetes.io/hostname) that has a pod with the app=myboot label.
Deploy a myboot3 with the above anti-affinity rule
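Assuming the same apps/kubefiles layout as before (the path is an assumption):
# apps/kubefiles/ prefix assumed from earlier commands
kubectl apply -f apps/kubefiles/myboot-pod-antiaffinity.yaml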
And then notice what happens in the watch window
NAME READY STATUS AGE NODE
myboot-7cbfbd9b89-267k6 1/1 Running 10m devnation-m02
myboot2-7c5f46cbc9-hwm2v 1/1 Running 5h56m devnation-m02
myboot3-6f95c866f6-7kvdw 0/1 ContainerCreating 6s devnation
As you can see from the highlight, the myboot3 Pod is deployed on a different node than the myboot Pod.