Taints and Affinity

So far, whenever we deployed a Pod in the Kubernetes cluster, it was run on any node that met its requirements (i.e. memory requirements, CPU requirements, etc.).

However, Kubernetes provides two concepts that allow you to further configure the scheduler, so that Pods are assigned to Nodes following some business criteria.

Preparation

Minikube Multinode

If you are running this tutorial in Minikube, you need to add more nodes to run this part of the tutorial. Check the number of nodes you have deployed by running:

kubectl get nodes
NAME       STATUS   ROLES    AGE     VERSION
kube       Ready    master   54m     v1.17.3

If only one node is present, you need to add a new node. With minikube installed and on your PATH, run:

minikube node add -p devnation

Alternatively, if you do not have enough resources to add a new node with the same resources as the master node, you can create a new cluster with minimal requirements:

minikube start --nodes 2 -p multinode --kubernetes-version='v1.26.1' --vm-driver='virtualbox' --memory=2048
kubectl get nodes
NAME       STATUS   ROLES    AGE     VERSION
kube       Ready    master   54m     v1.17.3
kube-m02   Ready    <none>   2m50s   v1.17.3

Multi-node clusters are currently experimental and might exhibit unintended behavior.

Watch Nodes

To be able to observe what’s going on, let’s open another terminal (Terminal 2) and watch what happens to the pods as we change taints on the nodes.

  • Terminal 2

watch -n 1 "kubectl get pods -o wide \(1)
  | awk '{print \$1 \" \" \$2 \" \" \$3 \" \" \$5 \" \" \$7}' | column -t" (2)
1 the -o wide option allows us to see the node that the pod is scheduled to
2 to keep the line from getting too long, we use awk and column to select and format only the columns we want

Taints

A Taint is applied to a Kubernetes Node and signals the scheduler to avoid scheduling certain Pods onto that node.

A Toleration is applied to a Pod definition and provides an exception to the taint.

Let’s describe the current nodes. In this case, because an OpenShift cluster is used, you can see several nodes:

kubectl describe nodes | egrep "Name:|Taints:"
Name:               ip-10-0-136-107.eu-central-1.compute.internal
Taints:             node-role.kubernetes.io/master:NoSchedule
Name:               ip-10-0-140-186.eu-central-1.compute.internal
Taints:             <none>
Name:               ip-10-0-141-128.eu-central-1.compute.internal
Taints:             <none>
Name:               ip-10-0-146-109.eu-central-1.compute.internal
Taints:             <none>
Name:               ip-10-0-150-226.eu-central-1.compute.internal
Taints:             <none>

Notice that in this case, the master node contains a taint which blocks your application Pods from being scheduled there.

Let’s add a taint to all nodes:

kubectl taint nodes --all=true color=blue:NoSchedule
node/ip-10-0-136-107.eu-central-1.compute.internal tainted
node/ip-10-0-140-186.eu-central-1.compute.internal tainted
node/ip-10-0-141-128.eu-central-1.compute.internal tainted
node/ip-10-0-146-109.eu-central-1.compute.internal tainted
node/ip-10-0-150-226.eu-central-1.compute.internal tainted
node/ip-10-0-155-122.eu-central-1.compute.internal tainted
node/ip-10-0-162-206.eu-central-1.compute.internal tainted
node/ip-10-0-168-102.eu-central-1.compute.internal tainted
node/ip-10-0-175-64.eu-central-1.compute.internal tainted

The color=blue part is simply a key=value pair that identifies the taint, and NoSchedule is the effect applied to Pods that cannot "tolerate" the taint. In other words, if a Pod does not tolerate color=blue, it will not be scheduled onto that node.
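
Under the hood, the taint ends up in the Node spec. If you inspect one of the tainted nodes with kubectl get node <node-name> -o yaml, you should see something like the following (a minimal sketch; NoSchedule is only one of the possible effects, the others being PreferNoSchedule and NoExecute):

spec:
  taints:
  - key: color
    value: blue
    effect: NoSchedule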

So let’s try this out. From the main terminal, we’ll deploy a new pod that doesn’t have any particular tolerations:

  • Terminal 1

kubectl apply -f apps/kubefiles/myboot-deployment.yml

You’ll see the output in the other terminal change

  • Terminal 2

NAME                      READY   STATUS    AGE     NODE
myboot-7cbfbd9b89-hqx6h   0/1     Pending   4m12s   <none>

The pod will remain in Pending status because there is no schedulable Node available for it.

We can get more insight into this by entering the following

  • Terminal 1 - Minikube

kubectl describe pod (1)
1 There is only one pod in this case. If we wanted to be specific, we could add the name of the pod (e.g. myboot-7f889dd6d-n5z55)
Name:           myboot-7cbfbd9b89-bzhxw
Namespace:      myspace
Priority:       0
Node:           <none>
Labels:         app=myboot
                pod-template-hash=7cbfbd9b89
Annotations:    <none>
Status:         Pending
...
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  13s (x2 over 14s)  default-scheduler  0/2 nodes are available: 2 node(s) had taint {color: blue}, that the pod didn't tolerate.

Let’s get the list of nodes in our cluster

kubectl get nodes
NAME            STATUS   ROLES                  AGE     VERSION
devnation       Ready    control-plane,master   2d22h   v1.21.2
devnation-m02   Ready    <none>                 40h     v1.21.2

And pick one node that we will remove the taint from:

kubectl taint node devnation-m02 color:NoSchedule- (1)
1 adding the - at the end means to remove the taint in question (the color key with the NoSchedule effect)
node/devnation-m02  untainted

  • Terminal 1 - OpenShift

kubectl describe pod (1)
1 There is only one pod in this case. If we wanted to be specific, we could add the name of the pod (e.g. myboot-7f889dd6d-n5z55)
Name:           myboot-7f889dd6d-n5z55
Namespace:      kubetut
Priority:       0
Node:           <none>
Labels:         app=myboot
                pod-template-hash=7f889dd6d
Annotations:    openshift.io/scc: restricted
Status:         Pending

Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/9 nodes are available: 9 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/9 nodes are available: 9 node(s) had taints that the pod didn't tolerate.

Let’s get the list of nodes in our cluster

kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-10-0-136-107.eu-central-1.compute.internal   Ready    master   20h   v1.16.2
ip-10-0-140-186.eu-central-1.compute.internal   Ready    worker   20h   v1.16.2
ip-10-0-141-128.eu-central-1.compute.internal   Ready    worker   18h   v1.16.2
ip-10-0-146-109.eu-central-1.compute.internal   Ready    worker   18h   v1.16.2
ip-10-0-150-226.eu-central-1.compute.internal   Ready    worker   20h   v1.16.2
ip-10-0-155-122.eu-central-1.compute.internal   Ready    master   20h   v1.16.2
ip-10-0-162-206.eu-central-1.compute.internal   Ready    worker   20h   v1.16.2
ip-10-0-168-102.eu-central-1.compute.internal   Ready    master   20h   v1.16.2
ip-10-0-175-64.eu-central-1.compute.internal    Ready    worker   18h   v1.16.2

And pick one node that we will remove the taint from:

kubectl taint node ip-10-0-140-186.eu-central-1.compute.internal color:NoSchedule- (1)
1 adding the - at the end means to remove the taint in question (the color key with the NoSchedule effect)
node/ip-10-0-140-186.eu-central-1.compute.internal  untainted

Now in Terminal 2 you should see the Pending pod scheduled to the newly untainted node.

  • Terminal 2

NAME                      READY   STATUS              AGE   NODE
myboot-7cbfbd9b89-hqx6h   0/1     ContainerCreating   20m   devnation-m02

Finally, let’s take a quick look at the taint status on all the nodes.

  • Terminal 1

kubectl describe nodes | egrep "Name:|Taints:"
Name:               ip-10-0-136-107.eu-central-1.compute.internal
Taints:             node-role.kubernetes.io/master:NoSchedule
Name:               ip-10-0-140-186.eu-central-1.compute.internal
Taints:             <none>
Name:               ip-10-0-141-128.eu-central-1.compute.internal
Taints:             color=blue:NoSchedule
Name:               ip-10-0-146-109.eu-central-1.compute.internal
Taints:             color=blue:NoSchedule

Restore Taint

Add the taint back to the node (or in this case all nodes):

kubectl taint nodes --all=true color=blue:NoSchedule --overwrite

Setting the taint on all nodes is a bit sloppy. If you’d like, you can get the same effect a bit more elegantly by setting the taint only on the node from which it was removed. For example:

kubectl taint node ip-10-0-140-186.eu-central-1.compute.internal color=blue:NoSchedule

Take a look and notice that the pod is still running despite the change in taint (this is because scheduling is a one-time activity in the lifecycle of a pod).

  • Terminal 2

NAME                      READY   STATUS    AGE   NODE
myboot-7cbfbd9b89-bzhxw   1/1     Running   18m   devnation-m02

Clean Up

Undeploy the myboot deployment (the taint was already restored in the previous step):

kubectl delete -f apps/kubefiles/myboot-deployment.yml

Tolerations

Let’s create a Pod that contains a toleration, so it can be scheduled onto a tainted node.

spec:
  tolerations:
  - key: "color"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
  containers:
  - name: myboot
    image: quay.io/rhdevelopers/myboot:v1
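
The Equal operator requires the toleration value to match the taint value exactly. A toleration can also use the Exists operator to tolerate any value of a given key; a minimal sketch (no value field is specified with Exists):

spec:
  tolerations:
  - key: "color"
    operator: "Exists"
    effect: "NoSchedule"
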
  • Terminal 1

kubectl apply -f apps/kubefiles/myboot-toleration.yaml

Before too long, we should see in our watch window that our pod gets scheduled and advances to the Running state:

  • Terminal 2

NAME                      READY   STATUS    AGE     NODE
myboot-84b457458b-mbf9r   1/1     Running   3m18s   devnation-m02

Now, although all nodes contain a taint, the Pod is scheduled and runs because we defined a toleration for the color=blue taint.

Clean Up

kubectl delete -f apps/kubefiles/myboot-toleration.yaml

NoExecute Taint

So far, you’ve seen the NoSchedule taint effect, which means that newly created Pods will not be scheduled onto a tainted node unless they have a matching toleration. Notice, however, that if we add this taint to a node that already has running/scheduled Pods, the taint will not terminate them.

Let’s change that by using the NoExecute effect.

First of all, let’s remove all previous taints.

  • Terminal 1

kubectl taint nodes --all=true color=blue:NoSchedule-

Then deploy another instance of myboot (with no Tolerations):

  • Terminal 1

kubectl apply -f apps/kubefiles/myboot-deployment.yml

We should see the following in the watch

  • Terminal 2

NAME                      READY   STATUS    AGE   NODE
myboot-7cbfbd9b89-wpddg   1/1     Running   47s   devnation-m02

Now let’s find the node the pod is running on:

  • Terminal 1

NODE=$(kubectl get pod -o jsonpath='{.items[0].spec.nodeName}') (1)
echo ${NODE}
1 the .items[0] is because we’re asking for all pods, but we know our list will contain only one element
"ip-10-0-146-109.eu-central-1.compute.internal"
  • Terminal 1

kubectl taint node ${NODE} color=blue:NoExecute

As soon as we do this, we should be able to watch this "rescheduling" occur in the Terminal 2 watch

  • Terminal 2

NAME                      READY   STATUS              AGE   NODE
myboot-7cbfbd9b89-5t24z   0/1     ContainerCreating   16s   devnation
myboot-7cbfbd9b89-wpddg   1/1     Terminating         65m   devnation-m02

If you have other untainted nodes available, the Pod is terminated and a replacement is deployed onto another node; if not, the replacement Pod will remain in Pending status.
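
By default, a Pod that does not tolerate a NoExecute taint is evicted immediately. A toleration for a NoExecute taint may also set tolerationSeconds, which keeps an already-running Pod bound to the node for that many seconds after the taint is added before evicting it. A minimal sketch (the 60-second value is just an illustrative choice):

spec:
  tolerations:
  - key: "color"
    operator: "Equal"
    value: "blue"
    effect: "NoExecute"
    tolerationSeconds: 60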

Clean Up

  • Terminal 1

kubectl delete -f apps/kubefiles/myboot-deployment.yml

And remove the NoExecute taint

  • Terminal 1

kubectl taint node ${NODE} color=blue:NoExecute-

Affinity & Anti-Affinity

There is another way of influencing where Pods are scheduled: Node/Pod Affinity and Anti-Affinity. You can create rules that not only restrict where Pods can run, but also favor where they should run.

In addition to creating affinities between Pods and Nodes, you can also create affinities between Pods. For example, you can decide that a group of Pods should always be deployed together on the same node(s), perhaps because there is significant network communication between them and you want to avoid external network calls, or because they share storage devices.

Node Affinity

Let’s deploy a new pod with a node affinity. Take a look at myboot-node-affinity.yml (relevant section shown below)

If you’re running this from within VSCode you can use CTRL+p (or CMD+p on Mac OSX) to quickly open myboot-node-affinity.yml

myboot-node-affinity.yml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: (1)
        nodeSelectorTerms:
        - matchExpressions:
          - key: color
            operator: In
            values:
            - blue (2)
  containers:
  - name: myboot
    image: quay.io/rhdevelopers/myboot:v1
1 This key indicates that what follows must be satisfied during scheduling but is not a factor once a pod is executing
2 The matchExpressions entry says this pod has affinity for any node whose color label has a value in the set blue
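
Besides In, matchExpressions support other operators such as NotIn, Exists, DoesNotExist, Gt and Lt. For example, a sketch of a rule that avoids any node labeled color=red (everything else staying the same) could look like:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: color
            operator: NotIn
            values:
            - red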

Now let’s deploy this

  • Terminal 1

kubectl apply -f apps/kubefiles/myboot-node-affinity.yml

And we’ll see in our watch window the pod in a pending state

  • Terminal 2

NAME                      READY   STATUS    AGE   NODE
myboot-546d4d9b45-7vgfc   0/1     Pending   6s    <none>

Let’s create a label on a node matching the affinity expression:

  • Terminal 1 - Minikube

Get a list of nodes:

kubectl get nodes
NAME            STATUS   ROLES                  AGE   VERSION
devnation       Ready    control-plane,master   3d    v1.21.2
devnation-m02   Ready    <none>                 42h   v1.21.2

Then pick a node in the list to label; here we use devnation-m02:

kubectl label nodes devnation-m02 color=blue (1)
1 Notice that this matches the affinity in the pod
node/devnation-m02 labeled

  • Terminal 1 - OpenShift

Get a list of nodes:

kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-10-0-136-107.eu-central-1.compute.internal   Ready    master   26h   v1.16.2
ip-10-0-140-186.eu-central-1.compute.internal   Ready    worker   26h   v1.16.2
ip-10-0-141-128.eu-central-1.compute.internal   Ready    worker   25h   v1.16.2
ip-10-0-146-109.eu-central-1.compute.internal   Ready    worker   25h   v1.16.2
ip-10-0-150-226.eu-central-1.compute.internal   Ready    worker   26h   v1.16.2
ip-10-0-155-122.eu-central-1.compute.internal   Ready    master   26h   v1.16.2
ip-10-0-162-206.eu-central-1.compute.internal   Ready    worker   26h   v1.16.2
ip-10-0-168-102.eu-central-1.compute.internal   Ready    master   26h   v1.16.2
ip-10-0-175-64.eu-central-1.compute.internal    Ready    worker   25h   v1.16.2

Then pick a node in the list to label; here we use one of the worker nodes:

kubectl label nodes ip-10-0-175-64.eu-central-1.compute.internal color=blue (1)
1 Notice that this matches the affinity in the pod
node/ip-10-0-175-64.eu-central-1.compute.internal labeled

And then in the watch window the output should change to:

  • Terminal 2

NAME                      READY   STATUS              AGE   NODE
myboot-546d4d9b45-7vgfc   0/1     ContainerCreating   15m   devnation-m02

Let’s delete the label from the node that the pod is running on

  • Terminal 1

First find the node the pod is running on

NODE=$(kubectl get pod -o jsonpath='{.items[0].spec.nodeName}') (1)
echo ${NODE}
1 the .items[0] is because we’re asking for all pods, but we know our list will contain only one element

and then remove the color label from it

kubectl label nodes ${NODE} color-

And notice that the watch output is unchanged: if it was running, the pod will continue to run.

  • Terminal 2

NAME                      READY   STATUS    AGE   NODE
myboot-546d4d9b45-7vgfc   1/1     Running   22m   devnation-m02

Since we used requiredDuringSchedulingIgnoredDuringExecution in the deployment spec for our pod, our affinity works the way the taints in the previous section did: the rule is enforced during the scheduling phase but ignored after that (i.e. once the pod is executing). That is why the Pod is not removed in our case.

This is an example of a hard rule:

Hard Rule

If the Kubernetes scheduler does not find any node with the required label, then the Pod remains in Pending state.

There is also a way to create a soft rule:

Soft Rule

The Kubernetes scheduler attempts to match the rule if it can. However, if it can’t, the Pod is scheduled to any available node.

Consider the example below:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: (1)
      - weight: 1
        preference:
          matchExpressions:
          - key: color
            operator: In
            values:
            - blue
1 You can see the use of the word preferred vs required.
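
The weight (an integer from 1 to 100) matters when several preferences are combined: for each candidate node, the scheduler adds up the weights of all the preferences that the node satisfies and favors the node with the highest total. A sketch combining two weighted preferences (the disktype=ssd label is an illustrative assumption, not something used elsewhere in this tutorial):

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: color
            operator: In
            values:
            - blue
      - weight: 20
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd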

Clean Up

kubectl delete -f apps/kubefiles/myboot-node-affinity.yml

Pod Affinity/Anti-Affinity

Let’s deploy a new pod with a Pod Affinity. See this relevant part of myboot-pod-affinity.yml.

If you’re running this from within VSCode you can use CTRL+p (or CMD+p on Mac OSX) to quickly open myboot-pod-affinity.yml

myboot-pod-affinity.yml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname (1)
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myboot (2)
  containers:
1 The node label key. If two nodes are labeled with this key and have identical values, the scheduler treats both nodes as being in the same topology. In this case, hostname is a label that is different for each node.
2 The affinity is with Pods labeled with app=myboot.
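
The topologyKey does not have to be kubernetes.io/hostname; any node label can define the topology. For example, using the well-known topology.kubernetes.io/zone label would only require the Pods to land in the same availability zone rather than on the same node. A minimal sketch, assuming your nodes carry that label:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myboot
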
  • Terminal 1

kubectl apply -f apps/kubefiles/myboot-pod-affinity.yml
NAME                      READY  STATUS   AGE    NODE
myboot2-7c5f46cbc9-hwm2v  0/1    Pending  5h38m  <none>

The myboot2 Pod is pending because the scheduler couldn’t find any Pod matching the affinity rule.

To address this, let’s deploy a myboot application labeled with app=myboot.

  • Terminal 1

kubectl apply -f apps/kubefiles/myboot-deployment.yml

And we’ll see that both start up and run on the same node:

  • Terminal 2

NAME                      READY  STATUS             AGE    NODE
myboot-7cbfbd9b89-267k6   0/1    ContainerCreating  5s     devnation-m02
myboot2-7c5f46cbc9-hwm2v  0/1    ContainerCreating  5h45m  devnation-m02

What you’ve just seen is a hard rule; soft (preferred) rules are available for Pod affinity and anti-affinity as well, for example:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - myboot

Anti-affinity is used to ensure that two Pods do NOT run together on the same node.

Let’s add another pod. Open myboot-pod-antiaffinity.yaml and focus on the following part

If you’re running this from within VSCode you can use CTRL+p (or CMD+p on Mac OSX) to quickly open myboot-pod-antiaffinity.yaml

myboot-pod-antiaffinity.yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myboot

This basically says that this pod should not be scheduled on any individual node (topologyKey: kubernetes.io/hostname) that has a pod with the app=myboot label.

Deploy a myboot3 with the above anti-affinity rule

  • Terminal 1

kubectl apply -f apps/kubefiles/myboot-pod-antiaffinity.yaml

And then notice what happens in the watch window

  • Terminal 2

NAME                      READY  STATUS             AGE    NODE
myboot-7cbfbd9b89-267k6   1/1    Running            10m    devnation-m02
myboot2-7c5f46cbc9-hwm2v  1/1    Running            5h56m  devnation-m02
myboot3-6f95c866f6-7kvdw  0/1    ContainerCreating  6s     devnation

As you can see, the myboot3 Pod is deployed on a different node than the myboot and myboot2 Pods.

Clean Up

  • Terminal 1

kubectl delete -f apps/kubefiles/myboot-pod-affinity.yml
kubectl delete -f apps/kubefiles/myboot-pod-antiaffinity.yaml
kubectl delete -f apps/kubefiles/myboot-deployment.yml