Chaos Experiment 2: Unavailable Service

10 MINUTE PRACTICE

Previously, you saw how large an impact a small network latency problem on one non-critical service can have on the application. Let’s see how the application behaves when this same service is unavailable. For this second experiment, we will test the following hypothesis:

Unavailable 'discounts' service should NOT impact the Service Level Objective (SLO) of the Travel Service

Define the steady state

As in the previous lab, we will keep the same steady state:

99% of requests are successful and served within 50 ms
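To make the steady state concrete, here is a minimal sketch of how the condition could be evaluated over a batch of request records. The `Request` type and the `meets_steady_state` function are hypothetical names introduced for illustration, and treating any non-5xx response as "successful" is an assumption; the dashboard may compute this differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    status: int         # HTTP status code returned to the caller
    duration_ms: float  # response time in milliseconds


def meets_steady_state(
    requests: Sequence[Request],
    latency_ms: float = 50.0,
    target: float = 0.99,
) -> bool:
    """Return True if at least `target` of the requests succeeded within `latency_ms`.

    Assumption: a request counts as successful when its status is below 500.
    """
    if not requests:
        return False  # no traffic, so the steady state cannot be confirmed
    good = sum(
        1 for r in requests
        if r.status < 500 and r.duration_ms <= latency_ms
    )
    return good / len(requests) >= target
```

For example, a batch of 99 fast successful requests and one 503 still meets the steady state, while a batch with two failures out of 100 does not.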

Grafana - Service Overview Configured

We will be able to see how a different type of failure of the same service could impact the behaviour of our application.

Run the Chaos experiment

In the Kiali Console, from the 'Graph' view, right-click on the 'discounts' service and select 'Details'

Kiali - Right Click Service

You will be redirected to the Service Details page.

Click on 'Actions' > 'Fault Injection'.

Kiali - Add Fault Injection

Add HTTP Abort by entering the following settings:

Table 1. HTTP Abort Settings

Parameter          Value
Add HTTP Delay     Disabled
Add HTTP Abort     Enabled
Abort Percentage   10
HTTP Status Code   503

Kiali - Configure Error

Click on the 'Update' button.

10% of the traffic to the 'discounts' service now fails with a 503 HTTP code. Now let’s see the impact on the application.
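For reference, the fault configured in Kiali can be pictured as an Istio fault-injection rule. The sketch below builds such a rule as a plain Python dictionary. It assumes the mesh is Istio and that the rule takes the form of a `VirtualService` with an `abort` fault; the function name `abort_fault_virtual_service` and the resource name are hypothetical, and the manifest Kiali actually generates may differ in its details.

```python
import json


def abort_fault_virtual_service(host: str, percentage: float, http_status: int = 503) -> dict:
    """Build an Istio VirtualService manifest that aborts a share of HTTP requests.

    Assumption: the service is reachable in the mesh under the name `host`.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},  # hypothetical resource name
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Abort the given percentage of requests with the chosen status code
                    "fault": {
                        "abort": {
                            "percentage": {"value": percentage},
                            "httpStatus": http_status,
                        }
                    },
                    # Requests that are not aborted are routed to the service as usual
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


manifest = abort_fault_virtual_service("discounts", 10)
print(json.dumps(manifest, indent=2))
```

Calling the same function with a percentage of 100 would describe a service that is completely unavailable to its callers.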

Analyze the Chaos outcome

In the Chaos Engineering Dashboard, you can see the result of the chaos experiment.

Grafana - Error Fault Overview

All services, except for the 'discounts' service, perform very well without any errors (100% success).

You can increase the percentage of error injection until the 'discounts' service is completely unavailable.

In the Kiali Console, update the HTTP Abort settings of the 'discounts' service as follows:

Table 2. HTTP Abort Settings

Parameter          Value
Add HTTP Delay     Disabled
Add HTTP Abort     Enabled
Abort Percentage   100
HTTP Status Code   503

Grafana - Error Fault Overview

Contrary to the outcome of the latency experiment, you can tell that the application is resilient when the 'discounts' service is completely down (unavailable). So your hypothesis is validated:

An unavailable 'discounts' service does NOT impact the Service Level Objective (SLO) of the Travel Service

Roll back the Chaos experiment

In Argo CD, click on 'Sync > Synchronize'.

Argo CD - Sync Application

Finally, in the Chaos Engineering Dashboard, check that the application is back in the steady state.

Grafana - Steady State