Chaos Experiment 2: Unavailable Service
10 MINUTE PRACTICE
Previously, you saw how large an impact a small network latency problem on one non-critical service has on the application. Let’s see how the application behaves when this same service is unavailable. For this second experiment, we will test the following hypothesis:
An unavailable 'discounts' service should NOT impact the Service Level Objective (SLO) of the Travel Service
Define the steady state
As in the previous lab, we will keep the same steady state:
99% of requests are successful and served within 50 ms
This lets us see how a different type of failure of the same service affects the behaviour of the application.
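The steady state above can be expressed as a small check over request samples. This is only a sketch: the `RequestSample` type and the `meets_steady_state` function are hypothetical names, and treating any status code below 400 as "successful" is an assumption, not something the dashboard is known to do.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # response time in milliseconds


def meets_steady_state(
    samples: Iterable[RequestSample],
    target_ratio: float = 0.99,    # "99% of requests ..."
    max_latency_ms: float = 50.0,  # "... served within 50 ms"
) -> bool:
    """Return True when enough requests are both successful and fast."""
    samples = list(samples)
    if not samples:
        # No traffic means the steady state cannot be confirmed.
        return False
    good = sum(
        1
        for s in samples
        # Assumption: "successful" means a status code below 400.
        if s.status < 400 and s.latency_ms <= max_latency_ms
    )
    return good / len(samples) >= target_ratio
```

For example, 995 fast successful requests out of 1000 satisfy the steady state, while 90 out of 100 do not. A request that succeeds but takes longer than 50 ms counts against the objective just like a failed one.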
Run the Chaos experiment
In the Kiali Console, from the 'Graph' view, right-click on the 'discounts' service and select 'Details'
You will be redirected to the Service Details page.
Click on 'Actions' > 'Fault Injection'
Add HTTP Abort by entering the following settings:
Parameter | Value | Description |
---|---|---|
Add HTTP Delay | Disabled | |
Add HTTP Abort | Enabled | |
Abort Percentage | 10 | |
HTTP Status Code | 503 | |
Click on the 'Update' button.
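For readers curious about what these settings amount to, the sketch below builds the kind of Istio `VirtualService` that an HTTP abort fault corresponds to. It assumes the Kiali wizard is backed by Istio traffic-management resources; the function name, the resource name `discounts-abort`, and the host value are hypothetical, and the resource Kiali actually generates may differ in its names and extra fields.

```python
import json


def build_abort_virtual_service(host: str, percentage: float, http_status: int = 503) -> dict:
    """Build a VirtualService dict that aborts a share of HTTP requests.

    `host` is the service host to target (assumed here, not taken from the lab).
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        # Hypothetical resource name, chosen for this example only.
        "metadata": {"name": "discounts-abort"},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Istio HTTP fault injection: abort a percentage of requests.
                    "fault": {
                        "abort": {
                            "percentage": {"value": percentage},
                            "httpStatus": http_status,
                        }
                    },
                    # Requests that are not aborted are still routed to the service.
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Mirrors the table above: abort 10% of requests with HTTP 503.
    print(json.dumps(build_abort_virtual_service("discounts", 10), indent=2))
```

Changing the `percentage` argument to 100 would describe the fully unavailable case used later in this experiment.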
10% of the traffic to the 'discounts' service now fails with a 503 HTTP status code. Now let’s see the impact on the application.
Analyze the Chaos outcome
In the Chaos Engineering Dashboard, you can see the result of the chaos experiment.
All services, except for the 'discounts' service, perform very well without any errors (100% success).
You can increase the percentage of injected errors until the 'discounts' service is completely unavailable.
In the Kiali Console, update the HTTP Abort settings of the 'discounts' service as follows:
Parameter | Value | Description |
---|---|---|
Add HTTP Delay | Disabled | |
Add HTTP Abort | Enabled | |
Abort Percentage | 100 | |
HTTP Status Code | 503 | |
Contrary to the outcome of the latency experiment, you can tell that the application is resilient when the 'discounts' service is completely down (unavailable). So your hypothesis is validated:
An unavailable 'discounts' service does NOT impact the Service Level Objective (SLO) of the Travel Service
Rollback the Chaos experiment
In Argo CD, click on 'Sync > Synchronize'.
Finally, in the Chaos Engineering Dashboard, check that the application is back in the steady state.