Chaos Experiment 1: Network latency

30 MINUTE PRACTICE

In production, it is more common to have slow services than broken services. Latency measures the delay between an action and a response. For this first experiment, we will test the following hypothesis:

A small network latency should not impact the Service Level Objective (SLO) of the Travel Service

Define the steady state

In the Chaos Engineering Dashboard, you can analyze the different metrics and define the Steady State for our chaos experiment. First, select the following variables on the dashboard:

Table 1. Dashboard Settings
Parameter | Value                                                | Description
Namespace | chaos-engineering%USER_ID%                           |
Service   | travels.chaos-engineering%USER_ID%.svc.cluster.local |

Grafana - Chaos Selection

From the Namespace section, you can tell that 99% of requests are successful and are served within 50 ms.

Grafana - Steady State Latency

We will use this SLO as the steady state for the experiment.
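As a rough illustration of what this steady state means in numbers, the check below computes a P99 latency and a success rate from raw samples and compares them with the 50 ms and 99% targets. It is a minimal sketch: the function names and the sample data are hypothetical, a response counts as successful when it is not a 5xx error (an assumption), and the dashboard computes its own values from the mesh metrics rather than with this code.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the observed latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def success_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses that are not 5xx errors (assumed definition)."""
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def in_steady_state(
    latencies_ms: Sequence[float],
    status_codes: Sequence[int],
    max_p99_ms: float = 50.0,
    min_success: float = 0.99,
) -> bool:
    """True when both the latency and the success-rate targets are met."""
    return (
        p99(latencies_ms) <= max_p99_ms
        and success_rate(status_codes) >= min_success
    )


# Hypothetical samples: 100 fast, successful requests.
healthy_latencies = [12.0] * 99 + [45.0]
healthy_codes = [200] * 100
print(in_steady_state(healthy_latencies, healthy_codes))

# Hypothetical samples: 5 out of 100 requests take one second.
delayed_latencies = [12.0] * 95 + [1000.0] * 5
print(in_steady_state(delayed_latencies, healthy_codes))
```

The second sample set already hints at the experiment that follows: the success rate is unchanged, but the P99 latency is no longer within the target.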

Click on 'Service Overview' > 'Edit'.

Grafana - Edit Service Overview

Then, click on the 'Visualization Settings' icon in the left-hand sidebar, scroll down to find the 'P99 Latency (Value #D)' rule and enter the following information for Thresholds:

Table 2. P99 Latency Thresholds Settings
Parameter  | Value                                                     | Description
Thresholds | 50,100                                                    |
Color Mode | Cell                                                      |
Colors     | Green/Yellow/Red (click on the 'invert' button if needed) |

Grafana - P99 Latency Thresholds

Scroll down again to find the 'Success Rate (Value #E)' rule and enter the following information for Thresholds:

Table 3. Success Rate Thresholds Settings
Parameter  | Value                                                     | Description
Thresholds | 0.95,0.99                                                 |
Color Mode | Cell                                                      |
Colors     | Red/Yellow/Green (click on the 'invert' button if needed) |

Grafana - Success Rate Thresholds
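The two threshold rules can be read as a mapping from a value to a color band. The sketch below illustrates that mapping for both rules. The function name is hypothetical, and the handling of a value that sits exactly on a threshold (it goes to the upper band here) is an assumption, not a statement about how Grafana resolves that case.

```python
from bisect import bisect_right
from typing import Sequence


def threshold_color(
    value: float, thresholds: Sequence[float], colors: Sequence[str]
) -> str:
    """Return the color of the band that `value` falls into.

    `thresholds` must be sorted in ascending order, and there must be
    exactly one more color than thresholds. A value equal to a threshold
    is placed in the upper band (assumption).
    """
    if len(colors) != len(thresholds) + 1:
        raise ValueError("expected one more color than thresholds")
    return colors[bisect_right(thresholds, value)]


# P99 latency rule: lower is better.
latency_thresholds = [50, 100]
latency_colors = ["green", "yellow", "red"]

# Success rate rule: higher is better, so the color order is inverted.
success_thresholds = [0.95, 0.99]
success_colors = ["red", "yellow", "green"]

print(threshold_color(30, latency_thresholds, latency_colors))
print(threshold_color(2000, latency_thresholds, latency_colors))
print(threshold_color(1.0, success_thresholds, success_colors))
```

This is why the two tables list the colors in opposite order: a high latency is bad, while a high success rate is good.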

Once done, you should have the following outcome (all green).

Grafana - Service Overview Configured

Click on the 'Disk' icon to save and go back to the Dashboard.

Run the Chaos experiment

In the Kiali Console, from the 'Graph' view, right-click on the 'discounts' service (triangle symbol) and select 'Details'.

Kiali - Right Click Service

You will be redirected to the Service Details page.

Click on 'Actions' > 'Fault Injection'.

Kiali - Add Fault Injection

Add HTTP Delay by entering the following settings:

Table 4. HTTP Delay Settings
Parameter        | Value   | Description
Add HTTP Delay   | Enabled |
Delay Percentage | 5       |
Fixed Delay      | 1s      |

Kiali - Configure Latency

Click on the 'Update' button.

5% of the traffic to the 'discounts' service now has a 1-second delay.
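Kiali applies this fault through the service mesh configuration. The lab does not show the resource it generates, so the sketch below only illustrates what a roughly equivalent Istio VirtualService with an HTTP delay fault could look like, built as a Python dictionary and printed as JSON. The resource name, the API version, and the 'discounts' host name (derived by analogy with the 'travels' host above) are assumptions; the resource Kiali actually creates may contain additional fields.

```python
import json


def delay_fault_virtual_service(
    service: str, namespace: str, percent: float, fixed_delay: str
) -> dict:
    """Build a VirtualService that delays a percentage of HTTP requests."""
    host = f"{service}.{namespace}.svc.cluster.local"  # assumed host format
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Delay `percent` % of the requests by `fixed_delay`.
                    "fault": {
                        "delay": {
                            "percentage": {"value": percent},
                            "fixedDelay": fixed_delay,
                        }
                    },
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


# Same values as in Table 4; the namespace keeps the lab placeholder.
vs = delay_fault_virtual_service(
    "discounts", "chaos-engineering%USER_ID%", 5, "1s"
)
print(json.dumps(vs, indent=2))
```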

Analyze the Chaos outcome

Now let's see the impact on the application.

In the Chaos Engineering Dashboard, you can see the result of the chaos experiment.

Grafana - Latency Fault Overview

From the 'Service Overview' panel or the 'Request Duration' panel for the 'travels' service, you can draw the following conclusions about our hypothesis:

  • there is no impact on the Success Rate of the overall requests (100%)

  • there is a huge impact on the performance of the application.

Indeed, just 1 second of delay on 5% of the traffic to one dependent service propagates as ~2 seconds of latency across the entire system.

Grafana - Latency Fault Details
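One way to see why a 5% fault is enough to move the P99 latency: the P99 is set by the slowest 1% of requests, and the fault touches more than 1% of them. The sketch below estimates the fraction of end-to-end requests that hit at least one delayed call. It assumes that each request to 'travels' triggers four independent calls that reach 'discounts' (one through each of 'cars', 'flights', 'hotels' and 'insurances'); that call pattern is an assumption made for illustration, and the sketch does not try to reproduce the ~2 seconds observed on the dashboard.

```python
def fraction_affected(delay_fraction: float, dependent_calls: int) -> float:
    """Probability that at least one of the dependent calls is delayed,
    assuming each call is delayed independently."""
    return 1.0 - (1.0 - delay_fraction) ** dependent_calls


one_call = fraction_affected(0.05, 1)
four_calls = fraction_affected(0.05, 4)  # assumed fan-out of four calls

print(f"one dependent call:   {one_call:.1%} of requests delayed")
print(f"four dependent calls: {four_calls:.1%} of requests delayed")
```

Under these assumptions, the share of slow requests is far above 1% in both cases, so the P99 latency necessarily includes the injected delay.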

In conclusion, you can tell that the application is not resilient to a small network latency. To reduce or fix this problem, you could configure autoscaling or implement a caching mechanism across the different services of the application.

Improve the Resiliency

To contain this latency propagation, you are going to apply the Retry pattern to all services calling the delayed 'discounts' service.

Retries can improve the application's resiliency against transient problems, such as a temporarily overloaded service or network, like the one we simulate in our experiment.

Instead of failing immediately or waiting too long, we retry up to N times to get the desired output within the desired response time before considering the request failed.
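The behaviour described above can be sketched with a small simulation: each try is abandoned once it exceeds the per-try timeout, and the next try starts right away. This is a simplified model and not the mesh's implementation. The function name is hypothetical, `attempts` is treated as the total number of tries (an assumption), and any backoff between tries is ignored.

```python
from itertools import islice
from typing import Iterable, Tuple


def call_with_retries(
    try_latencies_ms: Iterable[float],
    attempts: int,
    per_try_timeout_ms: float,
) -> Tuple[bool, float]:
    """Simulate a call retried with a per-try timeout.

    `try_latencies_ms` yields the latency each successive try would have
    if it were allowed to finish. Returns (succeeded, total_elapsed_ms).
    """
    elapsed_ms = 0.0
    for latency_ms in islice(try_latencies_ms, attempts):
        if latency_ms <= per_try_timeout_ms:
            # This try answered in time: stop retrying.
            return True, elapsed_ms + latency_ms
        # This try timed out: only the timeout is spent, then retry.
        elapsed_ms += per_try_timeout_ms
    return False, elapsed_ms


# First try hits the 1 s delay, second try answers in 5 ms.
print(call_with_retries([1000.0, 5.0], attempts=5, per_try_timeout_ms=20.0))

# Every try hits the delay: the call fails after 5 x 20 ms.
print(call_with_retries([1000.0] * 5, attempts=5, per_try_timeout_ms=20.0))
```

In this model, a request that first hits the 1-second delay costs 20 ms plus the time of a normal try, instead of a full second.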

Configure the Retry pattern for the following services:

  • cars

  • flights

  • hotels

  • insurances

In the Kiali Console, from the 'Services' view, click on the 'cars' service > 'Actions' > 'Request Timeouts'

Add HTTP Retry by entering the following settings:

Table 5. HTTP Retry Settings
Parameter       | Value   | Description
Add HTTP Retry  | Enabled |
Attempts        | 5       |
Per Try Timeout | 20ms    |

Kiali - Configure Latency Retry

Click on the 'Update' button.

In the Kiali Console, from the 'Services' view, click on the 'flights' service > 'Actions' > 'Request Timeouts'

Add HTTP Retry by entering the following settings:

Table 6. HTTP Retry Settings
Parameter       | Value   | Description
Add HTTP Retry  | Enabled |
Attempts        | 5       |
Per Try Timeout | 20ms    |

Kiali - Configure Latency Retry

Click on the 'Update' button.

In the Kiali Console, from the 'Services' view, click on the 'hotels' service > 'Actions' > 'Request Timeouts'

Add HTTP Retry by entering the following settings:

Table 7. HTTP Retry Settings
Parameter       | Value   | Description
Add HTTP Retry  | Enabled |
Attempts        | 5       |
Per Try Timeout | 20ms    |

Kiali - Configure Latency Retry

Click on the 'Update' button.

In the Kiali Console, from the 'Services' view, click on the 'insurances' service > 'Actions' > 'Request Timeouts'

Add HTTP Retry by entering the following settings:

Table 8. HTTP Retry Settings
Parameter       | Value   | Description
Add HTTP Retry  | Enabled |
Attempts        | 5       |
Per Try Timeout | 20ms    |

Kiali - Configure Latency Retry

Click on the 'Update' button.
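Since the same settings are applied to four services, it can help to see them as one configuration repeated per service. The sketch below builds, for each service, an Istio VirtualService with an HTTP retry policy using the `attempts` and `perTryTimeout` fields. As before, the resource names, the API version and the host format are assumptions, and the resources Kiali actually generates may include extra fields.

```python
import json

# Services configured in the steps above.
SERVICES = ["cars", "flights", "hotels", "insurances"]


def retry_virtual_service(
    service: str, namespace: str, attempts: int, per_try_timeout: str
) -> dict:
    """Build a VirtualService that retries HTTP requests to `service`."""
    host = f"{service}.{namespace}.svc.cluster.local"  # assumed host format
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "route": [{"destination": {"host": host}}],
                    "retries": {
                        "attempts": attempts,
                        "perTryTimeout": per_try_timeout,
                    },
                }
            ],
        },
    }


# Same values as in Tables 5 to 8; the namespace keeps the lab placeholder.
resources = [
    retry_virtual_service(name, "chaos-engineering%USER_ID%", 5, "20ms")
    for name in SERVICES
]
for resource in resources:
    print(json.dumps(resource, indent=2))
```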

Validate the Improvement

Back in the Chaos Engineering Dashboard, you can see that the Retry pattern contains the latency propagation: response times generally stay under 100 ms, even though the 'discounts' service still has the 1-second delay.

Grafana - Latency Contained Overview

You can see more detail on the 'Request Duration' panel for the 'travels' service

Grafana - Latency Contained Details
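A short calculation suggests why the retries are so effective here. If each try independently has a 5% chance of running into the delay, the chance that every try of a request is delayed shrinks geometrically with the number of tries. The sketch below assumes independent tries and treats the 5 attempts as 5 tries in total; both are simplifying assumptions.

```python
def fraction_exhausting_retries(delay_fraction: float, tries: int) -> float:
    """Probability that all tries of a request hit the delay,
    assuming each try is delayed independently."""
    return delay_fraction ** tries


without_retry = fraction_exhausting_retries(0.05, 1)
with_retries = fraction_exhausting_retries(0.05, 5)

print(f"slow requests without retries:   {without_retry:.4%}")
print(f"requests exhausting all 5 tries: {with_retries:.6%}")
```

Under these assumptions, the share of requests that still cannot get a fast answer drops well below the 1% that determines the P99 latency.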

Roll back the Chaos experiment

With Argo CD, nothing is simpler than rolling back all the configuration changes you made during this lab.

In Argo CD, click on 'Sync > Synchronize'.

Argo CD - Sync Application

Finally, in the Chaos Engineering Dashboard, check that the application is back in its steady state.

Grafana - Steady State