Building Confidence in the System’s Ability to Respond to Turbulent and Unexpected Conditions

Published in The Startup, Dec 11, 2020

All systems are subject to failure. Chaos Engineering is a way to build confidence that your system will behave predictably when failures occur. Your team must act proactively and accept error as part of the learning process. Chaos Engineering helps you understand the system’s behavior under failure and limit the impact of failures before they grow. Several tools support these experiments, including Chaos Monkey, Gremlin, and Chaos Toolkit. In this article, we show how one of them, Chaos Toolkit, works.

Chaos Engineering Sketch - Retrieved from Octopus

Introduction

All systems are subject to failure. The ideal scenario is to resolve these failures before they cause damage to the user. Chaos Engineering proposes a method to meet this need. The name alludes to chaos theory, which holds that small changes in the initial conditions of a dynamic system can cause large, irregular effects in the long run. The unpredictability of real-world events is a great challenge for large-scale applications such as those run by Google, Netflix, and Amazon. Chaos Engineering covers not only the behavior of systems but also the consequences of that behavior for the business.

The main objective is to test the resilience of systems by conducting experiments that inject faults, simulating real conditions. The results of these experiments can lead to the implementation of solutions. Among its benefits, three stand out: it is a proactive approach, it leads to concrete improvements in the system, and it reveals how failures propagate between the system’s components. Chaos Engineering requires control and observation: you need to understand how your system behaves and reacts to certain stimuli. In this text, we present some tools that enable us to implement Chaos Engineering.

Principles

This concept was first put into practice at large companies such as Netflix. While migrating to the cloud, Netflix engineers saw the need for adequate resilience testing, so they built a tool that would cause breakdowns in their production environment. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered inevitable. This made resilience a necessity rather than an option.

In complex and large-scale systems, it is hard to know in advance how a failure in one component will influence the behavior of the others. Chaos Engineering provides a better understanding of your system. Because the experiments require monitoring and alerting mechanisms, they also improve the observability of your system. If your monitoring and alerts are not triggered by a Chaos Engineering experiment, they will probably fail in real situations too.

We do not know how a system will behave in specific real-world situations. By simulating such scenarios, Chaos Engineering helps you understand the system’s behavior and limit the impact of failures. Because it is a proactive approach, the experiments lead not only to the identification of faults but also to their removal.

How to Implement

To implement Chaos Engineering, the system must already have a minimum of resilience. The purpose of the approach is not to interrupt the system’s operation, nor to test a hypothesis that is likely to harm the end-user experience. Excellent monitoring and alerting are also essential. Applying Chaos Engineering comes down to the following process:

Defining Current State: You should define a set of measurements and tolerances in order to know if your system is working as expected.
Hypothesis: Here you state how you expect your system to behave in a specific situation.
Run Experiment: Execute your experiment and observe your system for as long as necessary. Document everything in an understandable way to allow precise analysis after the experiment.
Learning: In this step, you analyze your findings against your hypothesis.
Fixes: If you have found any deviations, you should fix them immediately. After that, go back to the hypothesis step and rerun the experiment.
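
The process above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not how any particular tool implements it: the Measurement class, the run_experiment function, and the toy error_rate metric are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Measurement:
    """One steady-state metric with the tolerance that counts as normal."""
    name: str
    read: Callable[[], float]  # hypothetical callable that samples the metric
    low: float
    high: float

    def within_tolerance(self) -> bool:
        return self.low <= self.read() <= self.high


def run_experiment(steady_state, inject_fault, rollback):
    """Check the steady state, inject a fault, then check the steady state again."""
    # Current state and hypothesis: every measurement stays within tolerance.
    before = {m.name: m.within_tolerance() for m in steady_state}
    if not all(before.values()):
        # Do not inject faults into a system that is already unhealthy.
        return {"status": "aborted", "before": before, "after": {}}

    # Run the experiment, always rolling back afterwards.
    try:
        inject_fault()
        after = {m.name: m.within_tolerance() for m in steady_state}
    finally:
        rollback()

    # Learning: compare the observations with the hypothesis.
    deviated = [name for name, ok in after.items() if not ok]
    status = "deviated" if deviated else "passed"
    return {"status": status, "before": before, "after": after}


# Toy system: an error rate that jumps when a simulated fault is injected.
state = {"error_rate": 0.01}
result = run_experiment(
    steady_state=[Measurement("error_rate", lambda: state["error_rate"], 0.0, 0.05)],
    inject_fault=lambda: state.update(error_rate=0.20),
    rollback=lambda: state.update(error_rate=0.01),
)
print(result["status"])
```

In this toy run the fault pushes the error rate outside its tolerance, so the result is a deviation, which is the signal to move on to the Fixes step and then rerun the experiment.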

Chaos Engineering Flow - Retrieved from Oracle Developers Medium

Most Used Software

Chaos Monkey is an open-source resiliency tool created by Netflix that helps applications tolerate random instance failures. It randomly terminates virtual machine instances and containers that run inside your production environment. Exposing engineers to failures regularly incentivizes them to build resilient services. To use Chaos Monkey, you must manage your apps with Spinnaker, the continuous delivery platform used at Netflix. Chaos Monkey works with any backend that Spinnaker supports: AWS, Google Compute Engine, Azure, Kubernetes, and Cloud Foundry.
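
To make the idea of random termination concrete, here is a minimal in-memory sketch. It is not Chaos Monkey’s code or API: the pick_victim and terminate functions, the cluster dictionary, and the instance ids are hypothetical, and a real tool would call the cloud provider instead of editing a list.

```python
import random


def pick_victim(instances, rng, probability=1.0):
    """Return one instance to terminate, or None if this round is skipped."""
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(instances)


def terminate(cluster, group, instance_id):
    """Remove the instance from the in-memory cluster (stand-in for a cloud API call)."""
    cluster[group].remove(instance_id)


# Hypothetical service group with three instances.
cluster = {"checkout": ["i-001", "i-002", "i-003"]}
rng = random.Random(42)  # seeded so the run is repeatable

victim = pick_victim(cluster["checkout"], rng)
if victim is not None:
    terminate(cluster, "checkout", victim)
print(victim, cluster["checkout"])
```

The probability parameter is one way to control how often a group is hit; a service that survives the loss of any single instance should show no user-visible effect from a run like this.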

Chaos Monkey Logo - Retrieved from Chaos Monkey

Gremlin provides you with a framework to simulate real interruptions with a growing library of attacks. Its interface is designed to facilitate this kind of targeted experimentation. You can select the parts of the system you want to test, and you can see graphically which parts are being tested. If things get out of control, there is a kill switch to stop the tests. This makes it easier to find weaknesses in your system before they cause problems for your customers.

Chaos Toolkit is a project whose mission is to provide a free, open, and community-driven toolkit and API to all the various forms of Chaos Engineering tools that the community needs. The main objective of this toolkit is to provide a full chaos engineering implementation that simplifies the adoption of chaos engineering by providing an easy starting point for applying the discipline. Chaos Toolkit also defines an open API with the community so that any chaos experiment can be executed consistently using integrations with the many commercial, private, and open source chaos implementations that are emerging.

Chaos Toolkit Workflow - Retrieved from Chaos Toolkit

Use Case

Although the approach fits naturally with serverless, microservices, continuous integration, and other DevOps practices and technologies, Chaos Engineering can be applied in any environment, including on-premises monoliths. Because the Chaos Toolkit is open source and has several integrations, we will use it to simulate an experiment.

This experiment shows the impact of degraded service capabilities on our system state. We want to see whether Kubernetes will indeed kill our unhealthy pod and then start a new one within a given amount of time. Our experiment runs a web application with a single instance, which Kubernetes monitors through a /health endpoint. At some point, the /purchase/confirm endpoint is called, which triggers a service failure. We use that to check whether Kubernetes picks up the failure and restarts the pod as per the deployment’s liveness probe.
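
How quickly Kubernetes reacts depends on the liveness probe settings. The sketch below simulates the rule that a container is restarted after a number of consecutive failed health checks. The function name is our own, and the values 10 and 3 are illustrative assumptions standing in for the periodSeconds and failureThreshold probe fields, since the deployment file is not shown here. The simulation also ignores the initial delay and probe timeouts.

```python
def seconds_until_restart(health_checks, period_seconds=10, failure_threshold=3):
    """Return the time at which a restart would be triggered, or None.

    health_checks holds one probe result per period (True = healthy).
    A restart is triggered after failure_threshold consecutive failures.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(health_checks):
        # A healthy check resets the counter, a failed one increments it.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            # Assumes the first check runs one full period after start.
            return (index + 1) * period_seconds
    return None


# Healthy for two checks, then /health starts failing.
checks = [True, True, False, False, False, False]
print(seconds_until_restart(checks))
```

With these assumed settings, the restart can only happen a few probe periods after the failure begins, which is why the experiment has to wait before it checks for a restarted pod.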

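To run this, you will need to install the Chaos Toolkit CLI and its Kubernetes extension, and have access to a Kubernetes cluster (here, a local minikube cluster). Note that the experiment below also uses the chaosprometheus module and the vegeta command, which are not installed by these commands: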

$ pip install -U chaostoolkit
$ pip install -U chaostoolkit-kubernetes
$ minikube delete
$ minikube start --cpus=2 --memory=2048

Before you can run the experiment itself, you must deploy the application:

$ kubectl create -f webapp-deployment.json -f webapp-service.json

The Chaos Toolkit aims to give you the simplest experience for writing and running your own Chaos Engineering experiments. The key concepts of the Chaos Toolkit are Experiments, Hypotheses, and the experiment’s Method. The Method contains a combination of Probes and Actions. All of these concepts are expressed in an experiment definition. The following experiment is an example from the Chaos Toolkit Samples project:

{
    "version": "1.0.0",
    "title": "Pod should be automatically killed and restarted when unhealthy",
    "description": "Can we trust Kubernetes to restart our microservice when it detects it is unhealthy?",
    "tags": [
        "microservice",
        "kubernetes",
        "python"
    ],
    "configuration": {
        "webapp_service_url":  {
            "type": "env",
            "key": "WEBAPP_SERVICE_ADDR"
        },
        "prometheus_base_url": {
            "type": "env",
            "key": "PROMETHEUS_ADDR"
        }
    },
    "steady-state-hypothesis": {
        "title": "Services are all available and healthy",
        "probes": [
            {
                "type": "probe",
                "name": "all-services-are-healthy",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.probes",
                    "func": "all_microservices_healthy"
                }
            },
            {
                "type": "probe",
                "name": "webapp-is-available",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.probes",
                    "func": "microservice_available_and_healthy",
                    "arguments": {
                        "name": "webapp-app"
                    }
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "talk-to-webapp",
            "background": true,
            "provider": {
                "type": "process",
                "path": "vegeta",
                "timeout": 63,
                "arguments": {
                    "attack": "",
                    "-duration": "60s",
                    "-connections": "1",
                    "-rate": "1",
                    "-output": "report.bin",
                    "-targets": "urls.txt"
                }
            }
        },
        {
            "type": "action",
            "name": "confirm-purchase",
            "provider": {
                "type": "http",
                "url": "${webapp_service_url}/purchase/confirm"
            },
            "pauses": {
                "before": 15
            }
        },
        {
            "type": "probe",
            "name": "collect-how-many-times-our-service-container-restarted-in-the-last-minute",
            "provider": {
                "type": "python",
                "module": "chaosprometheus.probes",
                "func": "query_interval",
                "arguments": {
                    "query": "kube_pod_container_status_restarts{container=\"webapp-app\"}",
                    "start": "2 minutes ago",
                    "end": "now"
                }
            },
            "pauses": {
                "before": 45
            }
        },
        {
            "type": "probe",
            "name": "read-webapp-logs-for-the-pod-that-was-killed",
            "provider": {
                "type": "python",
                "module": "chaosk8s.probes",
                "func": "read_microservices_logs",
                "arguments": {
                    "name": "webapp-app",
                    "from_previous": true
                }
            }
        },
        {
            "type": "probe",
            "name": "read-webapp-logs-for-pod-that-was-started",
            "provider": {
                "type": "python",
                "module": "chaosk8s.probes",
                "func": "read_microservices_logs",
                "arguments": {
                    "name": "webapp-app"
                }
            }
        },
        {
            "type": "probe",
            "name": "collect-status-code-from-our-webapp-in-the-last-2-minutes",
            "provider": {
                "type": "python",
                "module": "chaosprometheus.probes",
                "func": "query_interval",
                "arguments": {
                    "query": "flask_http_request_duration_seconds_count{path=\"/\"}",
                    "start": "2 minutes ago",
                    "end": "now"
                }
            },
            "pauses": {
                "before": 10
            }
        },
        {
            "type": "probe",
            "name": "plot-request-latency-throughout-experiment",
            "provider": {
                "type": "process",
                "path": "vegeta",
                "timeout": 5,
                "arguments": {
                    "report": "",
                    "-inputs": "report.bin",
                    "-reporter": "plot",
                    "-output": "latency.html"
                }
            },
            "pauses": {
                "before": 5
            }
        }
    ],
    "rollbacks": [
    ]
}
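
To see how the pauses in the method shape the run, the following sketch sums them to find the earliest moment each step can start. It uses an abridged copy of the method with shortened step names (our abbreviation, not the names in the experiment), and the earliest_start_times helper is hypothetical. It ignores the time each activity takes to execute, so real start times will be later.

```python
import json

# Abridged copy of the experiment's method: only the fields used here,
# with step names shortened for readability.
METHOD_JSON = """
[
  {"type": "action", "name": "talk-to-webapp", "background": true},
  {"type": "action", "name": "confirm-purchase", "pauses": {"before": 15}},
  {"type": "probe", "name": "collect-restarts", "pauses": {"before": 45}},
  {"type": "probe", "name": "read-logs-killed-pod"},
  {"type": "probe", "name": "read-logs-started-pod"},
  {"type": "probe", "name": "collect-status-codes", "pauses": {"before": 10}},
  {"type": "probe", "name": "plot-latency", "pauses": {"before": 5}}
]
"""


def earliest_start_times(method):
    """Return (name, seconds) pairs: the earliest each step can start.

    Only the declared pauses are summed; activity run time is ignored.
    """
    elapsed = 0
    timeline = []
    for step in method:
        pauses = step.get("pauses", {})
        elapsed += pauses.get("before", 0)
        timeline.append((step["name"], elapsed))
        elapsed += pauses.get("after", 0)
    return timeline


timeline = earliest_start_times(json.loads(METHOD_JSON))
for name, seconds in timeline:
    print(f"{seconds:>3}s  {name}")
```

Read this way, the failure is triggered 15 seconds into the background traffic, and the restart count is collected no earlier than 45 seconds after that.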

Now you can run the experiment with the following command:

$ chaos run experiment.json

This experiment first looks at the system’s steady state. It expects that the web application is available and healthy. Then, in the background, it starts talking to the application on its / endpoint for one minute. After a little while, we call the /purchase/confirm endpoint, which fails. After that, we expect the pod to be restarted once within the next 30 seconds. We also look at the impact on the HTTP responses during that time. Once the experiment is finished, you may generate a report as per the documentation:

$ chaos report --export-format=pdf chaos-report.json chaos-report.pdf

This command is only available if you installed the chaostoolkit-reporting library.

Conclusion

Chaos Engineering is a way to build confidence that your system will behave predictably when failures occur. For this, your team must act proactively and accept error as part of the learning process. Several tools support these experiments, including Chaos Monkey, Gremlin, and Chaos Toolkit. Because the Chaos Toolkit is open source and has several integrations, we used it to simulate an experiment. The key concepts of the Chaos Toolkit are Experiments, Hypotheses, and the experiment’s Method. The Method contains a combination of Probes and Actions. We used those concepts to create an experiment that simulates real conditions.