Starting a chaos engineering project from scratch (Chaos Toolkit)

Keywords: Back-end

Getting started with chaos engineering using ChaosToolkit

Chaos engineering is a discipline for testing the resilience of complex systems. By deliberately introducing all kinds of faults into an environment, especially production, we can find weaknesses in a complex system experimentally, observe how it copes with failures and anomalies, and build our confidence in it.
Here I use the open source chaos engineering framework Chaos Toolkit to walk through a simple chaos experiment.

Source code

https://gitee.com/lengdanran/chaostoolkit-experiment

Identify the target system

Here I use a small system built mostly from simple Flask programs, with four components:

  • DataSourceService: simulates a database service and represents the data source of the whole system
  • ShowDataService: simulates a front-end service that displays data
  • Gateway: simulates Nginx and forwards requests
  • Keeper: a background daemon that automatically creates a new service process instance when a service is unavailable

I start several separate processes to simulate a container cluster deployment in production, and improve the availability of the whole system by adding redundancy. The Gateway distributes client requests across this small pseudo-cluster.

Prepare the experiment.json experiment plan

The following is the example configuration given in the official ChaosToolkit documentation:

{
    "title": "What is the impact of an expired certificate on our application chain?",
    "description": "If a certificate expires, we should gracefully deal with the issue.",
    "tags": ["tls"],
    "steady-state-hypothesis": {
        "title": "Application responds",
        "probes": [
            {
                "type": "probe",
                "name": "the-astre-service-must-be-running",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "os.path",
                    "func": "exists",
                    "arguments": {
                        "path": "astre.pid"
                    }
                }
            },
            {
                "type": "probe",
                "name": "the-sunset-service-must-be-running",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "os.path",
                    "func": "exists",
                    "arguments": {
                        "path": "sunset.pid"
                    }
                }
            },
            {
                "type": "probe",
                "name": "we-can-request-sunset",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "timeout": 3,
                    "verify_tls": false,
                    "url": "https://localhost:8443/city/Paris"
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "swap-to-expired-cert",
            "provider": {
                "type": "process",
                "path": "cp",
                "arguments": "expired-cert.pem cert.pem"
            }
        },
        {
            "type": "probe",
            "name": "read-tls-cert-expiry-date",
            "provider": {
                "type": "process",
                "path": "openssl",
                "arguments": "x509 -enddate -noout -in cert.pem"
            }
        },
        {
            "type": "action",
            "name": "restart-astre-service-to-pick-up-certificate",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": "--echo -HUP -F astre.pid"
            }
        },
        {
            "type": "action",
            "name": "restart-sunset-service-to-pick-up-certificate",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": "--echo -HUP -F sunset.pid"
            },
            "pauses": {
                "after": 1
            }
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "swap-to-valid-cert",
            "provider": {
                "type": "process",
                "path": "cp",
                "arguments": "valid-cert.pem cert.pem"
            }
        },
        {
            "ref": "restart-astre-service-to-pick-up-certificate"
        },
        {
            "ref": "restart-sunset-service-to-pick-up-certificate"
        }
    ]
}

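The jsonpath tolerance used in my own experiment further down needs the optional jsonpath dependency of chaostoolkit-lib:

pip install chaostoolkit-lib[jsonpath]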

Now let's read the experiment plan section by section.

As the configuration above shows, there are not many sections to fill in. There are six:

  • title: a name for the chaos experiment
  • description: a short overview of the chaos experiment
  • tags: tags for the experiment
  • steady-state-hypothesis: the definition of the system's normal (steady) state
  • method: the series of disturbances the experiment applies to the system, made up of actions and probes
  • rollbacks: activities that undo the changes made during the experiment and restore the system to its previous state (optional)

Of these six, only the last three really matter.
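To get a quick overview of a plan before reading it in detail, its sections can be listed programmatically. The following is a minimal sketch and not part of ChaosToolkit: `outline` and `outline_file` are hypothetical helpers that rely only on the section and field names visible in the configuration above.

```python
import json
from typing import Any, Dict, List


def outline(experiment: Dict[str, Any]) -> List[str]:
    """Return one summary line per entry of an experiment dict."""
    lines = [f"title: {experiment.get('title')}"]
    # Probes that define the steady state
    hypothesis = experiment.get("steady-state-hypothesis", {})
    for probe in hypothesis.get("probes", []):
        lines.append(f"hypothesis probe: {probe.get('name')}")
    # Actions and probes executed during the experiment
    for activity in experiment.get("method", []):
        lines.append(f"method {activity.get('type')}: {activity.get('name')}")
    # Rollbacks are either full activities or {"ref": "<name>"} entries
    for activity in experiment.get("rollbacks", []):
        label = activity.get("name") or f"ref to {activity.get('ref')}"
        lines.append(f"rollback: {label}")
    return lines


def outline_file(path: str) -> List[str]:
    """Load an experiment JSON file and outline it."""
    with open(path, "r", encoding="utf-8") as f:
        return outline(json.load(f))
```

Calling `outline_file("experiment.json")` and printing the returned lines gives a compact view of what the experiment checks, does and rolls back.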

steady-state-hypothesis: definition of the steady state

This section defines the indicators that describe the system when it is running normally. For example, at a load of 10,000 QPS a given interface should return status code 200. As long as the interface responds normally under those conditions, we consider the system to be in a normal working state.

The steady-state hypothesis consists of one or more probes, each with a tolerance. Each probe reads some property of the target system, and the experiment checks whether the value falls within the tolerance.
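The idea of comparing a probe's value with its tolerance can be illustrated with a small model. This is a simplified sketch under stated assumptions, not ChaosToolkit's actual implementation: `within_tolerance` is a hypothetical function, it handles only boolean, integer and string tolerances, and it assumes that an HTTP probe's result is a dict with a `status` key.

```python
from typing import Any


def within_tolerance(value: Any, tolerance: Any) -> bool:
    """Simplified model of a tolerance check (illustration only)."""
    # Boolean tolerance, e.g. os.path.exists(...) must return True
    if isinstance(tolerance, bool):
        return value is tolerance
    # Integer tolerance, e.g. an expected HTTP status code such as 200.
    # Assumption: an HTTP probe result looks like {"status": 200, ...}.
    if isinstance(tolerance, int):
        status = value.get("status") if isinstance(value, dict) else value
        return not isinstance(status, bool) and status == tolerance
    # String tolerance, e.g. the exact return value of a Python function
    if isinstance(tolerance, str):
        return value == tolerance
    raise TypeError(f"tolerance type not covered by this sketch: {type(tolerance)!r}")
```

In this model the probe "we-can-request-sunset" from the official example passes when the response status is 200 and fails for any other status.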

The experiment.json file used in this experiment

{
  "title": "<======System Chaos Experiment======>",
  "description": "<===Simple Chaos Experiment By ChaosToolkit===>",
  "tags": [
    "Chaostoolkit Experiment"
  ],
  "steady-state-hypothesis": {
    "title": "System State Before Experiment",
    "probes": [
      {
        "type": "probe",
        "name": "<====System GetData Interface Test====>",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.data",
          "expect": "Handle the get http request method",
          "target": "body"
        },
        "provider": {
          "type": "http",
          "timeout": 20,
          "verify_tls": false,
          "url": "http://localhost:5000/getData"
        }
      },
      {
        "type": "probe",
        "name": "<====System ShowData Interface Test====>",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.data",
          "expect": "Handle the get http request method",
          "target": "body"
        },
        "provider": {
          "type": "http",
          "timeout": 20,
          "verify_tls": false,
          "url": "http://localhost:5000/showData"
        }
      },
      {
        "type": "probe",
        "name": "<=====python module call=====>",
        "tolerance": "this is a test func output",
        "provider": {
          "type": "python",
          "module": "chaostkex.experiment",
          "func": "test",
          "arguments": {}
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "Kill 1 service instance of DataSourceService",
      "provider": {
        "type": "python",
        "module": "chaostkex.experiment",
        "func": "kill_services",
        "arguments": {
          "num": 1,
          "port_file_path": "E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt"
        }
      }
    },
    {
      "type": "action",
      "name": "Kill 1 service instance of ShowDataService",
      "provider": {
        "type": "python",
        "module": "chaostkex.experiment",
        "func": "kill_services",
        "arguments": {
          "num": 1,
          "port_file_path": "E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt"
        }
      }
    }
  ],
  "rollbacks": []
}

Steps of the chaos experiment

The architecture of the system is fairly simple, and the DataSource service is independent of the other services. The chaos experiment tests whether the two external interfaces of the system, http://127.0.0.1:5000/getData and http://127.0.0.1:5000/showData, work normally. A request enters through the Gateway, is forwarded to a backend server, and the response is returned to the caller.

The overall experiment is simple:

  • Kill one DataSource process and one ShowData process, then check whether the two interfaces exposed by the system still work normally
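The check on the two interfaces can also be done by hand while developing the experiment. Below is a minimal sketch, assuming the Gateway listens on 127.0.0.1:5000 and a healthy reply is the JSON body used in the tolerance of experiment.json; `fetch_body` and `interfaces_healthy` are hypothetical helper names, not ChaosToolkit functions.

```python
import json
from typing import Callable, Dict, Iterable
from urllib.request import urlopen

# Value that experiment.json expects under "$.data"
EXPECTED_DATA = "Handle the get http request method"


def fetch_body(url: str, timeout: float = 20.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as rsp:
        return rsp.read().decode("utf-8")


def interfaces_healthy(urls: Iterable[str],
                       fetch: Callable[[str], str] = fetch_body) -> Dict[str, bool]:
    """Map each URL to True if it answers with the expected JSON body."""
    results = {}
    for url in urls:
        try:
            body = json.loads(fetch(url))
            results[url] = body.get("data") == EXPECTED_DATA
        except Exception:
            # Connection errors and non-JSON replies both count as unhealthy
            results[url] = False
    return results
```

Passing `fetch` as a parameter keeps the function testable without a running system. With the services up, `interfaces_healthy(["http://127.0.0.1:5000/getData", "http://127.0.0.1:5000/showData"])` can be run before and after killing a process.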

Writing a service driver

To let chaostoolkit run its actions and probes against the target system during the experiment, we need to write a custom experiment driver for the target system. Here is the driver I used this time:

import os
import platform
from chaosservices import DataSourceService, ShowDataService


def test():
    print("this is a test func output")
    return "this is a test func output"


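def kill_services_by_ports(ports: list = None) -> bool:
    """Kill every process bound to one of the given ports."""
    ports = ports or []  # avoid a shared mutable default argument
    sysstr = platform.system()
    if sysstr == "Windows":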
        try:
            for port in ports:
                with os.popen('netstat -ano|findstr "%d"' % int(port)) as res:
                    res = res.read().split('\n')
                result = []
                for line in res:
                    temp = [i for i in line.split(' ') if i != '']
                    if len(temp) > 4:
                        result.append({'pid': temp[4], 'address': temp[1], 'state': temp[3]})
                for r in result:
                    if int(r['pid']) == 0:
                        continue
                    os.system(command="taskkill /f /pid %d" % int(r['pid']))
        except Exception as e:
            print(e)
            return False

        return True
    else:
        print("Other System tasks")
        for port in ports:
            command = '''kill -9 $(netstat -nlp | grep :''' + \
                      str(port) + ''' | awk '{print $7}' | awk -F"/" '{ print $1 }')'''
            os.system(command)
    return True


def get_ports(port_file_path: str) -> list:
    if port_file_path is None or os.path.exists(port_file_path) is False:
        raise FileNotFoundError
    ports = []
    with open(port_file_path, 'r') as f:
        lines = f.readlines()
    for line in lines:
        if line.strip() != '':
            ports.append(line.strip())
    return list(set(ports))


def kill_services(num: int = 1, port_file_path: str = '') -> bool:
    if num < 1:
        return True
    ports = get_ports(port_file_path=port_file_path)
    cnt = min(num, len(ports))
    for i in range(0, cnt):
        kill_services_by_ports([ports[i]])
    return True


def start_datasource_service(port: int = 8080, portsfile: str = None) -> bool:
    DataSourceService.start(port=port, portsfile=portsfile)
    return True


def start_showdata_service(port: int = 8090, portsfile: str = None) -> bool:
    ShowDataService.start(port=port, portsfile=portsfile)
    return True


if __name__ == '__main__':
    # port_file_path = '../chaosservices/ports/dataSourcePort.txt'
    # kill_services(num=1, port_file_path=port_file_path)
    kill_services_by_ports([8080])
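The driver functions above are reached through "python" providers in experiment.json, which name a module, a function and its keyword arguments. The following sketch shows conceptually what such a lookup amounts to; it is an illustration under that assumption, not ChaosToolkit's real code, and `call_python_provider` is a hypothetical name.

```python
import importlib
from typing import Any, Dict


def call_python_provider(provider: Dict[str, Any]) -> Any:
    """Import the named module and call the named function (illustration only)."""
    if provider.get("type") != "python":
        raise ValueError("this sketch only handles providers of type 'python'")
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    # The "arguments" object is passed as keyword arguments
    return func(**provider.get("arguments", {}))
```

This is why the argument names in experiment.json (`num`, `port_file_path`) must match the parameter names of `kill_services`, and why the module `chaostkex.experiment` has to be importable from the environment in which `chaos run` is executed.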

Target system program

DataSource

from typing import Dict

from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["GET"])
def getData() -> Dict[str, str]:
    if request.method == "GET":
        return {"data": "Handle the get http request method"}
    else:
        return {"data": "Other methods handled."}


def clear_file(portsfile=None) -> None:
    f = open(portsfile, 'w')
    f.truncate()
    f.close()


def start(host='127.0.0.1', port=8080, portsfile='./ports/dataSourcePort.txt') -> None:
    print("[Info]:\tServe on %s" % str(port))
    clear_file(portsfile=portsfile)
    with open(portsfile, "a+") as f:
        f.write(str(port) + '\n')
    app.run(host=host, port=port, debug=False)


if __name__ == '__main__':
    start(port=8080, portsfile='E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt')

ShowDataService

import requests as net_req
from flask import Flask

app = Flask(__name__)

# If command-line arguments are parsed here, chaostoolkit cannot load this module correctly
# parser = argparse.ArgumentParser(description='manual to this script')
# parser.add_argument("--host", type=str, default="127.0.0.1")
# parser.add_argument("--port", type=int, default=8090)
# parser.add_argument("--portsfile", type=str, default='./ports/showPort.txt')
# args = parser.parse_args()

url = 'http://127.0.0.1:5000/getData'


@app.route('/', methods=['GET'])
def show_data() -> str:
    rsp = net_req.get(url=url)
    print(rsp)
    return rsp.text


def clear_file(portsfile=None) -> None:
    f = open(portsfile, 'w')
    f.truncate()
    f.close()


def start(host='127.0.0.1', port=8090, portsfile='./ports/dataShowPort.txt') -> None:
    print("[Info]:\tServe on %s" % str(port))
    clear_file(portsfile=portsfile)
    with open(portsfile, "a+") as f:
        f.write(str(port) + '\n')
    app.run(host=host, port=port, debug=False)


if __name__ == '__main__':
    start(port=8090, portsfile='E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt')

Gateway

import requests as net
import json
import sys
from flask import Flask, request

app = Flask(__name__)

# List of data source servers
datasource = []
# List of data display (front-end) servers
datashow = []

datasource_idx = 0
datashow_idx = 0


@app.route('/getData', methods=['GET'])
def get_data() -> str:
    print('[====INFO===]:\tHandle the request from %s' % request.url)
    res = get(urls=datasource)
    return res if res != '' else 'There is no DataSourceService available.'


@app.route('/showData', methods=['GET'])
def show_data() -> str:
    print('[====INFO===]:\tHandle the request from %s' % request.url)
    res = get(urls=datashow)
    return res if res != '' else 'There is no ShowDataService available.'


def get(urls: list) -> str:
    """
    Request the URLs in the given list in order and return the first successful response.
    :param urls: list of candidate URLs
    :return: the response body as a str, or '' if every URL fails
    """
    for url in urls:
        try:
            rsp = net.get(url, timeout=10)
            print('[====INFO====]:\tForward this request to %s' % url)
            return rsp.text
        except Exception as e:
            print("[====EXCEPTION====]:\t%s" % e)
            continue
    return ''


def _get_configuration(file_path='./conf/gateway.json') -> None:
    """
    Load configuration from configuration file
    :param file_path:The path of the configuration file. The default is './conf/gateway.json'
    :return: None
    """
    print('[====INFO====]:\tLoad configuration from file : %s' % file_path)
    with open(file_path) as f:
        conf = json.load(f)
        global datasource, datashow
        datasource = conf["datasource"]
        datashow = conf["datashow"]


if __name__ == '__main__':
    print('[====INFO====]:\tLoads the configuration......')
    try:
        _get_configuration()
    except IOError as error:
        print('[====ERROR====]:\t%s' % error)
        sys.exit(-1)
    print('[====INFO====]:\tStart the Gateway...')
    app.run(host='127.0.0.1', port=5000, debug=False)
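The Gateway reads its backend lists from ./conf/gateway.json, which is not shown in this post. Based on how `_get_configuration` and `get` use it, the file presumably holds two lists of URLs under the keys "datasource" and "datashow". The sketch below generates such a file; the exact URLs are an assumption derived from the default ports (8080, 8090) and the "/" route of the services, and `build_gateway_conf` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, List


def build_gateway_conf(datasource_ports: Iterable[int],
                       datashow_ports: Iterable[int],
                       host: str = "127.0.0.1") -> Dict[str, List[str]]:
    """Build a config dict with the two keys the Gateway reads."""
    return {
        "datasource": [f"http://{host}:{p}/" for p in datasource_ports],
        "datashow": [f"http://{host}:{p}/" for p in datashow_ports],
    }


def write_gateway_conf(path: str, conf: Dict[str, List[str]]) -> None:
    """Write the config as JSON, e.g. to ./conf/gateway.json."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(conf, f, indent=2)
```

For the two-instance run described later, something like `build_gateway_conf([8080, 8081], [8090, 8091])` would list both instances of each service, so that the Gateway can fall back to the second URL when the first one fails.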

Keeper

This program monitors the service status. If a service becomes unavailable, it automatically starts a new instance so that the system keeps working normally.

import os
import socket
import time
import DataSourceService, ShowDataService
from multiprocessing import Process


def get_ports(port_file_path: str) -> list:
    if port_file_path is None or os.path.exists(port_file_path) is False:
        raise FileNotFoundError
    ports = []
    with open(port_file_path, 'r') as f:
        lines = f.readlines()
    for line in lines:
        if line.strip() != '':
            ports.append(int(line.strip()))
    return list(set(ports))


def get_available_service(port_file: str = None) -> bool:
    if port_file is None:
        return False
    ports = get_ports(port_file_path=port_file)
    for p in ports:
        if check_port_in_use(port=p):
            return True
    return False


def check_port_in_use(host='127.0.0.1', port=8080) -> bool:
    s = None
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(1)
        s.connect((host, int(port)))
        return True
    except socket.error:
        return False
    finally:
        if s:
            s.close()


def creat(func, args):
    p = Process(target=func, args=args)
    p.start()


def start(port_files: list = []) -> None:
    sleep_time = 5
    while True:
        print('Start Checking...')
        # Get the port list corresponding to each service
        port_file = port_files[0]
        # Check if there are available service instances
        if get_available_service(port_file=port_file) is False:
            # There are no service instances available. Create a new instance
            print('[===INFO===]:\t Creating a new DataSourceService instance')
            ports = get_ports(port_file_path=port_file)
            # get_ports() returns an unordered list, so take the largest port explicitly
            last = max(ports) if ports else 8080
            new_p = last + 1
            DataSourceService.clear_file(portsfile=port_file)
            creat(func=DataSourceService.start, args=('127.0.0.1', new_p,port_file,))

        port_file = port_files[1]
        # Check if there are available service instances
        if get_available_service(port_file=port_file) is False:
            # There are no service instances available. Create a new instance
            print('[===INFO===]:\t Creating a new ShowDataService instance')
            ports = get_ports(port_file_path=port_file)
            # get_ports() returns an unordered list, so take the largest port explicitly
            last = max(ports) if ports else 8090
            new_p = last + 1
            ShowDataService.clear_file(portsfile=port_file)
            creat(func=ShowDataService.start, args=('127.0.0.1', new_p, port_file,))

        time.sleep(sleep_time)


if __name__ == '__main__':
    start(port_files=[
        'E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt',
        'E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt'
    ])

Start experiment

A system with a weakness: the Keeper daemon is not started

For this run, start only one Gateway, one DataSource instance and one ShowData instance. The experiment kills the DataSource and ShowData services, so the interfaces exposed by the system are bound to fail. chaostoolkit should detect such an obvious lack of resilience for us.

$ chaos run experiment.json

Start the target system:

Results of the run:

The results clearly show the following:

[2021-12-06 17:31:50 CRITICAL] Steady state probe '<====System GetData Interface Test====>' is not in the given tolerance so failing this experiment

This shows that chaostoolkit found the system insufficiently resilient. The deviation was detected while verifying the <====System GetData Interface Test====> probe.

[2021-12-06 17:31:50 INFO] Experiment ended with status: deviated
[2021-12-06 17:31:50 INFO] The steady-state has deviated, a weakness may have been discovered

The directory where we ran the chaos run command now contains a journal.json file generated by the experiment, which holds the detailed report data.
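The journal can be inspected with a few lines of Python. This is a hedged sketch: the field names used here (`status`, `deviated`, `steady_states`, `steady_state_met`, `probes`, `activity`, `tolerance_met`) are assumptions about the journal layout and should be checked against a real journal.json; `summarize_journal` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def summarize_journal(journal: Dict[str, Any]) -> List[str]:
    """Return a short, human-readable summary of an experiment journal."""
    lines = [
        f"status: {journal.get('status')}",
        f"deviated: {journal.get('deviated')}",
    ]
    steady = journal.get("steady_states") or {}
    for phase in ("before", "after"):
        state = steady.get(phase) or {}
        if not state:
            continue  # this phase was not recorded
        lines.append(f"steady state {phase}: met={state.get('steady_state_met')}")
        for probe in state.get("probes", []):
            name = (probe.get("activity") or {}).get("name")
            lines.append(f"  probe {name!r}: tolerance_met={probe.get('tolerance_met')}")
    return lines


def summarize_journal_file(path: str = "journal.json") -> List[str]:
    """Load a journal file and summarize it."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_journal(json.load(f))
```

Printing the lines returned by `summarize_journal_file()` after a run gives a quick view of which probe caused the deviation, without reading the whole file.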

Start two service instances

The weakness above comes from each service running as a single instance, so availability is low. A simple way to improve availability is to add redundancy. For this run I started two instances each of DataSource and ShowData and ran the chaos experiment again.

With the added redundancy, the system keeps working normally after the disturbance is injected.

Start the Keeper daemon

Besides adding redundancy, you can also start a monitoring process that watches the service status. As soon as a service fails, it starts a new service instance, which also improves availability.

This also improves the resilience of the system.

Posted by steve m on Mon, 06 Dec 2021 23:40:41 -0800