Getting Started with Chaos Engineering Using Chaos Toolkit
Chaos Engineering is a discipline for testing the resilience of complex systems. It uncovers weaknesses in a complex system experimentally, especially in production. By deliberately introducing various kinds of chaos into the environment, we can observe how well the system copes with disruption and anomalies, and thereby build confidence in it.
Here, I use the open source Chaos Engineering framework Chaos Toolkit to walk through a simple chaos experiment.
Source code
https://gitee.com/lengdanran/chaostoolkit-experiment
Identify the target system
Here, I use a small system built from simple Flask services:
- DataSourceService: simulates a database service that represents the data source of the whole system
- ShowDataService: simulates a front-end service that displays data
- Gateway: simulates Nginx and forwards requests
- Keeper: a background daemon that automatically creates a new service process instance when a service is unavailable
I will start several different processes to simulate a container cluster deployment in a production environment, and improve the availability of the whole system by adding redundancy. The Gateway distributes client requests across this small pseudo cluster.
Prepare the experiment.json experiment plan
The following is an example configuration from the official Chaos Toolkit documentation:
{
  "title": "What is the impact of an expired certificate on our application chain?",
  "description": "If a certificate expires, we should gracefully deal with the issue.",
  "tags": ["tls"],
  "steady-state-hypothesis": {
    "title": "Application responds",
    "probes": [
      {
        "type": "probe",
        "name": "the-astre-service-must-be-running",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "os.path",
          "func": "exists",
          "arguments": {
            "path": "astre.pid"
          }
        }
      },
      {
        "type": "probe",
        "name": "the-sunset-service-must-be-running",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "os.path",
          "func": "exists",
          "arguments": {
            "path": "sunset.pid"
          }
        }
      },
      {
        "type": "probe",
        "name": "we-can-request-sunset",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "timeout": 3,
          "verify_tls": false,
          "url": "https://localhost:8443/city/Paris"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "swap-to-expired-cert",
      "provider": {
        "type": "process",
        "path": "cp",
        "arguments": "expired-cert.pem cert.pem"
      }
    },
    {
      "type": "probe",
      "name": "read-tls-cert-expiry-date",
      "provider": {
        "type": "process",
        "path": "openssl",
        "arguments": "x509 -enddate -noout -in cert.pem"
      }
    },
    {
      "type": "action",
      "name": "restart-astre-service-to-pick-up-certificate",
      "provider": {
        "type": "process",
        "path": "pkill",
        "arguments": "--echo -HUP -F astre.pid"
      }
    },
    {
      "type": "action",
      "name": "restart-sunset-service-to-pick-up-certificate",
      "provider": {
        "type": "process",
        "path": "pkill",
        "arguments": "--echo -HUP -F sunset.pid"
      },
      "pauses": {
        "after": 1
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "swap-to-valid-cert",
      "provider": {
        "type": "process",
        "path": "cp",
        "arguments": "valid-cert.pem cert.pem"
      }
    },
    {
      "ref": "restart-astre-service-to-pick-up-certificate"
    },
    {
      "ref": "restart-sunset-service-to-pick-up-certificate"
    }
  ]
}
The jsonpath tolerance used later in this article requires the jsonpath extra of chaostoolkit-lib:
pip install chaostoolkit-lib[jsonpath]
Now let's read the experiment plan section by section.
As the configuration above shows, there are not many sections to configure. There are six:
- title: a name for this chaos experiment
- description: a brief overview of this chaos experiment
- tags: tags for the experiment
- steady-state-hypothesis: the definition of the system's steady state
- method: the series of disruptions the experiment applies to the system, made up mainly of actions and probes
- rollbacks: after the chaos experiment, undo the operations performed on the system to restore it to its pre-experiment state (optional)
Of these six sections, the last three are the important ones.
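To get a feel for these sections, here is a small sketch that counts what each one contains. The summarize_experiment helper is hypothetical (it is not part of Chaos Toolkit); it only reads the section names listed above from a plain dictionary.

```python
import json


def summarize_experiment(experiment: dict) -> dict:
    """Return a small overview of the main sections of an experiment.

    Hypothetical helper for illustration only; not part of Chaos Toolkit.
    """
    hypothesis = experiment.get("steady-state-hypothesis", {})
    method = experiment.get("method", [])
    return {
        "title": experiment.get("title", ""),
        "tags": experiment.get("tags", []),
        "probes_in_hypothesis": len(hypothesis.get("probes", [])),
        "actions_in_method": sum(1 for a in method if a.get("type") == "action"),
        "probes_in_method": sum(1 for a in method if a.get("type") == "probe"),
        "rollbacks": len(experiment.get("rollbacks", [])),
    }


if __name__ == "__main__":
    # A tiny inline experiment so the sketch runs without any file on disk.
    sample = json.loads("""
    {
      "title": "demo",
      "description": "demo experiment",
      "tags": ["demo"],
      "steady-state-hypothesis": {"title": "ok", "probes": [{"type": "probe"}]},
      "method": [{"type": "action"}, {"type": "probe"}],
      "rollbacks": []
    }
    """)
    print(summarize_experiment(sample))
```

To inspect a real plan, load the file with json.load and pass the resulting dictionary to the helper.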
steady-state-hypothesis: defining the steady state
This section defines the metrics that describe the system when it is operating normally in its steady state. For example, under a load of 10,000 QPS, a given interface of the system should return status code 200. As long as the interface responds normally under those conditions, we consider the system to be in its normal working state.
The steady-state hypothesis consists of one or more probes and their corresponding tolerances. Each probe looks up a property of the target system and checks whether its value is within the tolerance.
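The idea of comparing a probe's value against a tolerance can be sketched in a few lines. This is a simplified model I wrote for illustration, not Chaos Toolkit's implementation: within_tolerance is a hypothetical name, and it only covers scalar, list, and range tolerances (the jsonpath tolerance used later in this article is not modelled).

```python
from typing import Any


def within_tolerance(value: Any, tolerance: Any) -> bool:
    """Simplified tolerance check (illustrative, not the library's code).

    - scalar tolerance (bool, int, str): the value must equal it
    - list tolerance: the value must be one of the listed items
    - {"type": "range", "range": [low, high]}: low <= value <= high
    """
    if isinstance(tolerance, dict) and tolerance.get("type") == "range":
        low, high = tolerance["range"]
        return low <= value <= high
    if isinstance(tolerance, list):
        return value in tolerance
    return value == tolerance


if __name__ == "__main__":
    # An HTTP probe with tolerance 200 expects that status code.
    print(within_tolerance(200, 200))
    print(within_tolerance(500, 200))
    # A file-exists probe with tolerance true expects True.
    print(within_tolerance(True, True))
```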
experiment.json file used in the experiment
{
  "title": "<======System Chaos Experiment======>",
  "description": "<===Simple Chaos Experiment By ChaosToolkit===>",
  "tags": [
    "Chaostoolkit Experiment"
  ],
  "steady-state-hypothesis": {
    "title": "System State Before Experiment",
    "probes": [
      {
        "type": "probe",
        "name": "<====System GetData Interface Test====>",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.data",
          "expect": "Handle the get http request method",
          "target": "body"
        },
        "provider": {
          "type": "http",
          "timeout": 20,
          "verify_tls": false,
          "url": "http://localhost:5000/getData"
        }
      },
      {
        "type": "probe",
        "name": "<====System ShowData Interface Test====>",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.data",
          "expect": "Handle the get http request method",
          "target": "body"
        },
        "provider": {
          "type": "http",
          "timeout": 20,
          "verify_tls": false,
          "url": "http://localhost:5000/showData"
        }
      },
      {
        "type": "probe",
        "name": "<=====python module call=====>",
        "tolerance": "this is a test func output",
        "provider": {
          "type": "python",
          "module": "chaostkex.experiment",
          "func": "test",
          "arguments": {}
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "Kill 1 service instance of DataSourceService",
      "provider": {
        "type": "python",
        "module": "chaostkex.experiment",
        "func": "kill_services",
        "arguments": {
          "num": 1,
          "port_file_path": "E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt"
        }
      }
    },
    {
      "type": "action",
      "name": "Kill 1 service instance of ShowSourceService",
      "provider": {
        "type": "python",
        "module": "chaostkex.experiment",
        "func": "kill_services",
        "arguments": {
          "num": 1,
          "port_file_path": "E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt"
        }
      }
    }
  ],
  "rollbacks": []
}
Steps of the chaos experiment
The system's architecture is relatively simple: the DataSource service is independent of the other services. The chaos experiment tests whether the external interfaces http://127.0.0.1:5000/getData and http://127.0.0.1:5000/showData work normally. A request enters through the Gateway, which forwards it to a service instance and returns the response to the caller.
The overall experiment is simple:
- Kill one process of the DataSource service and one process of the ShowData service, then check whether the system's two public interfaces still work normally
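The flow of such an experiment can be sketched as follows: check the steady state, apply the method, check the steady state again, then run the rollbacks. This is a simplified in-memory model under my own assumptions, not Chaos Toolkit's code; run_experiment and the status strings "failed", "deviated" and "completed" are used here only to illustrate the idea.

```python
from typing import Callable, List


def run_experiment(
    steady_state: Callable[[], bool],
    method: List[Callable[[], None]],
    rollbacks: List[Callable[[], None]],
) -> str:
    """Simplified experiment flow (illustrative only)."""
    if not steady_state():
        # The system was not healthy to begin with, so nothing is injected.
        return "failed"
    try:
        for action in method:
            action()
        # Re-check the hypothesis after the disruption.
        return "completed" if steady_state() else "deviated"
    finally:
        for rollback in rollbacks:
            rollback()


if __name__ == "__main__":
    # In-memory stand-in for the pseudo cluster: instance count per service.
    instances = {"DataSource": 1, "ShowData": 1}

    def all_services_up() -> bool:
        return all(count > 0 for count in instances.values())

    def kill_one(service: str) -> Callable[[], None]:
        def action() -> None:
            instances[service] = max(0, instances[service] - 1)
        return action

    status = run_experiment(
        steady_state=all_services_up,
        method=[kill_one("DataSource"), kill_one("ShowData")],
        rollbacks=[],
    )
    print(status)
```

With one instance per service this model ends as "deviated"; starting from two instances per service it ends as "completed", which is the effect of redundancy explored later in this article.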
Writing a service driver
To let Chaos Toolkit perform actions and probes on the target system during the experiment, we need to write a custom experiment driver for the target system. The following is my driver:
import os
import platform

from chaosservices import DataSourceService, ShowDataService


def test():
    print("this is a test func output")
    return "this is a test func output"


def kill_services_by_ports(ports: list = []) -> bool:
    sysstr = platform.system()
    if (sysstr == "Windows"):
        try:
            for port in ports:
                with os.popen('netstat -ano|findstr "%d"' % int(port)) as res:
                    res = res.read().split('\n')
                result = []
                for line in res:
                    temp = [i for i in line.split(' ') if i != '']
                    if len(temp) > 4:
                        result.append({'pid': temp[4], 'address': temp[1], 'state': temp[3]})
                for r in result:
                    if int(r['pid']) == 0:
                        continue
                    os.system(command="taskkill /f /pid %d" % int(r['pid']))
        except Exception as e:
            print(e)
            return False
        return True
    else:
        print("Other System tasks")
        for port in ports:
            command = '''kill -9 $(netstat -nlp | grep :''' + \
                      str(port) + ''' | awk '{print $7}' | awk -F"/" '{ print $1 }')'''
            os.system(command)
        return True


def get_ports(port_file_path: str) -> list:
    if port_file_path is None or os.path.exists(port_file_path) is False:
        raise FileNotFoundError
    ports = []
    with open(port_file_path, 'r') as f:
        lines = f.readlines()
        for line in lines:
            if line.strip() != '':
                ports.append(line.strip())
    return list(set(ports))


def kill_services(num: int = 1, port_file_path: str = '') -> bool:
    if num < 1:
        return True
    ports = get_ports(port_file_path=port_file_path)
    cnt = min(num, len(ports))
    for i in range(0, cnt):
        kill_services_by_ports([ports[i]])
    return True


def start_datasource_service(port: int = 8080, portsfile: str = None) -> bool:
    DataSourceService.start(port=port, portsfile=portsfile)
    return True


def start_showdata_service(port: int = 8090, portsfile: str = None) -> bool:
    ShowDataService.start(port=port, portsfile=portsfile)
    return True


if __name__ == '__main__':
    # port_file_path = '../chaosservices/ports/dataSourcePort.txt'
    # kill_services(num=1, port_file_path=port_file_path)
    kill_services_by_ports([8080])
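The experiment file refers to the driver through a python provider, which names a module, a func and its arguments. Conceptually, that amounts to importing the module and calling the function with keyword arguments. The sketch below shows the idea with the standard library only; call_python_provider is a hypothetical name, and json.dumps stands in for a driver function such as kill_services. It is not how Chaos Toolkit is actually implemented.

```python
import importlib
from typing import Any, Dict


def call_python_provider(provider: Dict[str, Any]) -> Any:
    """Import provider["module"] and call provider["func"] with its arguments.

    Illustrative sketch of the dispatch idea, not Chaos Toolkit's own code.
    """
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


if __name__ == "__main__":
    # json.dumps is only a stand-in for a real driver function.
    provider = {
        "type": "python",
        "module": "json",
        "func": "dumps",
        "arguments": {"obj": {"num": 1}},
    }
    print(call_python_provider(provider))
```

This also explains why the driver module must be importable from the environment in which the chaos command runs.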
Target system program
DataSource
from typing import Dict

from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["GET"])
def getData() -> Dict[str, str]:
    if request.method == "GET":
        return {"data": "Handle the get http request method"}
    else:
        return {"data": "Other methods handled."}


def clear_file(portsfile=None) -> None:
    f = open(portsfile, 'w')
    f.truncate()
    f.close()


def start(host='127.0.0.1', port=8080, portsfile='./ports/dataSourcePort.txt') -> None:
    print("[Info]:\tServe on %s" % str(port))
    clear_file(portsfile=portsfile)
    with open(portsfile, "a+") as f:
        f.write(str(port) + '\n')
    app.run(host=host, port=port, debug=False)


if __name__ == '__main__':
    start(port=8080, portsfile='E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt')
ShowDataService
import requests as net_req
from flask import Flask

app = Flask(__name__)

# If command-line argument parsing is added here, Chaos Toolkit will not recognize it correctly
# parser = argparse.ArgumentParser(description='manual to this script')
# parser.add_argument("--host", type=str, default="127.0.0.1")
# parser.add_argument("--port", type=int, default=8090)
# parser.add_argument("--portsfile", type=str, default='./ports/showPort.txt')
# args = parser.parse_args()

url = 'http://127.0.0.1:5000/getData'


@app.route('/', methods=['GET'])
def show_data() -> str:
    rsp = net_req.get(url=url)
    print(rsp)
    return rsp.text


def clear_file(portsfile=None) -> None:
    f = open(portsfile, 'w')
    f.truncate()
    f.close()


def start(host='127.0.0.1', port=8090, portsfile='./ports/dataShowPort.txt') -> None:
    print("[Info]:\tServe on %s" % str(port))
    clear_file(portsfile=portsfile)
    with open(portsfile, "a+") as f:
        f.write(str(port) + '\n')
    app.run(host=host, port=port, debug=False)


if __name__ == '__main__':
    start(port=8090, portsfile='E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt')
Gateway
import requests as net
import json
import sys
from flask import Flask, request

app = Flask(__name__)

# List of data source servers
datasource = []
# List of data display front-end servers
datashow = []

datasource_idx = 0
datashow_idx = 0


@app.route('/getData', methods=['GET'])
def get_data() -> str:
    print('[====INFO===]:\tHandle the request from %s' % request.url)
    res = get(urls=datasource)
    return res if res != '' else 'There is no DataSourceService available.'


@app.route('/showData', methods=['GET'])
def show_data() -> str:
    print('[====INFO===]:\tHandle the request from %s' % request.url)
    res = get(urls=datashow)
    return res if res != '' else 'There is no ShowDataService available.'


def get(urls: list) -> str:
    """
    Given a list of URLs, request the first available URL
    and return the response result.
    :param urls: list of URLs
    :return: response text as str
    """
    for url in urls:
        try:
            rsp = net.get(url, timeout=10)
            print('[====INFO====]:\tForward this request to %s' % url)
            return rsp.text
        except Exception as e:
            print("[====EXCEPTION====]:\t%s" % e)
            continue
    return ''


def _get_configuration(file_path='./conf/gateway.json') -> None:
    """
    Load configuration from the configuration file.
    :param file_path: path of the configuration file, './conf/gateway.json' by default
    :return: None
    """
    print('[====INFO====]:\tLoad configuration from file : %s' % file_path)
    with open(file_path) as f:
        conf = json.load(f)
        global datasource, datashow
        datasource = conf["datasource"]
        datashow = conf["datashow"]


if __name__ == '__main__':
    print('[====INFO====]:\tLoads the configuration......')
    try:
        _get_configuration()
    except IOError as error:
        print('[====ERROR====]:\t%s' % error)
        sys.exit(-1)
    print('[====INFO====]:\tStart the Gateway...')
    app.run(host='127.0.0.1', port=5000, debug=False)
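The Gateway reads its instance lists from ./conf/gateway.json using the keys datasource and datashow, but that file is not shown in this article. The sketch below generates one plausible version of it. The helper names and the port numbers (8080, 8081, 8090, 8091) are my assumptions for illustration; adjust them to the instances you actually start.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_gateway_conf(
    datasource_ports: List[int],
    datashow_ports: List[int],
    host: str = "127.0.0.1",
) -> Dict[str, List[str]]:
    """Build the two URL lists that the Gateway code reads (assumed layout)."""
    return {
        "datasource": [f"http://{host}:{port}/" for port in datasource_ports],
        "datashow": [f"http://{host}:{port}/" for port in datashow_ports],
    }


def write_gateway_conf(conf: dict, file_path: str = "./conf/gateway.json") -> None:
    """Write the configuration as JSON, creating the directory if needed."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(conf, indent=2), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical ports: two instances per service.
    conf = build_gateway_conf([8080, 8081], [8090, 8091])
    print(json.dumps(conf, indent=2))
```

Because the Gateway tries the URLs in order and returns the first response it gets, listing more than one instance per service is what gives the system its redundancy.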
Keeper
This program monitors the service status. If a service is unavailable, it automatically starts a new instance so that the system keeps working normally.
import os
import socket
import time
import DataSourceService, ShowDataService
from multiprocessing import Process


def get_ports(port_file_path: str) -> list:
    if port_file_path is None or os.path.exists(port_file_path) is False:
        raise FileNotFoundError
    ports = []
    with open(port_file_path, 'r') as f:
        lines = f.readlines()
        for line in lines:
            if line.strip() != '':
                ports.append(int(line.strip()))
    return list(set(ports))


def get_available_service(port_file: str = None) -> bool:
    if port_file is None:
        return False
    ports = get_ports(port_file_path=port_file)
    for p in ports:
        if check_port_in_use(port=p):
            return True
    return False


def check_port_in_use(host='127.0.0.1', port=8080) -> bool:
    s = None
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(1)
        s.connect((host, int(port)))
        return True
    except socket.error:
        return False
    finally:
        if s:
            s.close()


def creat(func, args):
    p = Process(target=func, args=args)
    p.start()


def start(port_files: list = []) -> None:
    sleep_time = 5
    while True:
        print('Start Checking...')
        # Get the port list corresponding to each service
        port_file = port_files[0]
        # Check if there are available service instances
        if get_available_service(port_file=port_file) is False:
            # There are no service instances available. Create a new instance
            print('[===INFO===]:\t Creating a DataSourceService instance')
            ports = get_ports(port_file_path=port_file)
            if len(ports) == 0:
                last = 8080
            else:
                # get_ports() returns an unordered list, so take the highest port
                last = max(ports)
            new_p = last + 1
            DataSourceService.clear_file(portsfile=port_file)
            creat(func=DataSourceService.start, args=('127.0.0.1', new_p, port_file,))
        port_file = port_files[1]
        # Check if there are available service instances
        if get_available_service(port_file=port_file) is False:
            # There are no service instances available. Create a new instance
            print('[===INFO===]:\t Creating a ShowDataService instance')
            ports = get_ports(port_file_path=port_file)
            if len(ports) == 0:
                last = 8090
            else:
                last = max(ports)
            new_p = last + 1
            ShowDataService.clear_file(portsfile=port_file)
            creat(func=ShowDataService.start, args=('127.0.0.1', new_p, port_file,))
        time.sleep(sleep_time)


if __name__ == '__main__':
    start(port_files=[
        'E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt',
        'E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt'
    ])
Start the experiment
A flawed system: the Keeper daemon is not started
In this setup, start just one Gateway, one DataSource service, and one ShowData service. Following the experiment's logic, the DataSource and ShowData instances will be killed, so the interfaces the system exposes are bound to fail. Chaos Toolkit should detect such an obvious lack of resilience for us.
$ chaos run experiment.json
Start the target system first, then run the experiment with the command above. In the output, we can clearly see the following line:
[2021-12-06 17:31:50 CRITICAL] Steady state probe '<====System GetData Interface Test====>' is not in the given tolerance so failing this experiment
This shows that Chaos Toolkit found the system's resilience to be insufficient. The deviation was detected while verifying the probe '<====System GetData Interface Test====>'.
[2021-12-06 17:31:50 INFO] Experiment ended with status: deviated
[2021-12-06 17:31:50 INFO] The steady-state has deviated, a weakness may have been discovered
In the directory where we ran the chaos run command, the experiment generates a journal.json file, which contains a detailed report of the experiment.
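The journal can be inspected with a few lines of Python. The sketch below assumes the journal has top-level "status" and "deviated" fields; treat those names as assumptions and check them against your own journal.json. The summarize_journal and load_journal helpers are hypothetical names of my own.

```python
import json
from typing import Any, Dict


def load_journal(path: str = "journal.json") -> Dict[str, Any]:
    """Load the journal file written by the experiment run."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_journal(journal: Dict[str, Any]) -> str:
    """Return a one-line summary; the field names are assumed, not guaranteed."""
    status = journal.get("status", "unknown")
    deviated = journal.get("deviated", False)
    return f"status={status} deviated={deviated}"


if __name__ == "__main__":
    # Inline sample so the sketch runs without a real journal on disk.
    sample = {"status": "deviated", "deviated": True}
    print(summarize_journal(sample))
```

For a real run, replace the inline sample with load_journal() called from the directory that contains journal.json.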
Start two service instances
The lack of resilience above comes from each service running as a single instance, so availability is low. A simple way to improve availability is to increase the system's redundancy. In this run, I started two instances each of DataSource and ShowData and ran the chaos experiment again.
As we can see, with the added redundancy the system still works normally after the disruption is injected.
Start the Keeper daemon
Besides adding redundancy, you can also start a monitoring process that continuously watches the service status. As soon as a service becomes abnormal, it creates a new service instance, which improves availability.
As we can see, the system's resilience has improved in this case as well!