Alibaba Chaos Engineering: a first look at the ChaosBlade tool

1, Foreword

ChaosBlade is an open-source fault-injection tool from Alibaba that follows the principles of chaos engineering and the chaos experiment model. It helps enterprises improve the fault tolerance of distributed systems and keep the business running while they move to the cloud or migrate to cloud-native systems.

ChaosBlade is the open-source version of Alibaba's internal MonkeyKing project. It builds on nearly ten years of fault testing and drill practice at Alibaba and combines the best ideas and practices of the group's businesses.

ChaosBlade is simple to use and supports a rich set of experiment scenarios, including:

Basic resources: CPU, memory, network, disk, process, and other experiment scenarios;
Java applications: databases, caches, messaging, the JVM itself, microservices, and so on. You can also target any class method to inject complex experiment scenarios;
C++ applications: for example, injecting a delay, tampering with variables, or tampering with return values at any method or line of code;
Docker containers: killing a container, plus CPU, memory, network, disk, process, and other experiment scenarios inside the container;
Cloud-native platforms: CPU, memory, network, disk, and process scenarios on Kubernetes nodes; Pod network scenarios and scenarios on the Pod itself, such as killing a Pod; and container scenarios such as the Docker container scenarios above.

Scenarios are packaged into separate projects by domain. This standardizes how scenarios in each domain are implemented and makes it easy to extend them both horizontally and vertically. Because every project follows the chaos experiment model, all of them can be invoked uniformly through the chaosblade CLI. The projects currently included are:

chaosblade: the chaos experiment management tool, with commands for creating, destroying, and querying experiments and for preparing and revoking the experiment environment. It is the execution tool for chaos experiments and can be driven through the CLI or over HTTP. It provides complete help for commands, experiment scenarios, and scenario parameters, and is simple and clear to operate.
chaosblade-spec-go: the chaos experiment model defined in Go. Scenarios that are easy to implement in Go are built on this specification.
chaosblade-exec-os: implementation of the basic resource experiment scenarios.
chaosblade-exec-docker: implementation of the Docker container experiment scenarios, standardized by calling the Docker API.
chaosblade-operator: implementation of the Kubernetes platform experiment scenarios. Chaos experiments are defined as standard Kubernetes CRDs, so they can be created, updated, and deleted with the usual Kubernetes resource tooling, including kubectl and client-go. They can also be run with the chaosblade CLI tool described above.
chaosblade-exec-jvm: implementation of the Java application experiment scenarios. It uses Java Agent technology to attach dynamically, with no integration work and at zero cost. It also supports unloading, which fully reclaims the resources created by the agent.
chaosblade-exec-cplus: implementation of the C++ application experiment scenarios. It uses GDB to inject experiment scenarios at the method and code-line level.

2, Installation

chaosblade download address

wget -c
tar -zxvf chaosblade-1.3.0-linux-amd64.tar.gz
  • Environment variable configuration

If the installation directory is /root/chaosblade/chaosblade-1.3.0

cat >> ~/.bash_profile << \EOF
export PATH=$PATH:/root/chaosblade/chaosblade-1.3.0
EOF

source ~/.bash_profile

3, Use

Official documents

The official documentation covers every command in detail. A few commands are tested below.

[root@node1 ~]# blade create -h
Create a chaos engineering experiment

Usage:
  blade create [command]

Aliases:
  create, c

Examples:
blade create cpu load --cpu-percent 60

Available Commands:
  cplus       C++ chaos experiments
  cpu         Cpu experiment
  disk        Disk experiment
  docker      Docker experiment
  druid       Experiment with the Druid
  dubbo       Experiment with the Dubbo
  es          ElasticSearch experiment!
  file        File experiment
  gateway     gateway experiment!
  hbase       hbase experiment!
  http        http experiment
  jedis       jedis experiment
  jvm         Experiment with the JVM
  k8s         Kubernetes experiment
  kafka       kafka experiment
  lettuce     redis client lettuce experiment
  log         log experiment
  mem         Mem experiment
  mongodb     MongoDB experiment
  mysql       mysql experiment
  network     Network experiment
  process     Process experiment
  psql        Postgrelsql experiment
  rabbitmq    rabbitmq experiment
  redisson    redisson experiment
  rocketmq    Rocketmq experiment,can make message send or pull delay and exception
  script      Script chaos experiment
  servlet     java servlet experiment
  strace      strace experiment
  systemd     Systemd experiment
  tars        tars experiment

Flags:
  -a, --async             whether to create asynchronously, default is false
  -e, --endpoint string   the create result reporting address. It takes effect only when the async value is true and the value is not empty
  -h, --help              help for create
  -n, --nohup             used to internal async create, no need to config
      --uid string        Set Uid for the experiment, adapt to docker

Global Flags:
  -d, --debug   Set client to DEBUG mode

Use "blade create [command] --help" for more information about a command.


1.1 Destroying an experiment

blade status --type create
blade destroy 6fa04946baf42920
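
In a script, the experiment ID passed to blade destroy can be captured from the output of blade create instead of being copied by hand. The following is a minimal sketch, assuming blade create prints a one-line JSON object whose result field holds the ID, for example {"code":200,"success":true,"result":"6fa04946baf42920"}. extract_uid is a hypothetical helper, not part of ChaosBlade.

```shell
# extract_uid: hypothetical helper, not part of ChaosBlade.
# Assumes the JSON shape {"code":200,"success":true,"result":"<hex uid>"}.
# Prints the uid, or nothing if no hex "result" field is found.
extract_uid() {
  printf '%s\n' "$1" | sed -n 's/.*"result":"\([0-9a-f]*\)".*/\1/p'
}

# Usage (needs blade on PATH; commented out so the block has no side effects):
#   out=$(blade create cpu fullload --timeout 60)
#   uid=$(extract_uid "$out")
#   [ -n "$uid" ] && blade destroy "$uid"
```

If the output shape differs in your version, check the raw output of blade create first and adjust the pattern.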

1.2 CPU full load test

The purpose is to verify service quality, monitoring and alerting, traffic scheduling, and elastic scaling when the CPU is under a specific load.

blade create cpu fullload --timeout 60

1.3 Disk I/O test

Verify the impact of high disk I/O load on system services, such as monitoring and alerting and service stability.

blade create disk burn --read --path /home
blade create disk burn --write --path /home

1.4 Disk fill test

Verify the impact of a full disk on system services, such as monitoring and alerting and service stability.
fallocate is used here to preallocate the space, so on most file systems (such as ext4 and xfs) the disk fills almost instantly without any data actually being written.

df -h /home
blade c disk fill --path /home --percent 80 --retain-handle
df -h /home
df -h /home
blade c disk fill --path /home --reserve 0 --timeout 30
df -h /home
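
To confirm that the fill took effect without reading df output by eye, the usage percentage can be pulled out of df. The following is a minimal sketch based on the POSIX df -P output format; parse_usage and usage_percent are hypothetical helper names.

```shell
# parse_usage: read `df -P` output on stdin and print the Capacity column
# of the first data row, without the percent sign.
parse_usage() {
  awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# usage_percent: print the usage percentage of the file system holding $1.
usage_percent() {
  df -P "$1" | parse_usage
}

# Usage, around the fill experiment above (commented out, needs blade):
#   before=$(usage_percent /home)
#   blade c disk fill --path /home --percent 80 --retain-handle
#   after=$(usage_percent /home)
#   echo "usage went from ${before}% to ${after}%"
```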

1.5 blade server

Provides an HTTP service for remote invocation.

blade server start -p 9526

Open firewall port 9526/tcp

firewall-cmd --zone=public --add-port=9526/tcp --permanent
firewall-cmd --reload

In the request URL, %20 is a URL-encoded space.
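
A request to the blade server carries the blade subcommand in the URL query string, which is why spaces have to be written as %20. The following is a minimal sketch of building such a URL, assuming the server exposes a /chaosblade endpoint that takes a cmd parameter. The host 192.168.1.10 and the helper blade_url are hypothetical.

```shell
# blade_url: hypothetical helper that builds a blade server request URL.
# Assumes the endpoint form http://<host>:<port>/chaosblade?cmd=<command>.
# Only spaces are encoded here; other reserved characters would need a
# full URL encoder.
blade_url() {
  host=$1
  port=$2
  cmd=$3
  printf 'http://%s:%s/chaosblade?cmd=%s\n' "$host" "$port" \
    "$(printf '%s' "$cmd" | sed 's/ /%20/g')"
}

# Usage (commented out, needs a running blade server):
#   curl "$(blade_url 192.168.1.10 9526 'create cpu fullload --timeout 60')"
```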




2.1 nginx network delay test

A separate client machine can be used to test the network latency. time curl may need to be run several times.

docker pull nginx:latest
docker run --name nginx-test -p 80:80 -d nginx
time curl
blade create docker network delay --time 3000 --interface eth0 --local-port 80 --container-id 70b82b112fac
time curl
blade destroy 0de302011f872a49
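
Instead of comparing time curl output by eye, the request time can be captured with curl's -w '%{time_total}' option and compared before and after the injection. The following is a minimal sketch. The URL http://127.0.0.1/ and the helpers request_seconds and delay_observed are hypothetical, and the 3-second threshold simply mirrors --time 3000 above.

```shell
# request_seconds: print the total time of one GET request, in seconds.
request_seconds() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# delay_observed BEFORE AFTER MIN: succeed if AFTER exceeds BEFORE by at
# least MIN seconds (awk is used because sh has no floating-point math).
delay_observed() {
  awk -v b="$1" -v a="$2" -v m="$3" 'BEGIN { exit !((a - b) >= m) }'
}

# Usage (commented out, needs the nginx container and blade):
#   before=$(request_seconds http://127.0.0.1/)
#   blade create docker network delay --time 3000 --interface eth0 --local-port 80 --container-id 70b82b112fac
#   after=$(request_seconds http://127.0.0.1/)
#   delay_observed "$before" "$after" 3 && echo "delay injected"
```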


  • blade prepare jvm (which takes several seconds) is a prerequisite for running blade create jvm.
  • Running blade prepare jvm several times against the same process still leaves a single agent instance; only its update time changes.
  • In newer versions, running blade create jvm without an existing blade prepare jvm instance creates one implicitly (older versions may report an error). It is therefore recommended to run blade prepare jvm explicitly first.

3.1 JVM delay

Note: fb5a73661026c4f0 is the ID of the blade create jvm experiment; d54ae242c4538e97 is the ID of the implicitly created blade prepare jvm instance.

blade status --type prepare --target jvm
blade create jvm delay --time 3000 --classname=com.example.blade.HelloController --methodname=hello  --pid 6428
blade status --type prepare --target jvm
blade destroy fb5a73661026c4f0
blade status --type prepare --target jvm
blade revoke d54ae242c4538e97


Note: 4d5adfcbc7d46956 is the ID of the blade create jvm experiment; c05f768526eaf4db is the ID of the explicitly created blade prepare jvm instance.

blade prepare jvm  --pid 6428
blade create jvm delay --time 3000 --classname=com.example.blade.HelloController --methodname=hello  --pid 6428
blade status --type prepare --target jvm
blade destroy 4d5adfcbc7d46956
blade revoke c05f768526eaf4db
blade status --type prepare --target jvm

Posted by Smackie on Sat, 27 Nov 2021 23:21:34 -0800