Average load and CPU utilization
In real work, load average and CPU utilization are often confused, so let me also draw the distinction here.
You may wonder: since the load average represents the number of active processes, doesn't a high load average simply mean high CPU utilization?
We need to go back to the meaning of the load average: it is the number of processes in the running state and the uninterruptible state per unit of time. So it includes not only processes that are using the CPU, but also processes waiting for the CPU and processes waiting for I/O.
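As a side note, the kernel exposes these numbers directly, so you can read them without any tool. The first three fields of /proc/loadavg are the 1-, 5- and 15-minute load averages (the sample output below is illustrative, not from the case machine):

cat /proc/loadavg
# e.g. 0.45 0.32 0.18 2/120 1715
# first three fields: 1-, 5- and 15-minute load averages
# fourth field: currently runnable entities / total scheduling entities
# fifth field: the most recently created PID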
CPU utilization, on the other hand, is a statistic of how busy the CPU is per unit of time, and it does not necessarily correspond to the load average. For example:
• For a CPU-intensive process, heavy CPU usage drives the load average up, and in this case the two metrics agree.
• For an I/O-intensive process, waiting for I/O also drives the load average up, but CPU utilization is not necessarily high.
• A large number of processes waiting to be scheduled onto the CPU also drives the load average up, and in this case CPU utilization is relatively high as well.
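If you want to watch the two metrics side by side while reproducing the cases below, a minimal sketch (assuming sysstat is installed so that mpstat is available) is:

# print the 1-minute load average next to the overall CPU idle percentage,
# roughly once per second (mpstat 1 1 itself takes one second to sample)
while true; do
    load1=$(cut -d ' ' -f 1 /proc/loadavg)
    idle=$(mpstat 1 1 | awk '/^Average/ {print $NF}')
    echo "load1=${load1}  cpu_idle=${idle}%"
done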
Case studies of the load average
Next, let's walk through three examples, one for each of these situations, and use tools such as iostat, mpstat and pidstat to find the root cause of a rising load average.
The following cases are all based on CentOS 7.4, but they apply to other Linux distributions as well. The environment I used is:
Machine configuration: 2 CPUs, 1 GB memory
# 2 physical CPUs
[root@localhost ~]# cat /proc/cpuinfo | grep 'physical id'
physical id     : 0
physical id     : 2
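You can also simply count the logical CPUs the scheduler sees, which is the number the load average should be compared against:

# count logical CPUs
grep -c ^processor /proc/cpuinfo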
• Pre-install the stress and sysstat packages (e.g. apt install stress sysstat on Debian/Ubuntu); on CentOS:
yum install epel*
yum install stress -y
yum install sysstat -y
Here is a brief introduction to the two tools, stress and sysstat.
stress is a Linux stress-testing tool; here we use it to spawn misbehaving processes and simulate scenarios where the load average rises.
sysstat is a collection of common Linux performance tools for monitoring and analyzing system performance. In our cases we will use two commands from this package, mpstat and pidstat:
- mpstat is a common multi-core CPU performance analysis tool, used to view the performance metrics of each CPU, as well as the average across all CPUs, in real time.
- pidstat is a common process performance analysis tool, used to view the CPU, memory, I/O, context-switch and other performance metrics of processes in real time.
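Both commands take a sampling interval (in seconds) and a report count as trailing arguments; typical invocations look like this:

# per-CPU metrics: one report every 5 seconds, 2 reports in total
mpstat -P ALL 5 2

# per-process CPU metrics: one report every 5 seconds, 1 report
pidstat -u 5 1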
In addition, each scenario requires you to open three terminals and log in to the same Linux machine.
If everything above is ready, you can use the uptime command to check the load average before the test:
[root@localhost ~]# uptime
 06:47:44 up 13 min,  1 user,  load average: 0.00, 0.01, 0.02
Scenario 1: CPU intensive process
First, we run the stress command on the first terminal to simulate a scenario with 100% CPU utilization:
[root@localhost ~]# stress --cpu 1 --timeout 600
stress: info: [1209] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Next, run uptime on the second terminal to watch how the load average changes:
# the -d option highlights the parts of the output that change between updates
[root@localhost ~]# watch -d uptime
Every 2.0s: uptime                          Thu Jun 25 06:57:24 2020

 06:57:24 up 23 min,  3 users,  load average: 0.95, 0.63, 0.30
Finally, run mpstat on the third terminal to watch how CPU utilization changes:
[root@localhost ~]# mpstat -P ALL 1 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

06:56:22 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:56:23 AM  all   49.75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.25
06:56:23 AM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:56:23 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

06:56:23 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:56:24 AM  all   49.49    0.00    0.51    0.00    0.00    0.00    0.00    0.00    0.00   50.00
06:56:24 AM    0   43.43    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   56.57
06:56:24 AM    1   55.56    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   43.43

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   49.62    0.00    0.25    0.00    0.00    0.00    0.00    0.00    0.00   50.13
Average:       0   71.86    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   28.14
Average:       1   27.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   72.00
From terminal 2 you can see that the 1-minute load average slowly climbs toward 1.00, and from terminal 3 you can see that one CPU is at 100% utilization while its iowait stays at 0. This shows that the rise in the load average is caused entirely by the 100% CPU utilization.
So, which process is responsible for 100% CPU utilization? You can use pidstat to query:
[root@localhost ~]# pidstat 1 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

06:59:29 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
06:59:30 AM     0      1210   98.02    0.00    0.00   98.02     0  stress

06:59:30 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
06:59:31 AM     0      1210  100.00    0.00    0.00  100.00     1  stress
06:59:31 AM     0      1678    0.00    0.99    0.00    0.99     0  pidstat

Average:      UID       PID    %usr %system  %guest    %CPU   CPU  Command
Average:        0      1210   99.01    0.00    0.00   99.01     -  stress
Average:        0      1678    0.00    0.50    0.00    0.50     -  pidstat

Here you can see clearly that it is the stress process whose CPU usage is (nearly) 100%.
Scenario 2: I/O intensive process
First, run the stress command again, but this time simulate I/O pressure, that is, keep executing sync:
# --io (-i) spawns N workers, each repeatedly calling sync(), which flushes memory buffers to disk
[root@localhost ~]# stress --io 1 --timeout 600
stress: info: [1935] dispatching hogs: 0 cpu, 1 io, 0 vm, 0 hdd
Still on the second terminal, run uptime to watch how the load average changes:
[root@localhost ~]# watch -d uptime
Every 2.0s: uptime                          Thu Jun 25 07:06:58 2020

 07:06:58 up 32 min,  3 users,  load average: 1.03, 0.75, 0.50
Then, on the third terminal, run mpstat to watch how CPU utilization changes:
# show metrics for all CPUs, one report every 5 seconds
[root@localhost ~]# mpstat -P ALL 5 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

13:41:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:41:33     all    0.21    0.00   12.07   32.67    0.00    0.21    0.00    0.00    0.00   54.84
13:41:33       0    0.43    0.00   23.87   67.53    0.00    0.43    0.00    0.00    0.00    7.74
13:41:33       1    0.00    0.00    0.81    0.20    0.00    0.00    0.00    0.00    0.00   98.99
Here you can see that the 1-minute load average slowly climbs to 1.03, and on one CPU the system (%sys) utilization rises to 23.87% while iowait reaches as high as 67.53%. This shows that the rise in the load average is caused by the rising iowait.
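To confirm the disk side of this picture, you can also turn to iostat, the third tool mentioned at the beginning; a typical check (a sketch, run in any spare terminal) is:

# extended per-device statistics, one report per second, 3 reports;
# a high %util means the device is busy, and await shows average I/O latency
iostat -x 1 3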
So which process causes iowait to be so high? We still use pidstat to query:
[root@localhost ~]# pidstat 2 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

07:20:43 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:20:45 AM     0      2838  100.00    0.00    0.00  100.00     0  stress
07:20:45 AM     0      2991    0.00    0.50    0.00    0.50     1  pidstat

07:20:45 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:20:47 AM     0       409    0.00    0.50    0.00    0.50     0  xfsaild/dm-0
07:20:47 AM     0      1099    0.00    0.50    0.00    0.50     0  sshd
07:20:47 AM     0      2019    0.50    0.00    0.00    0.50     0  watch
07:20:47 AM     0      2838   98.50    0.00    0.00   98.50     1  stress
07:20:47 AM     0      2951    0.00    0.50    0.00    0.50     0  kworker/0:0
Again, the culprit turns out to be the stress process.
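Note that the plain pidstat output above only shows CPU time; this sysstat version has no per-process I/O-wait column. As a rough per-process view of the I/O side you can use pidstat's disk mode (keep in mind that a worker that only calls sync() flushes dirty pages written by others, so it may not show large numbers itself):

# per-process disk statistics: kB read and written per second
pidstat -d 1 3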
Scenario 3: a large number of processes
When the number of runnable processes in the system exceeds what the CPUs can execute, processes start waiting for CPU time.
For example, we use stress again, but this time simulating 10 processes:
[root@localhost ~]# stress -c 10 --timeout 600
stress: info: [3356] dispatching hogs: 10 cpu, 0 io, 0 vm, 0 hdd
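With only 2 CPUs and 10 stress workers, processes now queue for CPU time. Watching uptime on the second terminal, you would see the load average climb well past the CPU count, and a per-process view such as the sketch below shows how long each process waits for a CPU (note: the %wait column only appears in newer pidstat, sysstat 11.5.5 or later; the CentOS 7.4 default version is older):

# CPU usage per process; on sysstat >= 11.5.5 the output also includes
# %wait, the share of time a process spends runnable but waiting for a CPU
pidstat -u 5 1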