Average load and CPU utilization
In real work, load average and CPU utilization are often confused, so let me also draw the distinction here.
You may wonder: since the load average represents the number of active processes, doesn't a high load average simply mean high CPU utilization?
We need to go back to the meaning of the load average: it is the number of processes in the running state and the uninterruptible state per unit of time. So it includes not only processes that are using the CPU, but also processes waiting for the CPU and processes waiting for I/O.
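As a side note, the kernel exposes these numbers directly, so you can read them without any tool. The first three fields of /proc/loadavg are the 1-, 5- and 15-minute load averages (the sample output below is illustrative, not from the case machine):

cat /proc/loadavg
# e.g. 0.45 0.32 0.18 2/120 1715
# first three fields: 1-, 5- and 15-minute load averages
# fourth field: currently runnable entities / total scheduling entities
# fifth field: the most recently created PID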
CPU utilization, on the other hand, is a statistic of how busy the CPU is per unit of time, and it does not necessarily correspond to the load average. For example:
• For a CPU-intensive process, heavy CPU usage drives the load average up, and in this case the two metrics agree.
• For an I/O-intensive process, waiting for I/O also drives the load average up, but CPU utilization is not necessarily high.
• A large number of processes waiting to be scheduled onto the CPU also drives the load average up, and in this case CPU utilization is relatively high as well.
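If you want to watch the two metrics side by side while reproducing the cases below, a minimal sketch (assuming sysstat is installed so that mpstat is available) is:

# print the 1-minute load average next to the overall CPU idle percentage,
# roughly once per second (mpstat 1 1 itself takes one second to sample)
while true; do
    load1=$(cut -d ' ' -f 1 /proc/loadavg)
    idle=$(mpstat 1 1 | awk '/^Average/ {print $NF}')
    echo "load1=${load1}  cpu_idle=${idle}%"
done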
Case studies of the load average
Next, let's walk through three examples, one for each of these situations, and use tools such as iostat, mpstat and pidstat to find the root cause of a rising load average.
The following cases are all based on CentOS 7.4, but they apply to other Linux distributions as well. The environment I used is:
Machine configuration: 2 CPUs, 1 GB memory
# 2 physical CPUs
[root@localhost ~]# cat /proc/cpuinfo | grep 'physical id'
physical id     : 0
physical id     : 2
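You can also simply count the logical CPUs the scheduler sees, which is the number the load average should be compared against:

# count logical CPUs
grep -c ^processor /proc/cpuinfo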
• Pre-install the stress and sysstat packages (e.g. apt install stress sysstat on Debian/Ubuntu); on CentOS:
yum install epel*
yum install stress -y
yum install sysstat -y
Here is a brief introduction to the two tools, stress and sysstat.
stress is a Linux stress-testing tool; here we use it to spawn misbehaving processes and simulate scenarios where the load average rises.
sysstat is a collection of common Linux performance tools for monitoring and analyzing system performance. In our cases we will use two commands from this package, mpstat and pidstat:
- mpstat is a common multi-core CPU performance analysis tool, used to view the performance metrics of each CPU, as well as the average across all CPUs, in real time.
- pidstat is a common process performance analysis tool, used to view the CPU, memory, I/O, context-switch and other performance metrics of processes in real time.
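Both commands take a sampling interval (in seconds) and a report count as trailing arguments; typical invocations look like this:

# per-CPU metrics: one report every 5 seconds, 2 reports in total
mpstat -P ALL 5 2

# per-process CPU metrics: one report every 5 seconds, 1 report
pidstat -u 5 1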
In addition, each scenario requires you to open three terminals and log in to the same Linux machine.
If everything above is ready, you can use the uptime command to check the load average before the test:
[root@localhost ~]# uptime
 06:47:44 up 13 min,  1 user,  load average: 0.00, 0.01, 0.02
Scenario 1: CPU intensive process
First, we run the stress command on the first terminal to simulate a scenario with 100% CPU utilization:
[root@localhost ~]# stress --cpu 1 --timeout 600
stress: info: [1209] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Next, run uptime on the second terminal to watch how the load average changes:
# the -d option highlights the parts of the output that change between updates
[root@localhost ~]# watch -d uptime
Every 2.0s: uptime                          Thu Jun 25 06:57:24 2020

 06:57:24 up 23 min,  3 users,  load average: 0.95, 0.63, 0.30
Finally, run mpstat on the third terminal to watch how CPU utilization changes:
[root@localhost ~]# mpstat -P ALL 1 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

06:56:22 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:56:23 AM  all   49.75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.25
06:56:23 AM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:56:23 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

06:56:23 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:56:24 AM  all   49.49    0.00    0.51    0.00    0.00    0.00    0.00    0.00    0.00   50.00
06:56:24 AM    0   43.43    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   56.57
06:56:24 AM    1   55.56    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   43.43

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   49.62    0.00    0.25    0.00    0.00    0.00    0.00    0.00    0.00   50.13
Average:       0   71.86    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   28.14
Average:       1   27.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   72.00
From terminal 2 you can see that the 1-minute load average slowly climbs toward 1.00, and from terminal 3 you can see that one CPU is at 100% utilization while its iowait stays at 0. This shows that the rise in the load average is caused entirely by the 100% CPU utilization.
So, which process is responsible for 100% CPU utilization? You can use pidstat to query:
[root@localhost ~]# pidstat 1 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

06:59:29 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
06:59:30 AM     0      1210   98.02    0.00    0.00   98.02     0  stress

06:59:30 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
06:59:31 AM     0      1210  100.00    0.00    0.00  100.00     1  stress
06:59:31 AM     0      1678    0.00    0.99    0.00    0.99     0  pidstat

Average:      UID       PID    %usr %system  %guest    %CPU   CPU  Command
Average:        0      1210   99.01    0.00    0.00   99.01     -  stress
Average:        0      1678    0.00    0.50    0.00    0.50     -  pidstat

Here you can see clearly that it is the stress process whose CPU usage is (nearly) 100%.
Scenario 2: I/O intensive process
First, run the stress command again, but this time simulate I/O pressure, that is, keep executing sync:
# --io (-i) spawns N workers, each repeatedly calling sync(), which flushes memory buffers to disk
[root@localhost ~]# stress --io 1 --timeout 600
stress: info: [1935] dispatching hogs: 0 cpu, 1 io, 0 vm, 0 hdd
Still on the second terminal, run uptime to watch how the load average changes:
[root@localhost ~]# watch -d uptime
Every 2.0s: uptime                          Thu Jun 25 07:06:58 2020

 07:06:58 up 32 min,  3 users,  load average: 1.03, 0.75, 0.50
Then, on the third terminal, run mpstat to watch how CPU utilization changes:
# show metrics for all CPUs, one report every 5 seconds
[root@localhost ~]# mpstat -P ALL 5 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

13:41:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:41:33     all    0.21    0.00   12.07   32.67    0.00    0.21    0.00    0.00    0.00   54.84
13:41:33       0    0.43    0.00   23.87   67.53    0.00    0.43    0.00    0.00    0.00    7.74
13:41:33       1    0.00    0.00    0.81    0.20    0.00    0.00    0.00    0.00    0.00   98.99
Here you can see that the 1-minute load average slowly climbs to 1.03, and on one CPU the system (%sys) utilization rises to 23.87% while iowait reaches as high as 67.53%. This shows that the rise in the load average is caused by the rising iowait.
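To confirm the disk side of this picture, you can also turn to iostat, the third tool mentioned at the beginning; a typical check (a sketch, run in any spare terminal) is:

# extended per-device statistics, one report per second, 3 reports;
# a high %util means the device is busy, and await shows average I/O latency
iostat -x 1 3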
So which process causes iowait to be so high? We still use pidstat to query:
[root@localhost ~]# pidstat 2 2
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain)  06/25/2020  _x86_64_  (2 CPU)

07:20:43 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:20:45 AM     0      2838  100.00    0.00    0.00  100.00     0  stress
07:20:45 AM     0      2991    0.00    0.50    0.00    0.50     1  pidstat

07:20:45 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:20:47 AM     0       409    0.00    0.50    0.00    0.50     0  xfsaild/dm-0
07:20:47 AM     0      1099    0.00    0.50    0.00    0.50     0  sshd
07:20:47 AM     0      2019    0.50    0.00    0.00    0.50     0  watch
07:20:47 AM     0      2838   98.50    0.00    0.00   98.50     1  stress
07:20:47 AM     0      2951    0.00    0.50    0.00    0.50     0  kworker/0:0
Again, the culprit turns out to be the stress process.
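Note that the plain pidstat output above only shows CPU time; this sysstat version has no per-process I/O-wait column. As a rough per-process view of the I/O side you can use pidstat's disk mode (keep in mind that a worker that only calls sync() flushes dirty pages written by others, so it may not show large numbers itself):

# per-process disk statistics: kB read and written per second
pidstat -d 1 3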
Scenario 3: a large number of processes
When the number of runnable processes in the system exceeds what the CPUs can execute, processes start waiting for CPU time.
For example, we use stress again, but this time simulating 10 processes:
[root@localhost ~]# stress -c 10 --timeout 600
stress: info: [3356] dispatching hogs: 10 cpu, 0 io, 0 vm, 0 hdd
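With only 2 CPUs and 10 stress workers, processes now queue for CPU time. Watching uptime on the second terminal, you would see the load average climb well past the CPU count, and a per-process view such as the sketch below shows how long each process waits for a CPU (note: the %wait column only appears in newer pidstat, sysstat 11.5.5 or later; the CentOS 7.4 default version is older):

# CPU usage per process; on sysstat >= 11.5.5 the output also includes
# %wait, the share of time a process spends runnable but waiting for a CPU
pidstat -u 5 1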