Common commands for troubleshooting online problems


Memory bottleneck

free

free shows memory usage, including physical memory, swap space, and kernel buffers and cache.

free -h -s 3 prints memory usage in human-readable units every three seconds:

[1014154@cc69dd4c5-4tdb5 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:      119623656    43052220    45611364     4313760    30960072    70574408
Swap:             0           0           0
[1014154@cc69dd4c5-4tdb5 ~]$ free -h -s 3
              total        used        free      shared  buff/cache   available
Mem:           114G         41G         43G        4.1G         29G         67G
Swap:            0B          0B          0B

              total        used        free      shared  buff/cache   available
Mem:           114G         41G         43G        4.1G         29G         67G
Swap:            0B          0B          0B

Mem: physical memory usage.
Swap: swap space usage.
total: total amount of physical memory or swap space in the system.
used: physical memory or swap space already in use.
free: physical memory or swap space that is not being used at all.
shared: physical memory used by shared memory.
buff/cache: physical memory used by buffers and caches.
available: an estimate of the physical memory that applications can still use, that is, the available memory from an application's perspective; roughly available ≈ free + buffers + cache.
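
As a rough illustration of reading these fields, the sketch below computes the share of physical memory that is still available. The function name mem_available_pct is made up for this example, and it assumes the seven-column layout shown above (with an "available" column) and output in KiB, that is, free without -h.

```shell
# Hypothetical helper: reads `free` output (KiB, no -h) on stdin and prints
# available memory as a percentage of total physical memory.
# Assumes the column order: total used free shared buff/cache available.
mem_available_pct() {
  awk '$1 == "Mem:" { printf "%.1f\n", 100 * $7 / $2 }'
}

# Usage: free | mem_available_pct
```

For the first sample above (70574408 of 119623656 KiB available) this prints 59.0.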

Swap space

Swap space is an area on disk. When the system is running out of physical memory, Linux moves data that is not frequently accessed from memory to swap, so that more physical memory is available to serve processes. When the system needs the contents stored in swap, it loads them back into memory. These two operations are usually called swap out and swap in. Swap space can alleviate memory shortages to some extent, but it performs poorly because it requires reading from and writing to disk.

vmstat (recommended)

vmstat (Virtual Memory Statistics) is a common tool for monitoring memory on Linux. It reports the overall state of the operating system's virtual memory, processes, CPUs, and so on, and is the recommended tool.

vmstat 5 3 collects statistics every 5 seconds, three times in total.

[1014154@cc69dd4c5-4tdb5 ~]$ vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  0      0 45453212 374768 30763728    0    0    14    99    1    1 11 10 78  0  1
10  0      0 45489232 374768 30763360    0    0     2  1275 95118 97908 13 11 75  0  1
 6  0      0 45452908 374768 30765148    0    0     0  3996 89924 92073 12 10 78  0  1

procs

r: the number of processes that are running or waiting for a CPU time slice (that is, how many processes are actually assigned to the CPU). If this value stays above the number of CPUs in the system, CPU capacity is insufficient and needs to be increased.
b: the number of processes waiting for resources, such as I/O or memory swapping.

memory

swpd: the amount of memory swapped out to swap space, that is, the amount of virtual memory in use (in KB). If it is greater than 0, the machine is running short of physical memory. If a memory leak in the program is not the cause, either add memory or migrate memory-intensive tasks to another machine.
free: the amount of physical memory that is currently free.
buff: the size of the buffers, which are generally needed for reads and writes to block devices.
cache: the size of the cache, which is generally used to cache file system data. Frequently accessed files are cached, so a large cache value means that many files are cached. If bi in the io section is small at the same time, the file system is being served efficiently from the cache.

swap

si: the amount of memory swapped in from disk per second, that is, data read from swap space back into memory. If this value is greater than 0, physical memory is insufficient or memory is leaking; find the memory-consuming process and resolve it.
so: the amount of memory swapped out to disk per second, that is, data written from memory to swap space.

Note: si and so are normally 0. If they stay above 0 for a long time, the system is short of memory and memory needs to be added.

io

bi: the amount of data read from block devices (disk reads), in KB/s.
bo: the amount of data written to block devices (disk writes), in KB/s.

Note: If bi + bo is very large and wa is also large, the system has a disk I/O bottleneck.

system

in: the number of device interrupts per second observed over the interval.
cs: the number of context switches per second. The smaller this value, the better; if it is too large, consider reducing the number of threads or processes. For example, when load-testing web servers such as Apache and nginx, we usually run thousands or even tens of thousands of concurrent requests. The number of server processes or threads can be lowered step by step from its peak and the test repeated until cs reaches a relatively small value. System calls matter as well: every time a system function is called, our code enters kernel space, which causes a context switch. This is a resource-intensive operation, so system calls should be kept to a minimum. Too many context switches mean that the CPU spends most of its time switching contexts instead of doing useful work, so the CPU is not fully utilized, which is undesirable.

Note: The larger these two values are, the more CPU time is consumed by the kernel.

CPU

us: the percentage of CPU time consumed by user processes. The higher the value, the more CPU time user processes consume. If it is above 50%, consider optimizing the program or its algorithms.
sy: the percentage of CPU time consumed by the kernel. Usually us + sy should be below 80%; a higher value suggests a possible CPU bottleneck.
id: the percentage of time the CPU is idle.
wa: the percentage of CPU time spent waiting for I/O. The higher the value, the more serious the I/O wait. As a rule of thumb the reference value is 20%; above that, I/O wait is serious. I/O wait can be caused by a large number of random disk reads and writes, or by a bandwidth bottleneck (mainly block operations) in the disk or disk controller.
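
The rules of thumb above (si/so above 0, us + sy above 80%, wa above 20%) can be checked mechanically. The following is only a sketch: vmstat_check is a hypothetical helper, the thresholds are the empirical values quoted in this article, and the column positions assume the 17-column vmstat layout shown earlier.

```shell
# Hypothetical helper: reads `vmstat` output on stdin and prints one line
# for every rule of thumb that a sample violates.
# Assumed columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
vmstat_check() {
  awk '
    $1 !~ /^[0-9]+$/ { next }          # skip header lines
    {
      n++
      if ($7 > 0 || $8 > 0) print "sample " n ": swapping, si=" $7 " so=" $8
      if ($13 + $14 > 80)   print "sample " n ": us+sy=" ($13 + $14) "% (possible CPU bottleneck)"
      if ($16 > 20)         print "sample " n ": wa=" $16 "% (serious I/O wait)"
    }'
}

# Usage: vmstat 5 3 | vmstat_check
```

Keep in mind that the first sample vmstat prints is an average since boot, so the later samples are the ones that reflect current load.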

sar

sar is similar to free; sar -r 3 prints memory information every three seconds:

[root@localhost ~]# sar -r 3
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020-04-28  _x86_64_        (2 CPU)

15:40:10    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
15:40:13       106800   1314960     92.49      2144    573248   4110864    116.82    563664    498888        36
15:40:16       106816   1314944     92.49      2144    573248   4110864    116.82    563668    498888        36
15:40:19       106816   1314944     92.49      2144    573248   4110864    116.82    563668    498888        36

CPU Bottleneck

View machine cpu cores

Total number of CPU cores = number of physical CPUs * number of cores per physical CPU
Total number of logical CPUs = number of physical CPUs * number of cores per physical CPU * number of hyperthreads per core

View CPU information (model)

[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
     32  Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

View the number of physical CPUs

[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
16

View the number of cores in each physical CPU

[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores       : 2

View the number of logical CPUs

[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo| grep "processor"| wc -l
32
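
The three counts above can also be gathered in one pass. This is a minimal sketch, assuming the x86 /proc/cpuinfo layout with "physical id", "cpu cores", and "processor" lines; cpu_summary is a name invented for this example.

```shell
# Hypothetical helper: reads /proc/cpuinfo-style text on stdin and prints
# the number of physical CPUs, cores per physical CPU, and logical CPUs.
cpu_summary() {
  awk -F: '
    /^physical id/ { sockets[$2] = 1 }   # one entry per distinct physical CPU
    /^cpu cores/   { cores = $2 + 0 }    # cores per physical CPU
    /^processor/   { logical++ }         # one line per logical CPU
    END {
      n = 0
      for (s in sockets) n++
      printf "physical=%d cores_per_cpu=%d logical=%d\n", n, cores, logical
    }'
}

# Usage: cpu_summary < /proc/cpuinfo
```

On the machine above this would report physical=16, cores_per_cpu=2, logical=32, which matches 16 * 2 * 1 = 32 (no hyperthreading).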

top

In the Linux kernel, processes are scheduled dynamically based on virtual runtime, which is computed from the process priority, the nice value, and the actual CPU time used. User space cannot call kernel-space functions directly, so a running process has to switch from user mode to kernel mode. This is done through system calls, and the switch from user space to kernel space is usually triggered by a soft interrupt. For example, to perform a disk operation, a user-mode process invokes the kernel's disk routines through a system call. CPU time is therefore divided into user-mode CPU time, system (kernel) CPU time, and CPU time spent on disk operations.

A process goes through a series of steps when it runs. It first executes in user mode, where its priority may be adjusted (nice). It then enters the kernel through a system call, and hard and soft interrupts make the hardware carry out the task. When the work is complete, the kernel returns to the system call, and the system call returns the result to the user-mode process.

top shows total CPU consumption as well as its breakdown, such as user, system, idle, and nice. Inside top, Shift + H toggles the display of threads (for example, the threads of a Java process); Shift + M sorts by memory usage; Shift + P sorts by CPU usage; Shift + T sorts by cumulative CPU time; on multi-core machines, press 1 to see the load on each CPU.

top - 15:24:11 up 8 days,  7:52,  1 user,  load average: 5.73, 6.85, 7.33
Tasks:  17 total,   1 running,  16 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.9 us,  9.2 sy,  0.0 ni, 76.1 id,  0.1 wa,  0.0 hi,  0.1 si,  0.7 st
KiB Mem : 11962365+total, 50086832 free, 38312808 used, 31224016 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 75402760 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   300 ymmapp    20   0 17.242g 1.234g  14732 S   2.3  1.1   9:40.38 java
     1 root      20   0   15376   1988   1392 S   0.0  0.0   0:00.06 sh
    11 root      20   0  120660  11416   1132 S   0.0  0.0   0:04.94 python
    54 root      20   0   85328   2240   1652 S   0.0  0.0   0:00.00 su
    55 ymmapp    20   0   17432   1808   1232 S   0.0  0.0   0:00.00 bash
    56 ymmapp    20   0   17556   2156   1460 S   0.0  0.0   0:00.03 control.sh
    57 ymmapp    20   0   11880    740    576 S   0.0  0.0   0:00.00 tee
   115 ymmapp    20   0   17556   2112   1464 S   0.0  0.0   0:00.02 control_new_war
   133 root      20   0  106032   4240   3160 S   0.0  0.0   0:00.03 sshd
   134 ymmapp    20   0   17080   6872   3180 S   0.0  0.0   0:01.82 ops-updater
   147 ymmapp    20   0   17956   2636   1544 S   0.0  0.0   0:00.07 control.sh
  6538 ymmapp    20   0  115656  10532   3408 S   0.0  0.0   0:00.46 beidou-agent
  6785 ymmapp    20   0 2572996  22512   2788 S   0.0  0.0   0:03.44 gatherinfo4dock
 29241 root      20   0  142148   5712   4340 S   0.0  0.0   0:00.04 sshd
 29243 1014154   20   0  142148   2296    924 S   0.0  0.0   0:00.00 sshd
 29244 1014154   20   0   15208   2020   1640 S   0.0  0.0   0:00.00 bash
 32641 1014154   20   0   57364   2020   1480 R   0.0  0.0   0:00.00 top

Line 1: top - 15:24:11 up 8 days, 7:52, 1 user, load average: 5.73, 6.85, 7.33: 15:24:11 is the system time, up 8 days, 7:52 is the uptime, 1 user is the number of users currently logged in, and load average gives the system load over the last 1, 5, and 15 minutes.

Line 2: Tasks: 17 total, 1 running, 16 sleeping, 0 stopped, 0 zombie: 17 processes in total, of which 1 is running, 16 are sleeping, 0 are stopped, and 0 are zombies.

Line 3: %Cpu(s): 13.9 us, 9.2 sy, 0.0 ni, 76.1 id, 0.1 wa, 0.0 hi, 0.1 si, 0.7 st: user space uses 13.9% of the CPU, kernel space 9.2%, processes with a changed priority 0.0%, idle 76.1%, I/O wait 0.1%, hard interrupts 0.0%, soft interrupts 0.1%, and 0.7% of the current VM's CPU time is stolen by the hypervisor.

Lines 4 and 5 show memory and swap usage.

Line 7 is the header of the process list. Its columns are:

PID: process ID.
USER: process owner.
PR: process priority.
NI: nice value. Negative values mean higher priority and positive values mean lower priority.
VIRT: total virtual memory used by the process, in KB. VIRT = SWAP + RES.
RES: resident memory, the physical memory used by the process that has not been swapped out, in KB. RES = CODE + DATA.
SHR: shared memory size, in KB.
S: process state. D = uninterruptible sleep, R = running, S = sleeping, T = traced or stopped, Z = zombie.
%CPU: percentage of CPU time used since the last update.
%MEM: percentage of physical memory used by the process.
TIME+: total CPU time used by the process, with a resolution of 1/100 second.
COMMAND: process name (command name or command line).

Count the tasks in uninterruptible sleep (state D) that contribute to the CPU load

top -b -n 1 | awk '{if (NR<=7)print;else if($8=="D"){print;count++}}END{print "Total status D:"count}'

[root@localhost ~]# top -b -n 1 | awk '{if (NR<=7)print;else if($8=="D"){print;count++}}END{print "Total status D:"count}'
top - 15:35:05 up 1 day, 26 min,  3 users,  load average: 0.00, 0.01, 0.05
Tasks: 225 total,   1 running, 224 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.5 us, 10.0 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1421760 total,   104516 free,   777344 used,   539900 buff/cache
KiB Swap:  2097148 total,  2071152 free,    25996 used.   456028 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
Total status D:

sar

sar -u 3 shows overall CPU consumption as percentages, sampled every three seconds:

[root@localhost ~]# sar -u 3
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020-05-01  _x86_64_        (2 CPU)

15:18:03        CPU     %user     %nice   %system   %iowait    %steal     %idle
15:18:06        all      0.00      0.00      0.17      0.00      0.00     99.83
15:18:09        all      0.00      0.00      0.17      0.00      0.00     99.83
15:18:12        all      0.17      0.00      0.17      0.00      0.00     99.66
15:18:15        all      0.00      0.00      0.00      0.00      0.00    100.00
15:18:18        all      0.00      0.00      0.00      0.00      0.00    100.00

%user: CPU usage in user space.
%nice: CPU usage of processes whose priority has been changed.
%system: CPU usage in kernel space.
%iowait: percentage of time the CPU spends waiting for I/O.
%steal: percentage of CPU time taken from this virtual machine's virtual CPU by the hypervisor.
%idle: percentage of time the CPU is idle.

Of these, the main ones to watch are %iowait and %idle:

If %iowait is too high, there is an I/O bottleneck on the disk.
If %idle is high but the system responds slowly, the CPU may be waiting for memory to be allocated, and memory should be increased.
If %idle stays below 10, the system's CPU capacity is relatively low, and the CPU is the resource that most needs attention.

Locating the most CPU-intensive thread in production

Preparation

Start a program. arthas-demo is a simple program that generates a random number every second, performs prime factor decomposition, and prints the result.

curl -O https://alibaba.github.io/arthas/arthas-demo.jar
java -jar arthas-demo.jar

[root@localhost ~]# curl -O https://alibaba.github.io/arthas/arthas-demo.jar
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3743  100  3743    0     0   3022      0  0:00:01  0:00:01 --:--:--  3023
[root@localhost ~]# java -jar arthas-demo.jar
1813=7*7*37
illegalArgumentCount:  1, number is: -180005, need >= 2
illegalArgumentCount:  2, number is: -111175, need >= 2
18505=5*3701
166691=7*23813
105787=11*59*163
60148=2*2*11*1367
196983=3*3*43*509
illegalArgumentCount:  3, number is: -173479, need >= 2
illegalArgumentCount:  4, number is: -112840, need >= 2
39502=2*19751
....

Find the process consuming the most CPU with the top command

[root@localhost ~]# top
top - 11:11:05 up 20:02,  3 users,  load average: 0.09, 0.07, 0.05
Tasks: 225 total,   1 running, 224 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1421760 total,   135868 free,   758508 used,   527384 buff/cache
KiB Swap:  2097148 total,  2070640 free,    26508 used.   475852 avail Mem
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 98344 root      20   0 2422552  23508  12108 S   0.7  1.7   0:00.32 java
     1 root      20   0  194100   6244   3184 S   0.0  0.4   0:20.41 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.12 kthreadd
     4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:20.25 ksoftirqd/0

The process number found is 98344.

Find the most CPU-consuming thread in the process

Use the ps -Lp <pid> cu command to list the CPU consumption of each thread in a process:

[root@localhost ~]# ps -Lp 98344 cu
USER        PID    LWP %CPU NLWP %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      98344  98344  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:00 java
root      98344  98345  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:04 java
root      98344  98346  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:01 VM Thread
root      98344  98347  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:00 Reference Handl
root      98344  98348  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:00 Finalizer
root      98344  98349  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:00 Signal Dispatch
root      98344  98350  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:05 C2 CompilerThre
root      98344  98351  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:00 C1 CompilerThre
root      98344  98352  0.0   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:00 Service Thread
root      98344  98353  0.1   10  4.1 2422552 59060 pts/0   Sl+  11:09   0:19 VM Periodic Tas

The TIME column shows which threads have consumed the most CPU time, and the LWP column gives each thread's ID. The thread ID has to be converted to hexadecimal before it can be looked up in the thread stack dump. The steps below follow thread 98345.

Get the hexadecimal thread ID

Use the printf '%x\n' 98345 command to convert the ID from decimal to hexadecimal:

[root@localhost ~]# printf '%x\n' 98345
18029
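
jstack prints native thread IDs as nid=0x..., so it can be convenient to produce the ID in exactly that form. The helper below is a small sketch; tid_to_nid is a name made up for this example.

```shell
# Hypothetical helper: converts a decimal thread ID (the LWP column of ps)
# into the hexadecimal form that jstack prints after "nid=".
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Usage (pid and tid taken from the ps -Lp output above):
#   jstack 98344 | grep -A 10 "nid=$(tid_to_nid 98345)"
```

Matching on the full nid=0x18029 string is slightly stricter than grepping for 18029 alone, which could also match other numbers in the dump such as addresses.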

View Thread Stack Information

Use jstack to get the stack information, with jstack 98344 | grep -A 10 18029:

[root@localhost ~]# jstack 98344 | grep -A 10 18029
"main" #1 prio=5 os_prio=0 tid=0x00007fb88404b800 nid=0x18029 waiting on condition [0x00007fb88caab000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at java.lang.Thread.sleep(Thread.java:340)
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
        at demo.MathGame.main(MathGame.java:17)

"VM Thread" os_prio=0 tid=0x00007fb8840f2800 nid=0x1802a runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007fb884154000 nid=0x18031 waiting on condition

From the output we can see that the code this thread is spending its time in is demo.MathGame.main(MathGame.java:17).

grep -C 5 foo file shows the lines in file that match foo, plus the 5 lines before and after each match
grep -B 5 foo file shows the matching lines and the 5 lines before them
grep -A 5 foo file shows the matching lines and the 5 lines after them

Network Bottleneck

Locate Packet Loss, Packet Error

watch more /proc/net/dev is used to locate packet drops and packet errors when looking for network bottlenecks. Focus on drop (packets discarded) and on the total number of packets transferred, and make sure traffic does not exceed the network's limits:

[root@localhost ~]# watch -n 2 more /proc/net/dev
Every 2.0s: more /proc/net/dev                                                                                                                                                   Fri May  1 17:16:55 2020

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:   10025     130    0    0    0     0          0         0    10025     130    0    0    0     0       0          0
 ens33: 759098071  569661    0    0    0     0          0         0 19335572  225551    0    0    0     0       0          0

The leftmost column is the interface name; Receive covers received packets and Transmit covers sent packets.
bytes: the number of bytes received or sent;
packets: the number of packets received or sent correctly;
errs: the number of packets that had errors when received or sent;
drop: the number of packets dropped;
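
To pull just the error and drop counters out of this file, something like the following can be used. It is a sketch that assumes the 16-counter layout shown above; net_drops is a hypothetical name.

```shell
# Hypothetical helper: reads /proc/net/dev-style text on stdin and prints
# the receive/transmit error and drop counters for each interface.
net_drops() {
  awk 'NR > 2 {
    sub(/:/, " ")   # the interface name may be glued to the first counter
    printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", $1, $4, $5, $12, $13
  }'
}

# Usage: net_drops < /proc/net/dev
```

The counters are cumulative since boot, so what matters when troubleshooting is whether they keep growing between two readings.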

View addresses where routes pass

traceroute <ip> shows the addresses a route passes through, and is often used to measure the network time spent on each hop, for example:

[root@localhost ~]# traceroute 14.215.177.38
traceroute to 14.215.177.38 (14.215.177.38), 30 hops max, 60 byte packets
 1  CD-HZTK5H2.mshome.net (192.168.137.1)  0.126 ms * *
 2  * * *
 3  10.250.112.3 (10.250.112.3)  12.587 ms  12.408 ms  12.317 ms
 4  172.16.227.230 (172.16.227.230)  2.152 ms  2.040 ms  1.956 ms
 5  172.16.227.202 (172.16.227.202)  11.884 ms  11.746 ms  12.692 ms
 6  172.16.227.65 (172.16.227.65)  2.665 ms  3.143 ms  2.923 ms
 7  171.223.206.217 (171.223.206.217)  2.834 ms  2.752 ms  2.654 ms
 8  182.150.18.205 (182.150.18.205)  5.145 ms  5.815 ms  5.542 ms
 9  110.188.6.33 (110.188.6.33)  3.514 ms 171.208.199.185 (171.208.199.185)  3.431 ms 171.208.199.181 (171.208.199.181)  10.768 ms
10  202.97.29.17 (202.97.29.17)  29.574 ms 202.97.30.146 (202.97.30.146)  32.619 ms *
11  113.96.5.126 (113.96.5.126)  36.062 ms 113.96.5.70 (113.96.5.70)  35.940 ms 113.96.4.42 (113.96.4.42)  45.859 ms
12  90.96.135.219.broad.fs.gd.dynamic.163data.com.cn (219.135.96.90)  35.680 ms  35.468 ms  35.304 ms
13  14.215.32.102 (14.215.32.102)  35.135 ms 14.215.32.110 (14.215.32.110)  35.613 ms 14.29.117.242 (14.29.117.242)  54.712 ms
14  * 14.215.32.134 (14.215.32.134)  49.518 ms 14.215.32.122 (14.215.32.122)  47.652 ms
15  * * *
...

View network errors

netstat -i shows network errors:

[root@localhost ~]# netstat -i
Kernel Interface table
Iface             MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
ens33            1500   570291      0      0 0        225897      0      0      0 BMRU
lo              65536      130      0      0 0           130      0      0      0 LRU

Iface: name of the network interface;
MTU: the maximum transmission unit, which limits the maximum length of a data frame. Each network type has its own upper limit; for example, the MTU of Ethernet is 1500;
RX-OK: the number of packets received correctly.
RX-ERR: the number of packets that had errors when received.
RX-DRP: the number of packets dropped when received.
RX-OVR: the number of packets lost on receive because of overruns (data is lost because the receiving device cannot keep up with the rate at which it is sent).
TX-OK: the number of packets sent correctly.
TX-ERR: the number of packets that had errors when sent.
TX-DRP: the number of packets dropped when sent.
TX-OVR: the number of packets lost on send because of overruns.
Flg: flags.
  B: a broadcast address has been set.
  L: the interface is a loopback device.
  M: all packets are received (promiscuous mode).
  N: trailers are avoided.
  O: ARP is disabled on this interface.
  P: this is a point-to-point link.
  R: the interface is running.
  U: the interface is up.

Packet Retransmit Rate

cat /proc/net/snmp is used to view and analyze network packet volume, traffic, packet errors, and packet drops over a 240-second window. RetransSegs and OutSegs are used to calculate the TCP retransmission rate: tcpetr = RetransSegs / OutSegs.

[root@localhost ~]# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 241708 0 0 0 0 0 238724 225517 15 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 149 0 0 50 99 0 0 0 0 0 0 0 0 0 147 0 147 0 0 0 0 0 0 0 0 0 0
IcmpMsg: InType3 InType11 OutType3
IcmpMsg: 50 99 147
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 376 6 0 0 4 236711 223186 292 0 4 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 1405 438 0 1896 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
UdpLite: 0 0 0 0 0 0 0

Retransmission rate = 292 / 223186 ≈ 0.13%
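
The same calculation can be scripted. The sketch below looks the two columns up by name rather than by position, since the number of Tcp fields may differ between kernel versions; retrans_rate is a name invented for this example.

```shell
# Hypothetical helper: reads /proc/net/snmp-style text on stdin and prints
# RetransSegs / OutSegs as a percentage with two decimals.
retrans_rate() {
  awk '
    $1 == "Tcp:" && !seen {              # first Tcp: line holds the column names
      for (i = 2; i <= NF; i++) col[$i] = i
      seen = 1
      next
    }
    $1 == "Tcp:" {                       # second Tcp: line holds the values
      if ($(col["OutSegs"]) > 0)
        printf "%.2f\n", 100 * $(col["RetransSegs"]) / $(col["OutSegs"])
    }'
}

# Usage: retrans_rate < /proc/net/snmp
```

Because the counters are cumulative since boot, this gives the long-run average; for a rate over a recent window, the difference between two readings would have to be used instead.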

Average number of new TCP connections per second: take the increase in PassiveOpens over the last 240 seconds from /proc/net/snmp and divide it by 240;
Number of TCP connections on the machine: read CurrEstab from /proc/net/snmp;
Average UDP datagrams received per second: take the increase in InDatagrams over the last 240 seconds from /proc/net/snmp and divide it by 240;
Average UDP datagrams sent per second: take the increase in OutDatagrams over the last 240 seconds from /proc/net/snmp and divide it by 240;
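
Each of these averages is the difference between two counter readings divided by the interval. As a minimal sketch (rate_per_sec and the sample readings are hypothetical):

```shell
# Hypothetical helper: average per-second increase of a cumulative counter.
# Arguments: old reading, new reading, interval in seconds.
rate_per_sec() {
  awk -v old="$1" -v new="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (new - old) / secs }'
}

# Example: PassiveOpens went from 6 to 246 over 240 seconds.
#   rate_per_sec 6 246 240
```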

Disk Bottleneck

Check disk space

View Disk Remaining Space

View remaining disk space using the df -hl command:

[root@localhost ~]# df -hl
Filesystem                     Size  Used Avail Use% Mounted on
devtmpfs                       678M     0  678M    0% /dev
tmpfs                          695M     0  695M    0% /dev/shm
tmpfs                          695M   28M  667M    4% /run
tmpfs                          695M     0  695M    0% /sys/fs/cgroup
/dev/mapper/centos_aubin-root   27G  5.6G   22G   21% /
/dev/sda1                     1014M  211M  804M   21% /boot
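
When there are many mount points, it can help to list only the file systems above a usage threshold. This is a sketch assuming the six-column layout above; df_over is a made-up name, and the POSIX df -P option is suggested in the usage line so that long device names do not wrap onto a second line.

```shell
# Hypothetical helper: reads `df` output on stdin and prints the mount point
# and usage of every file system at or above the given Use% threshold.
df_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                  # strip the percent sign before comparing
    if (use + 0 >= limit) print $6, $5
  }'
}

# Usage: df -Pl | df_over 80
```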

View Disk Used Space

The du -sh command shows disk space usage, where "used disk space" means the space used by the entire file hierarchy under the specified file or directory. Without an argument, du reports the disk space used by the current directory. In short, it shows how much disk space a file or directory occupies:

[root@localhost ~]# du -sh
64K

-h: prints sizes in human-readable units, such as 10K, 10M, or 10G.
-s: prints only the total size of the file or the entire directory (in KB by default when -h is not given).

See man du for details.

View Disk Read and Write

View overall disk read and write

Use iostat to see overall disk reads and writes:

[root@localhost ~]# iostat
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020-05-02  _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.17    0.00    0.20    0.46    0.00   99.17

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.56        30.45        39.61    4659620    6060644
scd0              0.00         0.02         0.00       3102          0
dm-0              1.96        30.01        38.42    4591998    5878155
dm-1              0.09         0.09         0.30      13840      45328

tps: the number of transfers (I/O requests) per second issued to the device;
kB_read/s: the amount of data read from the device per second, in kB;
kB_wrtn/s: the amount of data written to the device per second, in kB;
kB_read: the total amount of data read;
kB_wrtn: the total amount of data written;

View disk read-write details

iostat -x 1 3 shows detailed disk read and write statistics, printing one report per second, three times. When I/O wait accounts for a high proportion of CPU time, first check whether the machine is using a lot of swap space, and check whether %iowait accounts for a large share of CPU consumption. If it does, the disk is a significant bottleneck. Also watch await, the disk response time, which should generally be below 5 ms:

[root@localhost ~]# iostat -x 1 3
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020-05-02  _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.17    0.00    0.20    0.46    0.00   99.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.01     0.49    0.63    0.95    30.59    39.78    89.58     0.34  214.23   49.16  323.48   8.55   1.34
scd0              0.00     0.00    0.00    0.00     0.02     0.00    98.48     0.00    1.21    1.21    0.00   0.95   0.00
dm-0              0.00     0.00    0.62    1.35    30.15    38.59    69.70     0.91  460.67   49.12  648.54   6.66   1.31
dm-1              0.00     0.00    0.02    0.07     0.09     0.30     8.52     0.04  442.74   95.43  521.17   6.91   0.06

avg-cpu shows overall CPU usage statistics; on multi-core machines these are averages across all CPUs:

%user: percentage of time the CPU spends in user mode.
%nice: percentage of time the CPU spends in user mode running processes with a nice value.
%system: percentage of time the CPU spends in system mode.
%iowait: percentage of time the CPU spends waiting for I/O to complete. If %iowait is too high, there is an I/O bottleneck on the disk.
%steal: percentage of time a virtual CPU waits involuntarily while the hypervisor services another virtual processor.
%idle: percentage of time the CPU is idle. A high %idle means the CPU is mostly idle. If %idle is high but the system responds slowly, the CPU may be waiting for memory to be allocated, and memory should be increased. If %idle stays below 10, CPU capacity is relatively low, and the CPU is the resource that most needs attention.

Device represents device information:

rrqm/s: number of read requests to the device merged per second (the file system merges requests that read the same block)
wrqm/s: number of write requests to the device merged per second
r/s: number of reads completed per second
w/s: number of writes completed per second
rkB/s: amount of data read per second, in kB
wkB/s: amount of data written per second, in kB
avgrq-sz: average size of each I/O request, in sectors
avgqu-sz: average length of the queue of I/O requests waiting to be processed
await: average time per I/O request in milliseconds, including both queueing time and service time
svctm: average service time per I/O request, in milliseconds
%util: percentage of each second spent on I/O. If %util is close to 100%, there are too many I/O requests and the I/O system is saturated. If %idle is below 70%, I/O pressure is high, and reads typically show noticeable waits.
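
The 5 ms guideline for await mentioned earlier can be checked with a small filter. This sketch finds the await column by name from the Device header, so it assumes a sysstat version whose iostat -x output still has an await column, as in the sample above; iostat_slow is a hypothetical name.

```shell
# Hypothetical helper: reads `iostat -x` output on stdin and prints each
# device whose await exceeds the limit in milliseconds (default 5).
iostat_slow() {
  awk -v limit="${1:-5}" '
    /^Device/ {                          # remember the position of every column
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("await" in col) && NF >= col["await"] && $(col["await"]) + 0 > limit + 0 {
      print $1, $(col["await"])
    }'
}

# Usage: iostat -x 1 3 | iostat_slow 5
```

Applied to the sample output above, this would list sda, dm-0, and dm-1, whose await values are far above 5 ms.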

iostat -xmd 1 3: the added -m option reports throughput in MB instead of kB, and -d limits the output to the device report.

View processes that consume the most IO

In general, use iostat first to check whether there is an I/O bottleneck, then use iotop to find the process that consumes the most I/O:

[root@localhost ~]# iotop
Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
123931 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.02 % [kworker/1:30]
 94208 be/4 xiaolyuh    0.00 B/s    0.00 B/s  0.00 %  0.00 % nautilus-desktop --force [gmain]
     1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % systemd --system --deserialize 62
     2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
 94211 be/4 xiaolyuh    0.00 B/s    0.00 B/s  0.00 %  0.00 % gvfsd-trash --spawner :1.4 /org/gtk/gvfs/exec_spaw/0
     4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:0H]
     6 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
     7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
     8 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_bh]
     9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]
    10 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [lru-add-drain]
...

iotop -p PID shows the I/O of a single process:

[root@localhost ~]# iotop -p 124146
Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
124146 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % java -jar arthas-demo.jar

Application bottleneck

View the PID of a process

For example, to find the PID of a Java process, use ps -ef | grep java:

[root@localhost ~]# ps -ef | grep java
root     124146   1984  0 09:13 pts/0    00:00:06 java -jar arthas-demo.jar
root     125210  98378  0 10:07 pts/1    00:00:00 grep --color=auto java

View the number of specific processes

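For example, to count Java processes, use ps -ef | grep java | wc -l. The count usually includes the grep process itself, so the result below corresponds to a single Java process: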

[root@localhost ~]# ps -ef | grep java| wc -l
2
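Because grep shows up in its own ps listing, the raw count is one too high. The sketch below filters the helper out; the function name count_procs is hypothetical, and it simply skips any line that mentions grep, which is an assumption that would also hide a real process with grep in its command line.

```shell
# Count lines of `ps -ef` output (read from stdin) that mention a keyword,
# skipping lines that mention grep so the helper process is not counted.
count_procs() {
    awk -v kw="$1" 'index($0, kw) && !index($0, "grep") { n++ } END { print n + 0 }'
}

# Example:
# ps -ef | count_procs java
```

Where procps-ng is installed, pgrep -c -f java is a shorter alternative that should give the same kind of count.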

Check threads for deadlocks

To check whether threads are deadlocked, run jstack -l pid:

[root@localhost ~]# jstack -l 124146
2020-05-02 10:13:38
Full thread dump OpenJDK 64-Bit Server VM (25.252-b09 mixed mode):

"C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f27f013c000 nid=0x1e4f9 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f27f012d000 nid=0x1e4f8 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"main" #1 prio=5 os_prio=0 tid=0x00007f27f004b800 nid=0x1e4f3 waiting on condition [0x00007f27f7274000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at java.lang.Thread.sleep(Thread.java:340)
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
        at demo.MathGame.main(MathGame.java:17)

   Locked ownable synchronizers:
        - None
...
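A full thread dump is long, so it can help to summarize it. The sketch below counts threads per state and checks for the "Java-level deadlock" report that HotSpot appends to a dump when it detects a deadlock. The function names thread_states and has_deadlock are hypothetical helpers, and both read a dump from stdin.

```shell
# Count threads per java.lang.Thread.State in a jstack dump read from stdin.
thread_states() {
    awk '/java.lang.Thread.State:/ { n[$2]++ } END { for (s in n) print s, n[s] }' |
        LC_ALL=C sort
}

# Exit 0 if the dump on stdin contains HotSpot's deadlock report, non-zero otherwise.
has_deadlock() {
    grep -q 'Java-level deadlock'
}

# Example:
# jstack -l 124146 > dump.txt
# thread_states < dump.txt
# has_deadlock < dump.txt && echo "deadlock found"
```

Saving the dump to a file first lets both checks run against the same snapshot.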

View the number of threads in a process

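Use ps -efL | grep [PID] | wc -l. The result can be slightly high, because it also counts the grep process and any other line that happens to contain the number, for example: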

[root@localhost ~]# ps -efL | grep 124146 | wc -l
12
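To avoid counting unrelated lines, the PID column can be matched exactly instead of grepping the whole line. This is a small sketch under the assumption that the input is ps -efL output, where the PID is the second column; count_threads is a hypothetical name.

```shell
# Count threads of one process from `ps -efL` output on stdin by matching
# the PID column (field 2) exactly, so the grep helper is never counted.
count_threads() {
    awk -v pid="$1" '$2 == pid { n++ } END { print n + 0 }'
}

# Example:
# ps -efL | count_threads 124146
```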

To list the threads of a process, use ps -Lp [pid] cu:

[root@localhost ~]# ps -Lp 124146 cu
USER        PID    LWP %CPU NLWP %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     124146 124146  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 java
root     124146 124147  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:01 java
root     124146 124148  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 VM Thread
root     124146 124149  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 Reference Handl
root     124146 124150  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 Finalizer
root     124146 124151  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 Signal Dispatch
root     124146 124152  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 C2 CompilerThre
root     124146 124153  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 C1 CompilerThre
root     124146 124154  0.0   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:00 Service Thread
root     124146 124155  0.1   11  2.5 2489116 35724 pts/0   Sl+  09:13   0:05 VM Periodic Tas
root     124146 125362  0.0   11  2.5 2489116 35724 pts/0   Sl+  10:13   0:00 Attach Listener
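The LWP column here and the nid field in the jstack output refer to the same native thread ID; ps prints it in decimal and jstack in hexadecimal. A one-line conversion makes it possible to look up a busy thread in the dump. The helper name lwp_to_nid is hypothetical.

```shell
# Convert a decimal LWP (thread ID from ps or top -H) to the hexadecimal
# nid format that jstack prints.
lwp_to_nid() {
    printf '0x%x\n' "$1"
}

# Example: LWP 124147 corresponds to nid=0x1e4f3, the "main" thread in the jstack output.
# jstack 124146 | grep -A 5 "nid=$(lwp_to_nid 124147)"
```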

Find lines containing ERROR in all log files

find / -type f -name '*.log' | xargs grep 'ERROR' prints every matching line together with its file name, which is useful when troubleshooting:

[root@localhost ~]# find / -type f -name "*.log" | xargs grep "ERROR"
/var/log/tuned/tuned.log:2020-03-13 18:05:59,145 ERROR    tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' error: '[Errno 19] No such device'
/var/log/tuned/tuned.log:2020-03-13 18:05:59,145 ERROR    tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor' error: '[Errno 19] No such device'
/var/log/tuned/tuned.log:2020-04-28 14:55:34,857 ERROR    tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' error: '[Errno 19] No such device'
/var/log/tuned/tuned.log:2020-04-28 14:55:34,859 ERROR    tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor' error: '[Errno 19] No such device'
/var/log/tuned/tuned.log:2020-04-28 15:23:19,037 ERROR    tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' error: '[Errno 19] No such device'
...
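When the goal is a count rather than the matching lines themselves, grep -c can report the number of ERROR lines per file. The following is a sketch, not a drop-in replacement: error_counts is a hypothetical name, the search is limited to one directory (defaulting to /var/log as an assumption) instead of /, and it relies on the -H option of GNU or BSD grep.

```shell
# Print "count path" for each *.log file under a directory that contains at
# least one ERROR line, highest count first. Using -exec ... + instead of
# xargs keeps file names with spaces intact.
error_counts() {
    find "${1:-/var/log}" -type f -name '*.log' -exec grep -cH 'ERROR' {} + 2>/dev/null |
        awk -F: '$NF > 0 { n = $NF; sub(/:[0-9]+$/, ""); print n, $0 }' |
        sort -rn
}

# Example:
# error_counts /var/log
```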

Specify JVM parameters at application startup

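JVM parameters are passed on the java command line, as in java -jar -Xms128m -Xmx1024m -Xss512k -XX:PermSize=128m -XX:MaxPermSize=64m -XX:NewSize=64m -XX:MaxNewSize=256m arthas-demo.jar. As the warnings in the output show, PermSize and MaxPermSize are ignored on Java 8 and later: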

[root@localhost ~]# java -jar -Xms128m -Xmx1024m -Xss512k -XX:PermSize=128m -XX:MaxPermSize=64m -XX:NewSize=64m -XX:MaxNewSize=256m  arthas-demo.jar
OpenJDK 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=64m; support was removed in 8.0
157518=2*3*3*3*2917
illegalArgumentCount:  1, number is: -187733, need >= 2
illegalArgumentCount:  2, number is: -102156, need >= 2
173379=3*57793
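On Java 8 and later, class metadata lives in Metaspace, so the ignored PermSize options are normally replaced by -XX:MetaspaceSize and -XX:MaxMetaspaceSize. The sketch below shows one way to write the same launch with those flags placed before -jar. The function name start_app, the JAVA_BIN override, and the Metaspace sizes are all illustrative assumptions rather than tuning advice.

```shell
# Launch a jar with heap, stack and Metaspace limits.
# JAVA_BIN is a hypothetical override so the command can be dry-run with echo.
start_app() {
    "${JAVA_BIN:-java}" \
        -Xms128m -Xmx1024m -Xss512k \
        -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=128m \
        -XX:NewSize=64m -XX:MaxNewSize=256m \
        -jar "$1"
}

# Dry run that only prints the arguments:
# JAVA_BIN=echo start_app arthas-demo.jar
```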

Summary

For any Linux command, you can use --help or man to view its help information:

[root@localhost ~]# grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
...

[root@localhost ~]# man grep

GREP(1)                                                        General Commands Manual                                                        GREP(1)

NAME
       grep, egrep, fgrep - print lines matching a pattern

SYNOPSIS
       grep [OPTIONS] PATTERN [FILE...]
       grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]

DESCRIPTION
       grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN.
...

Category	Monitoring command	Description	Remarks
Memory bottleneck	free	View memory usage
	vmstat 3 (interval) 100 (number of samples)	Check swap in/out for performance bottlenecks	Recommended
	sar -r 3	Similar to the free command; views memory usage but does not include swap
CPU bottleneck	top -H	Show threads sorted by CPU consumption
	ps -Lp [pid] cu	View the CPU consumption of each thread in a process
	cat /proc/cpuinfo | grep 'processor' | wc -l	View the number of CPU cores
	top	View total CPU consumption, broken down into user, system, idle, nice, etc.
	top, then Shift+H: show threads; Shift+M: sort by memory usage; Shift+P: sort by CPU usage; Shift+T: sort by cumulative CPU time; press 1 to show each CPU core separately	Dedicated performance checks; on multi-core machines, look mainly at the load on each CPU core
	sar -u 3 (interval)	View total CPU consumption as a percentage
	sar -q	View CPU load
	top -b -n 1 | awk '{if (NR<=7)print;else if($8=="D"){print;count++}}END{print "Total status D:"count}'	Count the tasks in uninterruptible sleep (state D). These tasks are included in the load average, for example when disk I/O is congested
Network bottleneck	cat /var/log/messages	View the kernel log to see whether packets were dropped
	watch more /proc/net/dev	Locate packet loss and packet errors to identify network bottlenecks	Focus on drops and on whether total packet traffic exceeds the network limit
	sar -n SOCK	View socket usage statistics
	netstat -na | grep ESTABLISHED | wc -l	View the number of established TCP connections	This command is CPU-intensive and is not suitable for long-running monitoring
	netstat -na | awk '{print $6}' | sort | uniq -c | sort -nr	View the number of TCP connections in each state
	netstat -i	View network errors
	ss state ESTABLISHED | wc -l	Count established TCP connections more efficiently
	cat /proc/net/snmp	View and analyze network packet volume, traffic, packet errors, and packet drops over 240 seconds	Used to calculate the retransmission rate tcpetr = RetransSegs / OutSegs
	ping $ip	Test network performance
	traceroute $ip	View the addresses a route passes through	Commonly used to measure the time spent on each routing segment
	dig $domain	View the address a domain name resolves to
	dmesg	View the system kernel log
Disk bottleneck	iostat -x -k -d 1	Detailed disk read/write statistics	When I/O wait takes up a high proportion of CPU time, first check whether the machine is using a lot of swap space. If iowait accounts for a large share of CPU time, the disk is likely a major bottleneck. Also watch await, the disk response time, which should stay below 5 ms
	iostat -x	View the read and write performance of each disk in the system	Focus on await and on the share of CPU time spent in iowait
	iotop	See which processes are doing heavy I/O	In general, use iostat first to check whether there is an I/O bottleneck, then locate the process responsible
	df -hl	View remaining disk space
	du -sh	View how much disk space a directory uses
Application bottleneck	ps -ef | grep java	View the PID of a process
	ps -ef | grep httpd | wc -l	View the number of specific processes
	cat *.log | grep 'Exception' | wc -l	Count the lines in log files that contain a specific exception
	jstack -l pid	Check whether threads are deadlocked
	awk '{print $8}' 2017-05-22-access_log | egrep '301|302' | wc -l	Count the lines with 301 or 302 status codes in the log; $8 means column 8 is the status code, which can be changed to match the actual log format	Commonly used to locate application faults
	grep 'wholesaleProductDetailNew' cookie_log | awk '{if($10=="200") print}' | awk '{print $12}' | more	Print column 12 of the lines containing specific data
	grep "2017:05:22" cookielog | awk '($12>0.3) {print $12"--"$8}' | sort > [output file]	Sort response times from Apache or nginx access logs; $12 is the response time in column 12 of the cookie log. Used to check whether some slow requests are lengthening the overall RT
	grep -v 'HTTP/1.1" 200'	Filter out requests with a 200 response code, leaving only the non-200 ones
	pgm -A -f [application cluster name] "grep '301' [log file path] | wc -l"	View the number of 301 status codes in the logs of the entire cluster
	ps -efL | grep [PID] | wc -l	View the number of threads created by a process
	find / -type f -name "*.log" | xargs grep "ERROR"	Find lines containing ERROR in all log files	This is useful when troubleshooting problems
	jstat -gc [pid]	View GC statistics
	jstat -gcnew [pid]	View memory usage in the young generation, including MTT (the maximum tenuring threshold, i.e. the maximum number of young collections an object survives before being promoted to the old generation) and TT (the current tenuring threshold)
	jstat -gcold [pid]	View memory usage in the old generation
	jmap -J-d64 -dump:format=b,file=dump.bin PID	Dump a heap snapshot	-J-d64 prevents jmap from crashing (a JDK 6 bug)
	-XX:+HeapDumpOnOutOfMemoryError	Add to the Java startup options; saves a heap snapshot when an OutOfMemoryError occurs
	jmap -histo [pid]	Show a histogram of objects sorted by memory size	Note that the :live variant (jmap -histo:live) triggers a full GC
	gcore [pid]	Export a complete memory snapshot (core dump)	Usually used together with jmap -permstat /opt/**/java gcore.bin to convert the core dump into a heap dump
	-XX:HeapDumpPath=/home/logs -Xloggc:/home/log/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps	Add to the Java startup options to print GC logs
	-server -Xms4000m -Xmx4000m -Xmn1500m -Xss256k -XX:PermSize=340m -XX:MaxPermSize=340m -XX:+UseConcMarkSweepGC	Adjust the JVM heap size	-Xss is the thread stack size
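
The retransmission rate mentioned for /proc/net/snmp (RetransSegs / OutSegs) can be computed directly from the two Tcp: lines in that file, where the first line holds the column names and the second the counters. The following is a minimal sketch; tcp_retrans_rate is a hypothetical name, and the optional file argument exists only so the function can be tried on a saved copy.

```shell
# Print the TCP retransmission rate (RetransSegs / OutSegs) as a percentage.
# Columns are located by name from the header line rather than by position.
tcp_retrans_rate() {
    awk '
        $1 == "Tcp:" && !seen {            # first Tcp: line lists the column names
            for (i = 2; i <= NF; i++) col[$i] = i
            seen = 1; next
        }
        $1 == "Tcp:" {                     # second Tcp: line holds the counters
            sent = $(col["OutSegs"]) + 0
            retrans = $(col["RetransSegs"]) + 0
            if (sent > 0) printf "%.2f%%\n", 100 * retrans / sent
        }
    ' "${1:-/proc/net/snmp}"
}

# Example:
# tcp_retrans_rate
```

The counters are cumulative, so the result is an average over the whole uptime; for a rate over a specific window, the difference between two samples would be needed.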

Posted by jdavidbakr on Mon, 04 May 2020 01:08:35 -0700
