Memory bottleneck
free
free is to view memory usage, including physical memory, swap memory, and kernel buffer memory.
Free-h-s 3 represents the output of memory every three seconds with the following commands
[1014154@cc69dd4c5-4tdb5 ~]$ free total used free shared buff/cache available Mem: 119623656 43052220 45611364 4313760 30960072 70574408 Swap: 0 0 0 [1014154@cc69dd4c5-4tdb5 ~]$ free -h -s 3 total used free shared buff/cache available Mem: 114G 41G 43G 4.1G 29G 67G Swap: 0B 0B 0B total used free shared buff/cache available Mem: 114G 41G 43G 4.1G 29G 67G Swap: 0B 0B 0B
- Mem: is the use of memory.
- Swap: is the usage of swap space.
- Total: The total amount of available physical memory and swap space for the system.
- Used: physical memory and swap space that has already been used.
- free: How much physical memory and swap space are available, which is the amount of physical memory that has not really been used.
- shared: The size of physical memory used by the share.
- buff/cache: The size of physical memory used by buffer s and caches.
- Available: The amount of physical memory that can also be used by an application, which is the amount of available memory from an application perspective, available_free + buffer + cache.
Swap space
swap space is an area on disk. When the system is running out of physical memory, Linux saves data that is not frequently accessed in memory to swap, so that the system has more physical memory to serve each process. When the system needs to access the contents stored on swap, it loads the data on swap into memory, which is often called swap out and swap in.swap space can alleviate memory shortages to some extent, but it does not perform very well because it requires reading and writing disk data.
vmstat (recommended)
vmstat (Virtual MeomoryStatistics) is a common tool for monitoring memory in Linux. It can monitor the overall situation of virtual memory, processes, CPU s, etc. of the operating system, and is recommended.
Vmstat 53 means that statistics are made every 5 seconds for a total of three times.
[1014154@cc69dd4c5-4tdb5 ~]$ vmstat 5 3 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 8 0 0 45453212 374768 30763728 0 0 14 99 1 1 11 10 78 0 1 10 0 0 4548 92 32 374768 30763360 0 0 2 1275 95118 97908 13 11 75 0 1 6 0 0 45452908 374768 30765148 0 0 0 3996 89924 92073 12 10 78 0 1
procs
r: Represents the number of processes running and waiting for CPU time slices (that is, how many processes are actually allocated to the CPU), which, if longer than the number of system CPUs, indicates that the CPU is insufficient and needs to be increased. b: Indicates the number of processes waiting for resources, such as I/O or memory swap.
memory
swpd: indicates the size of memory switched to the memory swap, that is, the size of virtual memory used (in KB). If it is greater than 0, your machine is running out of physical memory. If it is not the cause of the program memory leak, you should either upgrade your memory or migrate your memory-consuming tasks to another machine. Free: Indicates the physical memory that is currently free. Buffer: Represents the size of the buff er. Buffer is usually required for read and write of block devices Cache: Represents the size of the cache, which is usually buffered as a file system. Frequently accessed files will be cached. If the cache value is very large, it means more cached files. If the bi in io is small at this time, it means that the file system is more efficient.
swap
si: indicates that the data is read into memory by disk; in general, it is the size of virtual memory read from disk per second. If this value is greater than 0, it means that physical memory is not enough or memory is leaked, look for memory-consuming processes to resolve. so: Represents the size of data written to disk by memory, that is, entered into memory by the memory swap.
Note: In general, the values of si and so are all 0. If the values of si and so are not 0 for a long time, the system memory is insufficient and the system memory needs to be increased.
io
bi: Represents the total amount of data read by the block device, that is, the read disk, in kb/s bo: Represents the total amount of data written to the block device, i.e. to disk in kb/s
Note: If the value of bi+bo is too large and the wa value is large, it represents a system disk IO bottleneck.
system
in: Represents the number of device terminals observed per second over a time interval. cs: Indicates the number of context switches per second. The smaller the value, the better, it is too large. Consider lowering the number of threads or processes.For example, in web servers such as apache and nginx, when we do performance tests, we usually do thousands or even tens of thousands of concurrent tests. The process of selecting a web server can be downgraded by the peak value of the process or thread, and the process and number of threads can be measured until cs reaches a smaller value.System calls are also, every time a system function is called, our code enters the kernel space, causing context switching, which is a resource-intensive operation and should be avoided as often as possible.Too many context switches means that your CPU wastes most of its time switching contexts, resulting in less time for the CPU to do business and less CPU utilization, which is not desirable.
Note: The larger these two values are, the more CPU s will be consumed by the kernel.
CPU
us: Indicates the percentage of CPU time consumed by user processes. The higher the us value, the more CPU time consumed by user processes. If the CPU time is longer than 50%, then optimization programs or algorithms need to be considered. Sy: Represents the percentage of CPU time consumed by the system's kernel processes. Usually, us+sy should be less than 80%, if more than 80%, indicating that there may be a CPU bottleneck. id: represents the percentage of time that the CPU is in a spatial state. Wa: Indicates the percentage of CPU time spent waiting for IP. The higher wa value, the more serious I/O wait. The reference value of empirical Wa is 20%, if more than 20%, indicating that I/O wait is serious. The reason for I/O wait may be caused by a large number of random reads and writes on the disk, or by the loan bottleneck (mainly block operation) of the disk or monitor.
sar
SAR and free like sar-r 3 output memory information every three seconds:
[root@localhost ~]# sar -r 3 Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain) 2020 28 April 2004 _x86_64_ (2 CPU) 15 40 min 10 seconds kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty 15 40 min 13 seconds 106800 1314960 92.49 2144 573248 4110864 116.82 563664 498888 36 15 40 min 16 seconds 106816 1314944 92.49 2144 573248 4110864 116.82 563668 498888 36 15 40 min 19 seconds 106816 1314944 92.49 2144 573248 4110864 116.82 563668 498888 36
CPU Bottleneck
View machine cpu cores
Total Number of CPUs = Number of physical CPUs * Number of cores per physical CPU Total Logical CPUs = Number of Physical CPUs * Number of cores per Physical CPU * Number of hyperthreads
View CPU information (model)
[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c 32 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
View the number of physical CPU s
[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l 16
View the number of cores (that is, cores) in each physical CPU
[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo| grep "cpu cores"| uniq cpu cores : 2
View the number of logical CPU s
[1014154@cc69dd4c5-4tdb5 ~]$ cat /proc/cpuinfo| grep "processor"| wc -l 32
top
In the operating system of the Linux kernel, processes are dynamically scheduled based on virtual run time (calculated dynamically by process priority, nice value plus actual CPU time).When a process is executed, it needs to be converted from user state to kernel state. User space cannot directly manipulate functions of kernel space.System calls are usually used to complete process scheduling, while the user-to-kernel space conversion is usually done by soft interrupts.For example, to perform disk operations, the user state needs to invoke the disk operation instructions of the kernel through the system, so the time consumed by the CPU is divided into user state CPU consumption, system (kernel) CPU consumption, and disk operation CPU consumption.When a process is executed, it needs a series of operations. First, the process is executed in the user state. During the execution, the process priority is adjusted (nice). Then, the process is invoked to the kernel through system calls. Hard and soft interrupts make the hardware execute tasks.After execution is complete, the system call is returned from the kernel state to the system call, and the system call returns the result to the user state process.
Top can view the total CPU consumption, including individual consumption, such as User, System, Idle, nice, and so on.Shift + H shows java threads; Shift + M sorts by memory usage; Shift + P sorts by CPU usage time (usage); Shift + T sorts by CPU cumulative usage time; multi-core CPUs enter top view 1 to see the load on each CPU.
top - 15:24:11 up 8 days, 7:52, 1 user, load average: 5.73, 6.85, 7.33 Tasks: 17 total, 1 running, 16 sleeping, 0 stopped, 0 zombie %Cpu(s): 13.9 us, 9.2 sy, 0.0 ni, 76.1 id, 0.1 wa, 0.0 hi, 0.1 si, 0.7 st KiB Mem : 11962365+total, 50086832 free, 38312808 used, 31224016 buff/cache KiB Swap: 0 total, 0 free, 0 used. 75402760 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 300 ymmapp 20 0 17.242g 1.234g 14732 S 2.3 1.1 9:40.38 java 1 root 20 0 15376 1988 1392 S 0.0 0.0 0:00.06 sh 11 root 20 0 120660 11416 1132 S 0.0 0.0 0:04.94 python 54 root 20 0 85328 2240 1652 S 0.0 0.0 0:00.00 su 55 ymmapp 20 0 17432 1808 1232 S 0.0 0.0 0:00.00 bash 56 ymmapp 20 0 17556 2156 1460 S 0.0 0.0 0:00.03 control.sh 57 ymmapp 20 0 11880 740 576 S 0.0 0.0 0:00.00 tee 115 ymmapp 20 0 17556 2112 1464 S 0.0 0.0 0:00.02 control_new_war 133 root 20 0 106032 4240 3160 S 0.0 0.0 0:00.03 sshd 134 ymmapp 20 0 17080 6872 3180 S 0.0 0.0 0:01.82 ops-updater 147 ymmapp 20 0 17956 2636 1544 S 0.0 0.0 0:00.07 control.sh 6538 ymmapp 20 0 115656 10532 3408 S 0.0 0.0 0:00.46 beidou-agent 6785 ymmapp 20 0 2572996 22512 2788 S 0.0 0.0 0:03.44 gatherinfo4dock 29241 root 20 0 142148 5712 4340 S 0.0 0.0 0:00.04 sshd 29243 1014154 20 0 142148 2296 924 S 0.0 0.0 0:00.00 sshd 29244 1014154 20 0 15208 2020 1640 S 0.0 0.0 0:00.00 bash 32641 1014154 20 0 57364 2020 1480 R 0.0 0.0 0:00.00 top
Line 1: 15:24:11 up 8 days, 7:52, 1 user, load average: 5.73, 6.85, 7.33: 15:24:11 system time, up 8 days running time, 1 user current number of logged-on users, load average load balancing, respectively, representing 1 minute, 5 minutes, and 15 minutes of load.
Line 2: Tasks: 17 total, 1 running, 16 sleeping, 0 stopped, 0 zombie: Total processes 17, runs 1, hibernates 16, stops 0, zombie processes 0.
Line 3:%Cpu(s): 13.9 us, 9.2 sy, 0.0 ni, 76.1 id, 0.1 wa, 0.0 hi, 0.1 si, 0.7 st: User space cpu accounts for 13.9%, kernel space cpu for 9.2%, priority-changed process cpu for 0%, idle cpu for 76.1%, IO wait for 0.1%, hard interrupt for 0.1%, soft interrupt for 0.1%, and current VM cpu clock is stolen by virtualization for 0.7%.
The fourth and fifth lines represent memory and swap area usage.
The seventh line indicates:
- PID:Process id
- USER: Process Owner
- PR: Process priority
- NI:nice value.Negative values indicate high priority and positive values indicate low priority
- VIRT: Virtual memory, total virtual memory used by the process, in kb.VIRT=SWAP+RES
- RES: Resident memory, the size of physical memory used by processes that has not been swapped out, in kb.RES=CODE+DATA
- SHR: Shared memory, shared memory size, in kb
- S: Process state.D=Uninterrupted Sleep State R=Run S=Sleep T=Track/Stop Z=Zombie Process
- %CPU: Percentage of CPU time consumed since last update
- %MEM: Percentage of physical memory used by the process
- TIME+: Total CPU time used by the process in 1/100 seconds
- COMMAND: Process name (command name/command line)
Calculate the number of uninterruptedsleep tasks in the cpu load
top -b -n 1 | awk '{if (NR<=7)print;else if($8=="D"){print;count++}}END{print "Total status D:"count}'
[root@localhost ~]# top -b -n 1 | awk '{if (NR<=7)print;else if($8=="D"){print;count++}}END{print "Total status D:"count}' top - 15:35:05 up 1 day, 26 min, 3 users, load average: 0.00, 0.01, 0.05 Tasks: 225 total, 1 running, 224 sleeping, 0 stopped, 0 zombie %Cpu(s): 2.5 us, 10.0 sy, 0.0 ni, 87.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 1421760 total, 104516 free, 777344 used, 539900 buff/cache KiB Swap: 2097148 total, 2071152 free, 25996 used. 456028 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND Total status D:
sar
Through sar-u 3, you can see the overall consumption of CUP as a percentage:
[root@localhost ~]# sar -u 3 Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain) 2020 01/05 _x86_64_ (2 CPU) 15 18 min 03 seconds CPU %user %nice %system %iowait %steal %idle 15 18 min 06 seconds all 0.00 0.00 0.17 0.00 0.00 99.83 15 18 min 09 seconds all 0.00 0.00 0.17 0.00 0.00 99.83 15 Hour 18 min 12 sec all 0.17 0.00 0.17 0.00 0.00 99.66 15 18 min 15 seconds all 0.00 0.00 0.00 0.00 0.00 100.00 15 18 min 18 seconds all 0.00 0.00 0.00 0.00 0.00 100.00
- %user: User space CPU usage.
- %nice: CPU usage of processes that have changed priority.
- %system: CPU utilization of kernel space.
- %iowait: The percentage of CPU s waiting for IO.
- %steal: The CPU used by the virtual machine CPU of the virtual machine.
- %idle: idle CPU.
Among the above displays, the main ones are%iowait and%idle:
- If the value of%iowait is too high, there is an I/O bottleneck on the hard disk.
- If the value of%idle is high but the system responds slowly, it may be that the CPU is waiting to allocate memory and should increase the memory capacity.
- If the value of%idle remains below 10, the CPU processing power of the system is relatively low, indicating that the most resource to be solved in the system is the CPU.
The most CPU-intensive thread on the positioning line
Dead work
Start a program. arthas-demo is a simple program that generates a random number every second, performs prime factor decomposition, and prints the result.
curl -O https://alibaba.github.io/arthas/arthas-demo.jar java -jar arthas-demo.jar
[root@localhost ~]# curl -O https://alibaba.github.io/arthas/arthas-demo.jar % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 3743 100 3743 0 0 3022 0 0:00:01 0:00:01 --:--:-- 3023 [root@localhost ~]# java -jar arthas-demo.jar 1813=7*7*37 illegalArgumentCount: 1, number is: -180005, need >= 2 illegalArgumentCount: 2, number is: -111175, need >= 2 18505=5*3701 166691=7*23813 105787=11*59*163 60148=2*2*11*1367 196983=3*3*43*509 illegalArgumentCount: 3, number is: -173479, need >= 2 illegalArgumentCount: 4, number is: -112840, need >= 2 39502=2*19751 ....
Find the most time-consuming process with the top command
[root@localhost ~]# top top - 11:11:05 up 20:02, 3 users, load average: 0.09, 0.07, 0.05 Tasks: 225 total, 1 running, 224 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 0.7 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 1421760 total, 135868 free, 758508 used, 527384 buff/cache KiB Swap: 2097148 total, 2070640 free, 26508 used. 475852 avail Mem Change delay from 3.0 to PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 98344 root 20 0 2422552 23508 12108 S 0.7 1.7 0:00.32 java 1 root 20 0 194100 6244 3184 S 0.0 0.4 0:20.41 systemd 2 root 20 0 0 0 0 S 0.0 0.0 0:00.12 kthreadd 4 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H 6 root 20 0 0 0 0 S 0.0 0.0 0:20.25 ksoftirqd/0
The process number found is 98344.
Find the most CUP-consuming thread in the process
Use the ps-Lp #pid Cu command to see the sort of thread CPU consumption in a process:
[root@localhost ~]# ps -Lp 98344 cu USER PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND root 98344 98344 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:00 java root 98344 98345 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:04 java root 98344 98346 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:01 VM Thread root 98344 98347 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:00 Reference Handl root 98344 98348 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:00 Finalizer root 98344 98349 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:00 Signal Dispatch root 98344 98350 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:05 C2 CompilerThre root 98344 98351 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:00 C1 CompilerThre root 98344 98352 0.0 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:00 Service Thread root 98344 98353 0.1 10 4.1 2422552 59060 pts/0 Sl+ 11:09 0:19 VM Periodic Tas
Looking at the TIME column, you can see that the thread is consuming more CUP. According to the LWP column, you can see the ID number of the thread, but you need to convert it to hexadecimal to query the thread stack information.
Gets the hexadecimal code of the thread id
Use the printf'%xn'98345 command for the binary conversion:
[root@localhost ~]# printf '%x\n' 98345 18029
View Thread Stack Information
Use jstack to get stack information jstack 98344 | grep-A 10 18029:
[root@localhost ~]# jstack 98344 | grep -A 10 18029 "main" #1 prio=5 os_prio=0 tid=0x00007fb88404b800 nid=0x18029 waiting on condition [0x00007fb88caab000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at java.lang.Thread.sleep(Thread.java:340) at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) at demo.MathGame.main(MathGame.java:17) "VM Thread" os_prio=0 tid=0x00007fb8840f2800 nid=0x1802a runnable "VM Periodic Task Thread" os_prio=0 tid=0x00007fb884154000 nid=0x18031 waiting on condition
From the command we can see that the corresponding time-consuming code for this thread is demo.MathGame.main(MathGame.java:17)
Grep-C 5 foo file shows the line in the file that matches the foo string and the top and bottom 5 lines Grep-B 5 foo file displays Foo and the first 5 lines Grep-A 5 foo file displays Foo and the last 5 lines
Network Bottleneck
Locate Packet Loss, Packet Error
Watch more/proc/net/dev is used to locate packet drops and packet errors in order to see network bottlenecks, focusing on drops (packets discarded) and the total number of network packet transfers, and not exceeding network limits:
[root@localhost ~]# watch -n 2 more /proc/net/dev Every 2.0s: more /proc/net/dev Fri May 1 17:16:55 2020 Inter-| Receive | Transmit face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed lo: 10025 130 0 0 0 0 0 0 10025 130 0 0 0 0 0 0 ens33: 759098071 569661 0 0 0 0 0 0 19335572 225551 0 0 0 0 0 0
- The leftmost is the name of the interface, Receive is the receiving package, Transmit is the sending package;
- Bytes: the number of bytes sent and received;
- Packets: indicates the correct amount of packets to send and receive;
- errs: indicates the number of packets that received or received errors;
- drop: indicates the number of packets dropped;
View addresses where routes pass
traceroute ip can view the addresses that routes pass through, and is often used to count the network time spent in each route segment, such as:
[root@localhost ~]# traceroute 14.215.177.38 traceroute to 14.215.177.38 (14.215.177.38), 30 hops max, 60 byte packets 1 CD-HZTK5H2.mshome.net (192.168.137.1) 0.126 ms * * 2 * * * 3 10.250.112.3 (10.250.112.3) 12.587 ms 12.408 ms 12.317 ms 4 172.16.227.230 (172.16.227.230) 2.152 ms 2.040 ms 1.956 ms 5 172.16.227.202 (172.16.227.202) 11.884 ms 11.746 ms 12.692 ms 6 172.16.227.65 (172.16.227.65) 2.665 ms 3.143 ms 2.923 ms 7 171.223.206.217 (171.223.206.217) 2.834 ms 2.752 ms 2.654 ms 8 182.150.18.205 (182.150.18.205) 5.145 ms 5.815 ms 5.542 ms 9 110.188.6.33 (110.188.6.33) 3.514 ms 171.208.199.185 (171.208.199.185) 3.431 ms 171.208.199.181 (171.208.199.181) 10.768 ms 10 202.97.29.17 (202.97.29.17) 29.574 ms 202.97.30.146 (202.97.30.146) 32.619 ms * 11 113.96.5.126 (113.96.5.126) 36.062 ms 113.96.5.70 (113.96.5.70) 35.940 ms 113.96.4.42 (113.96.4.42) 45.859 ms 12 90.96.135.219.broad.fs.gd.dynamic.163data.com.cn (219.135.96.90) 35.680 ms 35.468 ms 35.304 ms 13 14.215.32.102 (14.215.32.102) 35.135 ms 14.215.32.110 (14.215.32.110) 35.613 ms 14.29.117.242 (14.29.117.242) 54.712 ms 14 * 14.215.32.134 (14.215.32.134) 49.518 ms 14.215.32.122 (14.215.32.122) 47.652 ms 15 * * * ...
View network errors
Netstat-i can view network errors:
[root@localhost ~]# netstat -i Kernel Interface table Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg ens33 1500 570291 0 0 0 225897 0 0 0 BMRU lo 65536 130 0 0 0 130 0 0 0 LRU
- Iface: Name of the network interface;
- MTU: The maximum transmission unit, which limits the maximum length of a data frame and has an upper limit for different network types, such as: the MTU of an Ethernet is 1500;
- RX-OK: The correct number of packets received.
- RX-ERR: The number of packets that generated errors when received.
- RX-DRP: The number of packets dropped at the time of receipt.
- RX-OVR: The number of packets lost during reception due to excessive speed (in data transmission, data is lost because the receiving device cannot receive data that is transmitted at the sending rate).
- TX-OK: The correct number of packets to send.
- TX-ERR: The number of packets that generated errors when sent.
- TX-DRP: The number of packets dropped when sending.
- TX-OVR: The number of packets lost due to excessive speed when sending.
- Flg: Flag, B has set a broadcast address.L The interface is a loopback device.M receives all packets (chaotic mode).N Avoid tracing.O On this interface, ARP is disabled.P This is a point-to-point link.R interface is running.The U interface is in the Active state.
Packet Retransmit Rate
Cat/proc/net/snmp is used to view and analyze network packet volume, traffic, packet errors, and packet dropouts in 240 seconds.RetransSegs and OutSegs are used to calculate the retransmission rate tcpetr=RetransSegs/OutSegs.
[root@localhost ~]# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 241708 0 0 0 0 0 238724 225517 15 0 0 0 0 0 0 0 0 Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps Icmp: 149 0 0 50 99 0 0 0 0 0 0 0 0 0 147 0 147 0 0 0 0 0 0 0 0 0 0 IcmpMsg: InType3 InType11 OutType3 IcmpMsg: 50 99 147 Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors Tcp: 1 200 120000 -1 376 6 0 0 4 236711 223186 292 0 4 0 Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors Udp: 1405 438 0 1896 0 0 0 UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors UdpLite: 0 0 0 0 0 0 0
Retransmission rate = 292/223186_0.13%
- Average number of new TCP connections per second: PassiveOpens increment over the last 240 seconds from the / proc/net/snmp file, divided by 240 to get the average increment per second;
- Number of TCP connections on the machine: Number of TCP connections obtained through CurrEstab in/proc/net/snmp file;
- UDP receive datagrams per second on average: Increment of InDatagrams over the last 240 seconds from/proc/net/snmp file, divided by 240 to get UDP receive datagrams per second on average;
- UDP send datagrams per second on average: get the increase of OutDatagrams in the last 240 seconds through/proc/net/snmp file, divide by 240 to get UDP send datagrams per second on average;
Disk Bottleneck
Check disk space
View Disk Remaining Space
View remaining disk space using the df-hl command:
[root@localhost ~]# df -hl //File System Capacity Used Available%Mountpoint devtmpfs 678M 0 678M 0% /dev tmpfs 695M 0 695M 0% /dev/shm tmpfs 695M 28M 667M 4% /run tmpfs 695M 0 695M 0% /sys/fs/cgroup /dev/mapper/centos_aubin-root 27G 5.6G 22G 21% / /dev/sda1 1014M 211M 804M 21% /boot
View Disk Used Space
The du-sh command looks at the usage of disk space, where "used disk space" means the space used by the entire file hierarchy under a specified file. Without a given parameter, Du reports the disk space used by the current directory.This is simply to show the amount of disk space occupied by a file or directory:
[root@localhost ~]# du -sh 64K
- -h: Output file system partition usage, such as: 10KB, 10MB, 10GB, etc.
- -s: Displays the size of the file or the entire directory in KB by default.
Details of Du can be viewed through man du.
View Disk Read and Write
View overall disk read and write
Pass iostat to see the overall disk read and write:
[root@localhost ~]# iostat Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain) 2020 02/05 _x86_64_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 0.17 0.00 0.20 0.46 0.00 99.17 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 1.56 30.45 39.61 4659620 6060644 scd0 0.00 0.02 0.00 3102 0 dm-0 1.96 30.01 38.42 4591998 5878155 dm-1 0.09 0.09 0.30 13840 45328
- tps: the number of transmissions per second for this device.
- kB_read/s: The amount of data read from the device (drive expressed) per second;
- kB_wrtn/s: The amount of data written to the device (drive expressed) per second;
- kB_read: The total amount of data read;
- kB_wrtn: The total amount of data written;
View disk read-write details
By iostat-x 13 You can see the detailed read and write situation of the disk, output three times every second. When you see that I/O wait time accounts for a high proportion of CPU time, the first thing to check is whether the machine is using a lot of swap space, and at the same time pay attention to whether iowait accounts for a large proportion of CPU consumption. If it indicates that there is a large bottleneck on the disk, and pay attention to await, it means that the disk is responding.Should time be less than 5ms:
[root@localhost ~]# iostat -x 1 3 Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain) 2020 02/05 _x86_64_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 0.17 0.00 0.20 0.46 0.00 99.16 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.01 0.49 0.63 0.95 30.59 39.78 89.58 0.34 214.23 49.16 323.48 8.55 1.34 scd0 0.00 0.00 0.00 0.00 0.02 0.00 98.48 0.00 1.21 1.21 0.00 0.95 0.00 dm-0 0.00 0.00 0.62 1.35 30.15 38.59 69.70 0.91 460.67 49.12 648.54 6.66 1.31 dm-1 0.00 0.00 0.02 0.07 0.09 0.30 8.52 0.04 442.74 95.43 521.17 6.91 0.06
avg-cpu represents overall CPU usage statistics, where is the average for all CPUs for multicore cpu:
- %user: Percentage of time the CPU spends in user mode.
- %nice: Percentage of time CPU is in user mode with NICE value.
- %system: Percentage of time that the CPU is in system mode.
- %iowait: The percentage of time the CPU waits for input and output to complete, and if the value of%iowait is too high, it indicates that there is an I/O bottleneck on the hard disk.
- %steal: The percentage of unconscious wait time for a virtual CPU when the hypervisor maintains another virtual processor.
- %idle: The percentage of CPU idle time, if the value of%idle is high, indicates that the CPU is idler; if the value of%idle is high but the system responds slowly, the CPU may be waiting to allocate memory and should increase the memory capacity; if the value of%idle remains below 10, indicating that the CPU processing power is relatively low, the resource most needed to solve in the system is the CPU.
Device represents device information:
- rrqm/s: The number of times per second read requests to the device are merged, and the file system merges requests to read the same block
- wrqm/s: Number of merged write requests to the device per second
- r/s: number of reads per second
- w/s: number of writes per second
- rkB/s: The amount of data read per second in kB units
- wkB/s: The amount of data written per second in kB units
- avgrq-sz: Average amount of data per IO operation (number of sectors in units)
- avgqu-sz: Average IO request queue length waiting to be processed
- await: Average wait time per IO request (including wait time and processing time in milliseconds)
- svctm: Average processing time per IO request in milliseconds
- %util: How much time per second is spent on I/O If%util is close to 100%, there are too many I/O requests and the I/O system is full.idle less than 70% IO pressure is greater, generally read faster wait.
Iostat-xmd 1 3: The new m option allows you to use M as the unit of output.
View processes that consume the most IO
In general, iostat is used to check if there is an IO bottleneck, then the iotop command is used to locate the most expensive IO for that process:
[root@localhost ~]# iotop Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 123931 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.02 % [kworker/1:30] 94208 be/4 xiaolyuh 0.00 B/s 0.00 B/s 0.00 % 0.00 % nautilus-desktop --force [gmain] 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % systemd --system --deserialize 62 2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd] 94211 be/4 xiaolyuh 0.00 B/s 0.00 B/s 0.00 % 0.00 % gvfsd-trash --spawner :1.4 /org/gtk/gvfs/exec_spaw/0 4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H] 6 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0] 7 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0] 8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_bh] 9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched] 10 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [lru-add-drain] ...
The iotop-p PID allows you to view the IO of a single process:
[root@localhost ~]# iotop -p 124146 Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 124146 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % java -jar arthas-demo.jar
Application bottleneck
View the PID of a process
If you look at the pid of a Java process, ps-ef | grep java:
[root@localhost ~]# ps -ef | grep java root 124146 1984 0 09:13 pts/0 00:00:06 java -jar arthas-demo.jar root 125210 98378 0 10:07 pts/1 00:00:00 grep --color=auto java
View the number of specific processes
If you look at the number of java processes, ps-ef | grep java | wc-l:
[root@localhost ~]# ps -ef | grep java| wc -l 2
Check threads for deadlocks
To see if a thread is deadlocked, jstack -l pid:
[root@localhost ~]# jstack -l 124146 2020-05-02 10:13:38 Full thread dump OpenJDK 64-Bit Server VM (25.252-b09 mixed mode): "C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f27f013c000 nid=0x1e4f9 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE Locked ownable synchronizers: - None "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f27f012d000 nid=0x1e4f8 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE Locked ownable synchronizers: - None "main" #1 prio=5 os_prio=0 tid=0x00007f27f004b800 nid=0x1e4f3 waiting on condition [0x00007f27f7274000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at java.lang.Thread.sleep(Thread.java:340) at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) at demo.MathGame.main(MathGame.java:17) Locked ownable synchronizers: - None ...
View the number of threads in a process
Ps-efL | grep [PID] | wc-l, such as:
[root@localhost ~]# ps -efL | grep 124146 | wc -l 12
See which threads use ps-Lp [pid] cu:
[root@localhost ~]# ps -Lp 124146 cu USER PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND root 124146 124146 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 java root 124146 124147 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:01 java root 124146 124148 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 VM Thread root 124146 124149 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 Reference Handl root 124146 124150 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 Finalizer root 124146 124151 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 Signal Dispatch root 124146 124152 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 C2 CompilerThre root 124146 124153 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 C1 CompilerThre root 124146 124154 0.0 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:00 Service Thread root 124146 124155 0.1 11 2.5 2489116 35724 pts/0 Sl+ 09:13 0:05 VM Periodic Tas root 124146 125362 0.0 11 2.5 2489116 35724 pts/0 Sl+ 10:13 0:00 Attach Listener
Count rows containing Error characters in all log files
Find / -type f-name'*.log'| xargs grep'ERROR', which is useful in troubleshooting problems:
[root@localhost ~]# find / -type f -name "*.log" | xargs grep "ERROR" /var/log/tuned/tuned.log:2020-03-13 18:05:59,145 ERROR tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' error: '[Errno 19] No such device' /var/log/tuned/tuned.log:2020-03-13 18:05:59,145 ERROR tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor' error: '[Errno 19] No such device' /var/log/tuned/tuned.log:2020-04-28 14:55:34,857 ERROR tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' error: '[Errno 19] No such device' /var/log/tuned/tuned.log:2020-04-28 14:55:34,859 ERROR tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor' error: '[Errno 19] No such device' /var/log/tuned/tuned.log:2020-04-28 15:23:19,037 ERROR tuned.utils.commands: Writing to file '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' error: '[Errno 19] No such device' ...
Specify JVM parameters at application startup
Java-jar-Xms128m-Xmx1024m-Xss512k-XX:PermSize=128m-XX:MaxPermSize=64m-XX:NewSize=64m-XX:MaxNewSize=256m arthas-demo.jar, for example:
[root@localhost ~]# java -jar -Xms128m -Xmx1024m -Xss512k -XX:PermSize=128m -XX:MaxPermSize=64m -XX:NewSize=64m -XX:MaxNewSize=256m arthas-demo.jar OpenJDK 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0 OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=64m; support was removed in 8.0 157518=2*3*3*3*2917 illegalArgumentCount: 1, number is: -187733, need >= 2 illegalArgumentCount: 2, number is: -102156, need >= 2 173379=3*57793
summary
When using the linux command, if you want to see help you can use--help or man to view help information:
[root@localhost ~]# grep --help Usage: grep [options]... PATTERN [FILE]... Find PATTERN in each FILE or standard input. The default PTTERN is a basic regular expression (abbreviated BRE). For example: grep-i'Hello world'menu.h main.c ...
[root@localhost ~]# man grep GREP(1) General Commands Manual GREP(1) NAME grep, egrep, fgrep - Print rows matching a given pattern Overview SYNOPSIS grep [options] PATTERN [FILE...] grep [options] [-e PATTERN | -f FILE] [FILE...] Describe DESCRIPTION Grep searches FILE named file input (or standard input, if no file name is specified, or if the file name given is -) for files containing PATTERN with the given pattern ...
category | Monitoring commands | describe | Remarks |
---|---|---|---|
Memory bottleneck | free | View memory usage | |
Vmstat 3 (interval) 100 (monitoring times) | Check swap in/out for performance bottlenecks | Recommended Use | |
sar -r 3 | Similar to the free command, view memory usage, but do not include swap | ||
cpu bottleneck | top -H | Sort by cpu consumption | |
Ps-Lp process number cu | View cpu consumption ordering for a process | ||
cat /proc/cpuinfo |grep 'processor'|wc -l | View cpu cores | ||
top | View total cpu consumption, including sub-consumption such as user,system,idle,nice, etc. | ||
Top then shift+h: shows the java threads, and shift+M: sorts by memory usage; shift+P: sort by cpu time; shift+T: sort multi-core cpu by cpu cumulative usage time, and enter the top view by "1" | Dedicated performance checks, multi-core CPU s mainly look at the load on each core of CUP | ||
Sar-u 3 (interval) | View total cpu consumption as a percentage | ||
sar -q | View cpu load | ||
top -b -n 1 | awk '{if (NR<=7)print;else if($8=="D"){print;count++}}END{print "Total status D:"count}' | Calculates the number of uninterruptedsleep tasks in the cpu load. Tasks of uninterruptedsleep are counted in the cpu load, such as disk congestion | ||
Network Bottleneck | cat /var/log/messages | View the Kernel Log to see if the package was lost | |
watch more /proc/net/dev | Used to locate packet loss, packet error, to see network bottlenecks | Focus on the total number of drops and network packet transfers that do not exceed the network limit | |
sar -n SOCK | View network traffic | ||
netstat -na|grep ESTABLISHED|wc -l | View the number of successful tcp connections | This command consumes cpu in particular and is not suitable for prolonged monitoring data collection | |
netstat -na|awk'{print $6}'|sort |uniq -c |sort -nr | See the number of tcp States | ||
netstat -i | View network errors | ||
ss state ESTABLISHED| wc -l | More efficiently count the number of tcp connection states with ESTABLISHED s | ||
cat /proc/net/snmp | View and analyze network packet volume, traffic, packet errors, and packet dropouts in 240 seconds | Used to calculate retransmission rate tcpetr=RetransSegs/OutSegs | |
ping $ip | Testing network performance | ||
traceroute $ip | View addresses where routes pass | Commonly used to locate network time-consuming across routing segments | |
dig $domain name | View domain name resolution address | ||
dmesg | View System Kernel Log | ||
Disk Bottleneck | iostat -x -k -d 1 | Detailed list of disk reads and writes | When you see that I/O wait time accounts for a high proportion of CPU time, the first thing to check is whether the machine is using a lot of swap space, and at the same time pay attention to whether iowait accounts for a large proportion of CPU consumption, if it indicates that there is a large bottleneck on the disk, and pay attention to await, indicating that the response time of the disk is less than 5 ms |
iostat -x | View the read and write performance of each disk in the system | Focus on the percentage of cpu in await and iowait | |
iotop | See which process is reading IO in bulk | In general, iostat is used to see if there is an IO bottleneck before locating which process is reading IO in large quantities | |
df -hl | View Disk Remaining Space | ||
du -sh | See how much space your disk uses | ||
Application bottleneck | ps -ef | grep java | View the id number of a process |
ps -ef | grep httpd| wc -l | View the number of specific processes | ||
cat ***.log | grep ***Exception| wc -l | Statistics log file contains a specific number of exceptions | ||
jstack -l pid | Used to see if a thread is deadlocked | ||
awk'{print $8}' 2017-05-22-access_log|egrep '301|302'| wc -l | Count the number of lines of 301 and 302 status codes in log, $8 means column 8 is a status code, which can be changed according to the actual situation | Commonly used to apply fault location | |
grep 'wholesaleProductDetailNew' cookie_log | awk '{if($10=="200")}'print}' | awk 'print $12' | more | Print 12 columns of data containing specific data | ||
Grep "2017:05:22" cookielog | awk'($12>0.3) {print $12'--"$8}'| sort > directory address | Sort response times for apache or nginx access logs, $12 indicates response times for 12 lists in the cookie log. Used to troubleshoot if some access lengths cause overall RT lengthening | ||
grep -v 'HTTP/1.1" 200' | Remove URL s that are not 200 response codes | ||
Pgm-A-f $Application Cluster Name "grep"'301'log file address | wc-l" | View the number of 301 status codes in the log s of the entire cluster | ||
ps -efL | grep [PID] | wc -l | View the number of threads created by a process | ||
find / -type f -name "*.log" | xargs grep "ERROR" | Count rows containing Error characters in all log files | This is useful when troubleshooting problems | |
jstat -gc [pid] | View gc | ||
jstat -gcnew [pid] | View memory usage in the young area, including MTT (the maximum number of interactions is swapped to the old area), where TT is currently swapped | ||
jstat -gcold | View memory usage in old zone | ||
jmap -J-d64 -dump:format=b,file=dump.bin PID | dump out-of-memory snapshot | -J-d64 Prevents jmap from causing crash(jdk6 has bugs) | |
-XX:+HeapDumpOnOutOfMemeryError | Join at java startup, store memory snapshots when memory overflow occurs | ||
jmap -histo [pid] | Sort by object memory size | Note that this will result in full gc | |
gcore [pid] | Export completed memory snapshot | Usually used with jmap-permstat/opt/**/java gcore.bin to convert core dump to heap dump | |
-XX:HeapDumpPath=/home/logs -Xloggc:/home/log/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps | Add in Java startup parameters, print gc logs | ||
-server -Xms4000m -Xmx4000m -Xmn1500m -Xss256k -XX:PermSize=340m -XX:MaxPermSize=340m -XX:+UseConcMarkSweepGC | Adjust JVM heap size | xss is stack size |