Disk monitoring is the most basic kind of monitoring, yet it is often neglected by operations and maintenance staff, and that neglect leads to incidents. Readers of this article should take disk monitoring seriously.
The requirements for this case are as follows:
1) Check the disk status every minute
2) When disk space usage or inode usage is higher than 90%, send an email alert; assume the recipient is admin@admin.com
3) For each partition with over 90% usage, calculate the size of all its subdirectories and include the three largest in the body of the alert email
4) After the first alert, if the problem is not handled in time, repeat the alert every 30 minutes
5) The script runs every minute, so each run must check whether the previous run has finished; if it has not, the current run exits without doing anything
Point 1: View disk usage
Command: df
Run without options, df shows the total capacity, used space, and available space of each mounted filesystem, in 1K blocks by default. Common options are -i, -h, -k, and -m. The -i option reports inode usage instead of space usage, -h prints sizes in human-readable units such as G or M, and -k and -m print sizes in KB and MB, respectively.
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        16G  8.0G  7.9G  51% /
devtmpfs        903M     0  903M   0% /dev
tmpfs           912M     0  912M   0% /dev/shm
tmpfs           912M  8.7M  903M   1% /run
tmpfs           912M     0  912M   0% /sys/fs/cgroup
/dev/sda1       197M  113M   85M  58% /boot
tmpfs           183M     0  183M   0% /run/user/0
The first column is the partition name, the second is the total capacity of the partition, the third is how much has been used, the fourth is how much remains, the fifth is the percentage used, and the last is the mount point.
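The column layout described above is what makes the output easy to filter. As a minimal sketch, assuming the POSIX output format selected by df -P (one line per filesystem, usage percentage in column 5, mount point in column 6), the helper below prints every mount point above a threshold. The name list_over_threshold is hypothetical and not part of the reference script, and mount points containing spaces are not handled.

```shell
# list_over_threshold LIMIT
# Reads `df -P` or `df -Pi` output on stdin and prints "mountpoint percent"
# for every filesystem whose usage percentage is above LIMIT.
# (hypothetical helper name, used only for illustration)
list_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)               # "95%" -> "95"; "-" stays "-" and counts as 0
        if (pct + 0 > limit) print $6, pct
    }'
}

# Usage on a live system:
#   df -P  | list_over_threshold 90     # disk space
#   df -Pi | list_over_threshold 90     # inodes
```

Reading from stdin keeps the filter independent of df itself, so the same function serves both the space check and the inode check.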
Point 2: View directory or file size
Command: du
The du command is used to see how much space a directory or file occupies.
Syntax: du [-abckmsh] [filename or directory name]
Common parameters:
-a: Lists the size of every file as well as every directory. Without any options, du lists only the sizes of directories (including subdirectories). If no unit option is given, sizes are shown in KB.
Example command:
# du /tmp/test
0 /tmp/test
# du -a /tmp/test
0 /tmp/test/1.txt
0 /tmp/test
-b: Prints sizes in bytes.
-k: Prints sizes in KB, which is the same as the default output without any options.
-m: Prints sizes in MB.
-h: Prints sizes in human-readable units chosen automatically (K, M, G).
-c: Appends a grand total at the end. Not commonly used. Sample command:
# du -c /tmp/test
0 /tmp/test
0 total
-s: Lists only the total for each argument. Commonly used. Sample command:
# du -s /tmp/test
0 /tmp/test
Common usage: du -sh filename
Find all subdirectories in a directory with find and size them. The command is:
# find /dir/ -type d |sed '1d' |xargs du -sm
Description: sed '1d' deletes the first line, because the first line is the directory itself and we only want to count its subdirectories.
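Note that find -type d lists subdirectories at every depth, so nested directories are measured again inside their parents. If only the immediate subdirectories are of interest, a variant could look like the sketch below. It is not the command used in the reference script; top3_subdirs is a hypothetical name, and -mindepth, -maxdepth, -print0, and xargs -0 -r are GNU extensions that are assumed to be available.

```shell
# top3_subdirs DIR
# Prints the three largest immediate subdirectories of DIR, sizes in MB.
# (hypothetical helper; relies on GNU find and GNU xargs)
top3_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -print0 |   # direct children only
        xargs -0 -r du -sm |                              # -r: do nothing on empty input
        sort -nr |                                        # largest first
        head -3
}

# Usage:
#   top3_subdirs /var
```

With -mindepth 1 the directory itself is never listed, so the sed '1d' step is not needed, and the null-separated names keep directory names with spaces intact.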
Point 3: View processes
Under Windows, you can enter Task Manager to view the process, and under Linux, you can also view the process with some commands.
1) top
This command dynamically monitors the system resources used by processes and refreshes every 3 seconds by default. Its distinguishing feature is that the processes using the most resources are listed at the top (sorted by CPU usage by default).
The top command prints out a lot of information, including system load average, number of processes (Tasks), CPU usage, memory usage, and swap partition usage.
The lower part of the top output shows the system resources used by each process. It contains many columns, but only a few need attention: RES, %CPU, %MEM, and COMMAND. RES is the amount of physical (resident) memory the process uses, %CPU is the percentage of CPU it uses, %MEM is the percentage of memory it uses, and COMMAND is the process name.
While top is running, press M to sort by memory usage, P to switch back to sorting by CPU usage, and 1 to list the usage of each CPU separately. A common invocation is top -bn1, which prints a single static snapshot of system resource usage and is suitable for shell scripts. top also has a -c option, which shows the full command line in the COMMAND column.
2) ps
The ps command reports the status of current system processes. Common usages are ps aux and ps -elf.
In this case, to check whether a process exists we can use:
# ps aux |grep 'process name'
If the name of this script is mon_disk.sh, the check looks like this:
# ps aux |grep 'mon_disk.sh' |grep -vE "$$|grep"
Description: $$ is the PID of the current script process. It is excluded because we want to find an earlier run that is still going, not the current one. The grep process itself is excluded as well.
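One caveat with grep -v "$$" is that it drops any line containing those digits anywhere, for example an unrelated process whose PID is 12345 when the current PID is 1234. As a minimal sketch, assuming the usual ps aux layout (PID in column 2, start of the command in column 11), the helper below compares the PID column exactly. The name count_other_runs is hypothetical and not part of the reference script.

```shell
# count_other_runs NAME OWN_PID
# Reads `ps aux` output on stdin and prints how many processes mention NAME,
# excluding the process with PID OWN_PID and any grep command.
# (hypothetical helper name, used only for illustration)
count_other_runs() {
    awk -v name="$1" -v me="$2" '
        index($0, name) && $2 != me && $11 != "grep" { n++ }
        END { print n + 0 }
    '
}

# Usage inside the script:
#   others=$(ps aux | count_other_runs "mon_disk.sh" "$$")
#   [ "$others" -gt 0 ] && exit
```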
Point 4: Alert convergence logic
In this case, one requirement is that after an alert is sent, the next one should not be sent until 30 minutes later. The script runs once a minute, so without any control an alert email would go out every minute. If the problem cannot be fixed quickly, the result is a flood of mail.
The idea here is to introduce a counter and consider the following scenarios:
1) The script has never alerted before; this is the first alert
In this case there are two things to do. One is to record the current timestamp in a temporary file. The other is to create a second temporary file that counts how many times the alert condition has occurred. An occurrence of the alert condition does not necessarily send an email, so the two must be kept apart.
2) The script has alerted before, and more than 30 minutes have passed since the last alert
To know how long it has been since the last alert, read the timestamp recorded in the temporary file and check whether the difference between the current timestamp and the last one is greater than 1800 seconds. If it is, send the email immediately. The timestamp and the alert count are recorded in two different temporary files.
3) The script has alerted before, and no more than 30 minutes have passed since the last alert
When less than 30 minutes have passed, check the temporary file that records the number of alerts. An email is sent again only when the count is greater than or equal to 30.
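The three scenarios reduce to one small decision, which can be sketched as a function with no side effects. This is an illustration under stated assumptions rather than the code of the reference script: should_send and its arguments are hypothetical names, timestamps are epoch seconds, and an empty LAST stands for "no alert recorded yet".

```shell
# should_send NOW LAST COUNT
#   NOW   current timestamp (epoch seconds)
#   LAST  timestamp of the previous recorded alert, empty if there is none
#   COUNT number of alerts counted since the last email
# Prints "send" or "hold". (hypothetical helper, used only for illustration)
should_send() {
    if [ -z "$2" ]; then                  # scenario 1: first alert ever
        echo send
    elif [ $(( $1 - $2 )) -gt 1800 ]; then  # scenario 2: more than 30 minutes ago
        echo send
    elif [ "$3" -ge 30 ]; then            # scenario 3: counter has reached 30
        echo send
    else
        echo hold
    fi
}

# Usage:
#   should_send "$(date +%s)" "$last_ts" "$count"
```

Keeping the decision separate from the file handling makes it easy to check each scenario by hand before wiring it to the temporary files.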
Reference script for this case
#!/bin/bash
## Monitor disk usage with alert convergence (send mail once every 30 minutes)
## Author:
## Date:
## Version: v0.1

# Save the script name in the variable s_name
s_name=`echo $0 |awk -F '/' '{print $NF}'`
# Recipient mailbox
mail_user=admin@admin.com
# Alert flag: 0 = no alert, 1 = disk space alert, 2 = inode alert
tag=0

# Function: check disk space usage
chk_sp()
{
    # With '%| +' as the separator, "51% /" yields an empty field 6,
    # so the usage is field 5 and the mount point is field 7
    df -m |sed '1d' |awk -F '%| +' '$5>90 {print $7,$5}' >/tmp/chk_sp.log
    n=`wc -l /tmp/chk_sp.log|awk '{print $1}'`
    if [ $n -gt 0 ]
    then
        tag=1
        for d in `awk '{print $1}' /tmp/chk_sp.log`
        do
            find "$d" -type d |sed '1d' |xargs du -sm |sort -nr|head -3
        done > /tmp/most_sp.txt
    fi
}

# Function: check inode usage
chk_in()
{
    df -i |sed '1d'|awk -F '%| +' '$5>90 {print $7,$5}' >/tmp/chk_in.log
    n=`wc -l /tmp/chk_in.log|awk '{print $1}'`
    if [ $n -gt 0 ]
    then
        tag=2
    fi
}

# Alert function (mail.py is the script from Case 2)
m_mail(){
    # $1 is the first argument passed to m_mail (chk_sp or chk_in)
    log=$1
    t_s=`date +%s`
    t_s2=`date -d "1 hours ago" +%s`
    if [ ! -f /tmp/$log ]
    then
        # Create the $log file
        touch /tmp/$log
        # Make it append-only: it cannot be overwritten or deleted
        chattr +a /tmp/$log
        # First alert: write a timestamp from 1 hour ago
        echo $t_s2 >> /tmp/$log
    fi
    # Whether or not the $log file was just created, read the timestamp on its last line
    t_s2=`tail -1 /tmp/$log|awk '{print $1}'`
    # After reading the last line (the time of the previous alert), append the current timestamp
    echo $t_s>>/tmp/$log
    # Difference between the two timestamps
    v=$(($t_s-$t_s2))
    # Send mail immediately if the difference exceeds 1800 seconds
    if [ $v -gt 1800 ]
    then
        # Send mail; $2 is the second argument passed to m_mail, a file holding the mail body
        python mail.py $mail_user "Disk usage exceeds 90%" "`cat $2`" 2>/dev/null
        # Create the counter file and write 0
        echo "0" > /tmp/$log.count
    else
        # If the counter file does not exist, create it and write 0
        if [ ! -f /tmp/$log.count ]
        then
            echo "0" > /tmp/$log.count
        fi
        nu=`cat /tmp/$log.count`
        # Add 1 for every alert within 30 minutes
        nu2=$(($nu+1))
        echo $nu2>/tmp/$log.count
        # When the counter reaches 30, send the mail again
        if [ $nu2 -ge 30 ]
        then
            python mail.py $mail_user "Disk usage exceeds 90%, lasted 30 minutes" "`cat $2`" 2>/dev/null
            # After the repeated alert, restart the counter from 0
            echo "0" > /tmp/$log.count
        fi
    fi
}

# Save the process list to a temporary file first; counting lines straight from the pipe can give a wrong number
ps aux |grep "$s_name" |grep -vE "$$|grep" >/tmp/ps.tmp
p_n=`wc -l /tmp/ps.tmp|awk '{print $1}'`

# If the count is greater than 0, the previous run has not finished yet, so exit
if [ $p_n -gt 0 ]
then
    exit
fi

chk_sp
chk_in

# If both checks fire, tag ends up as 2 and only the inode alert is sent
if [ "$tag" -eq 1 ]
then
    m_mail chk_sp /tmp/most_sp.txt
elif [ "$tag" -eq 2 ]
then
    m_mail chk_in /tmp/chk_in.log
fi