Case 5. Monitoring disk usage

Keywords: Linux Python Windows shell

Disk monitoring is the most basic kind of monitoring, yet it is often neglected in day-to-day operations and maintenance, and that neglect leads to outages. If you are reading this article, take disk monitoring seriously.

The requirements for this case are as follows:

1) Check disk status every minute.

2) When disk space usage or inode usage is higher than 90%, send an email alert. Assume the recipient is admin@admin.com.

3) For each partition above 90% usage, calculate the size of all its subdirectories and include the three largest in the body of the alert email.

4) After the first alert, if the problem is not handled in time, repeat the alert every 30 minutes.

5) Because the script runs every minute, each run must check whether the previous run is still in progress; if it is, the current run exits without doing anything.


Point 1: View disk usage

Command: df

With no options, df shows the total capacity, used space, and remaining space of each mounted filesystem, in units of 1 KB by default. Common options are -i, -h, -k, and -m. The -i option shows inode usage, the -h option picks a suitable unit automatically (such as G or M), and the -k and -m options display sizes in KB and MB respectively.

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        16G  8.0G  7.9G   51% /
devtmpfs        903M     0  903M    0% /dev
tmpfs           912M     0  912M    0% /dev/shm
tmpfs           912M  8.7M  903M    1% /run
tmpfs           912M     0  912M    0% /sys/fs/cgroup
/dev/sda1       197M  113M   85M   58% /boot
tmpfs           183M     0  183M    0% /run/user/0

The first column is the partition name, the second is the total capacity of the partition, the third is how much has been used, the fourth is how much remains, the fifth is the percentage used, and the last is the mount point.
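
As a minimal sketch of how this output can be turned into a threshold check, the function below reads df output on standard input and prints the mount point and usage of every filesystem above a limit. The name over_threshold and the use of df -P (POSIX format, one line per filesystem) are assumptions for illustration and are not part of the reference script.

```shell
#!/bin/bash
# over_threshold LIMIT: hypothetical helper, not part of the reference script.
# It reads df-style output on stdin: header line first, usage percentage in
# column 5, mount point in column 6 (mount points with spaces are not handled).
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # "94%" becomes "94"
        if (use + 0 > limit) print $6, use
    }'
}

# Disk space: list filesystems above 90%
df -P | over_threshold 90
```

With GNU df, the same function should also work for inode usage by feeding it the output of df -Pi, since the percentage and mount point stay in the same columns.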


Point 2: View directory or file size

Command: du

The du command is used to see how much space a directory or file occupies.

Syntax: du [-abckmsh] [filename or directory name]

Common parameters:

-a: Lists the size of every file as well as every directory. Without any options, du lists only the sizes of directories (including subdirectories). If no unit is specified, sizes are shown in KB.

Example command:

# du /tmp/test
0	/tmp/test
# du -a /tmp/test
0	/tmp/test/1.txt
0	/tmp/test

-b: Outputs sizes in bytes.

-k: Outputs sizes in KB, which is the same as the default output without any options.

-m: Outputs sizes in MB.

-h: Picks a suitable unit automatically.

-c: Adds a grand total as the last line. Not commonly used. Example command:

# du -c /tmp/test
0	/tmp/test
0	total

-s: Lists only the total for each argument. This is a commonly used option. Example command:

# du -s /tmp/test
0	/tmp/test


Common usage: du -sh filename


Use find to locate all subdirectories under a directory and du to measure them. The command is:

# find /dir/ -type d |sed '1d' |xargs du -sm

Description: sed '1d' deletes the first line, because the first line is the directory itself and we only want to measure its subdirectories.
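
A minimal sketch of the "three largest subdirectories" step follows. It assumes GNU find, xargs, and sort, and the function name top_subdirs is hypothetical. Compared with the one-liner above, it uses -mindepth 1 in place of sed '1d' and passes NUL-separated names so that directory names containing spaces survive.

```shell
#!/bin/bash
# top_subdirs DIR [N]: print the N largest subdirectories of DIR (default 3),
# with sizes in MB. Hypothetical helper; assumes GNU find, xargs, and sort.
top_subdirs() {
    local dir=$1 count=${2:-3}
    # -mindepth 1 skips DIR itself; -print0 and -0 keep unusual names intact;
    # -r stops xargs from running du when find returns nothing
    find "$dir" -mindepth 1 -type d -print0 |
        xargs -0 -r du -sm |
        sort -nr |
        head -n "$count"
}

top_subdirs /var/log 3
```

Like the original command, this walks nested subdirectories as well, so a large directory and its parent can both appear in the result.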


Point 3: View processes

On Windows you can open Task Manager to view processes; on Linux you can view them with the commands below.

1) top

This command dynamically monitors the system resources used by processes, refreshing every 3 seconds. It lists the processes that consume the most resources at the top, sorted by CPU usage by default.

top also prints a summary that includes the system load average, the number of processes (Tasks), CPU usage, memory usage, and swap usage.

The main part of top's output is the per-process detail below the summary. It contains many columns, but only a few need attention: RES, %CPU, %MEM, and COMMAND. RES is the amount of physical memory the process occupies, %CPU is the percentage of CPU it uses, %MEM is the percentage of memory it uses, and COMMAND is the process name.

While top is running, press M to sort by memory usage, P to switch back to sorting by CPU usage, and 1 to show the usage of each CPU separately. A common invocation is top -bn1, which prints a single static snapshot of system resource usage and is suitable for shell scripts. top also has a -c option, which shows the full command line in the COMMAND column.

2) ps

The ps command reports the status of the processes currently on the system. Common usages are ps aux and ps -elf.

In this case we need to check whether a process exists, which we can do with:

# ps aux |grep 'process name'

If this script is named mon_disk.sh, the check becomes:

# ps aux |grep 'mon_disk.sh' |grep -vE "$$|grep"

Description: $$ is the PID of the current process. We exclude it because we want to find an earlier run of the script, not this one; the grep process itself is excluded as well.
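
The ps and grep check works, but filtering on $$ also drops any line that merely contains the same digits, and subshells started by the script can show up as extra matches. As an alternative sketch (a swapped-in technique, not what the reference script does), an atomic mkdir can serve as a lock. The lock path and the function name acquire_lock below are hypothetical.

```shell
#!/bin/bash
# Hypothetical lock directory; any path writable by the cron user would do
lock_dir="${TMPDIR:-/tmp}/mon_disk.lock.d"

acquire_lock() {
    # mkdir either creates the directory or fails, atomically,
    # so two overlapping runs cannot both succeed
    mkdir "$1" 2>/dev/null
}

if acquire_lock "$lock_dir"; then
    # Remove the lock when this run ends, including on errors
    trap 'rmdir "$lock_dir"' EXIT
    echo "lock acquired, running checks"
else
    echo "previous run still active, skipping"
    exit 0
fi
```

If the script is killed with a signal that cannot be trapped, the directory stays behind and has to be removed by hand. The flock utility from util-linux avoids that problem at the cost of an extra dependency.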


Point 4: Approach to alert convergence

One requirement in this case is that, once an alert has been sent, the next one should not follow until 30 minutes later. The script runs once a minute, so without any control an alert email would also go out once a minute. If the problem cannot be fixed quickly, this floods the mailbox.

The idea here is to introduce a counter and consider the following scenarios:

1) The script has never alerted before; this is the first alert

In this case there are two things to do. One is to record the current timestamp in a temporary file. The other is to create a second temporary file that counts alarms. An alarm that occurs does not necessarily send an email, so the two must be distinguished.

2) The script has alerted before, and more than 30 minutes have passed since the last alarm

To know how long it has been since the last alarm, we use the temporary file that records timestamps and check whether the current timestamp minus the last recorded timestamp is greater than 1800 seconds. If it is, the email is sent immediately. The timestamp and the alarm count are recorded in two separate temporary files.

3) The script has alerted before, and no more than 30 minutes have passed since the last alarm

In this case we look at the temporary file that records the alarm count. An email is sent again only when the count is greater than or equal to 30.
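
The three scenarios can be sketched as one decision function. This is an illustrative reduction, assuming the caller supplies the current timestamp, the timestamp of the previous alarm (empty if there was none), and the number of alarms counted since the last email. The name decide_alert is hypothetical.

```shell
#!/bin/bash
# decide_alert NOW LAST COUNT: prints "send" or "skip".
# Hypothetical helper that mirrors the three scenarios described above.
decide_alert() {
    local now=$1 last=$2 count=$3
    # Scenario 1: no earlier alarm recorded, so this is the first alert
    if [ -z "$last" ]; then
        echo send
        return
    fi
    # Scenario 2: more than 30 minutes (1800 s) since the previous alarm
    if [ $((now - last)) -gt 1800 ]; then
        echo send
        return
    fi
    # Scenario 3: recent alarm; send only once 30 alarms have accumulated
    if [ $((count + 1)) -ge 30 ]; then
        echo send
    else
        echo skip
    fi
}

decide_alert "$(date +%s)" "" 0    # first alert, no previous timestamp
```

Keeping the decision separate from the file handling makes the 30-minute rule easy to check by hand with fixed timestamps before wiring it to the temporary files.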


Reference script for this case

#!/bin/bash
##Monitor disk usage and alert convergence (send mail once in 30 minutes)
##Author:
##Date:
##Version: v0.1

#Save the script name in variable s_name
s_name=`echo $0 |awk -F '/' '{print $NF}'`
#Define recipient mailbox
mail_user=admin@admin.com

#Define the function that checks disk space usage
chk_sp()
{
  df -m |sed '1d' |awk -F '%| +' '$5>90 {print $7,$5}'>/tmp/chk_sp.log
  n=`wc -l /tmp/chk_sp.log|awk '{print $1}'`
  if [ $n -gt 0 ]
  then
      tag=1
      for d in `awk '{print $1}' /tmp/chk_sp.log`
      do
        find $d -type d |sed '1d' |xargs du -sm |sort -nr|head -3
      done > /tmp/most_sp.txt
  fi
}

#Define a function to check iNode usage
chk_in()
{
  df -i |sed '1d'|awk -F '%| +' '$5>90 {print $7,$5}'>/tmp/chk_in.log
  n=`wc -l /tmp/chk_in.log|awk '{print $1}'`
  if [ $n -gt 0 ]
  then
      tag=2
  fi
}

#Define the alert function (where mail.py is the script in Case 2)
m_mail(){
   log=$1  #$1 is the first argument to m_mail: the check name (chk_sp or chk_in), used as the log file name
   t_s=`date +%s`
   t_s2=`date -d "1 hours ago" +%s`
   if [ ! -f /tmp/$log ]
   then
       #Create $log file
       touch /tmp/$log
       #Set the append-only attribute: content may be appended, but not changed or deleted
       chattr +a /tmp/$log
       #First alert: write a timestamp from 1 hour ago so that the difference computed below exceeds 1800
       echo $t_s2 >> /tmp/$log
   fi
   #Whether or not the $log file has just been created, read the timestamp on its last line
   t_s2=`tail -1 /tmp/$log|awk '{print $1}'`
   #After reading the last line (the timestamp of the previous alarm), append the current timestamp
   echo $t_s >> /tmp/$log
   #Take the difference between the two timestamps
   v=$((t_s - t_s2))
   #Send mail immediately if the difference exceeds 1800
   if [ $v -gt 1800 ]
   then
      #Send mail; $2 is the second argument to m_mail, a file whose content is passed as the mail body
      python mail.py $mail_user "Disk usage exceeds 90%" "`cat $2`" 2>/dev/null
      #Define counter temporary file and write 0
      echo "0" > /tmp/$log.count
   else
      #If the counter temporary file does not exist, you need to create and write 0
      if [ ! -f /tmp/$log.count ]
      then
          echo "0" > /tmp/$log.count
      fi
      nu=`cat /tmp/$log.count`
      #Add 1 for every alarm within 30 minutes
      nu2=$((nu + 1))
      echo $nu2 > /tmp/$log.count
      #When the number of alarms reaches 30, the message needs to be sent again
      if [ $nu2 -ge 30 ]
      then
          python mail.py $mail_user "Disk usage exceeds 90%, lasted 30 minutes" "`cat $2`" 2>/dev/null
          #After the repeat alert, start the counter from 0 again
          echo "0" > /tmp/$log.count
      fi
    fi
}

#Save the process list in a temporary file first; counting lines through a pipe inside backticks would give a wrong result
ps aux |grep "$s_name" |grep -vE "$$|grep">/tmp/ps.tmp
p_n=`wc -l /tmp/ps.tmp|awk '{print $1}'`

#If the count is greater than 0, the previous run has not finished yet
if [ $p_n -gt 0 ]
then
    exit
fi

#Initialize tag so the tests below do not fail when no threshold is exceeded
tag=0
chk_sp
chk_in

if [ "$tag" == 1 ]
then
   m_mail chk_sp /tmp/most_sp.txt
elif [ "$tag" == 2 ]
then
   m_mail chk_in /tmp/chk_in.log
fi


Posted by BeastRider on Sat, 20 Jul 2019 18:20:16 -0700