Checking disk health on Linux with the smartctl command

Keywords: Linux Operation & Maintenance IDE

Background

SMART (Self-Monitoring, Analysis and Reporting Technology) is a disk self-analysis and detection technology that has been widespread since the late 1990s.
Every hard disk (both IDE and SCSI) records its own parameters while running.
These parameters include model, capacity, temperature, density, sectors, seek time, transfer rate, bit error rate, and so on.
After thousands of hours of operation, many of the disk's internal physical parameters change.
If a parameter crosses its alarm threshold, the disk is close to failing.
At that point the disk still works, but if the user ignores the alarm and keeps using it,
the disk becomes very unreliable and may fail at any time.

Enable SMART
SMART works together with the corresponding function in the motherboard BIOS.
To use SMART, first enter the motherboard BIOS setup and enable the relevant option.
Generally, motherboards from the Pentium II era onward support SMART.
Once it is enabled in the BIOS, the rest is up to the operating system.
Unfortunately, Windows has no built-in SMART tools (third-party software must be installed).
Fortunately, Linux has supported SMART for a long time.
If you install Linux in VMware or another virtual machine, you may see a service fail at boot: smartd.
This service is the SMART daemon (the error is reported because the VMware virtual disk does not support SMART).

grep "error" /var/log/messages*

Common commands

1. smartctl -a <device>: displays all SMART information for the disk. Use it to check that the device has SMART turned on.

2. smartctl -H <device>: checks the overall health of the disk. It usually reports no problem even when one is developing, so it is of limited use on its own.

3. smartctl -l selftest <device>: displays the disk's self-test log.

4. smartctl -l error <device>: displays the disk's error log.

5. smartctl -A <device>: displays the device's vendor-specific SMART attributes and values.

6. There are four ways to test the disk manually:
smartctl -t short <device>: short test, run in the background
smartctl -t long <device>: long test, run in the background
smartctl -C -t short <device>: short test, run in the foreground
smartctl -C -t long <device>: long test, run in the foreground

These commands run the self-test routines built into the disk's SMART implementation. A running background test can be aborted with smartctl -X <device>.

7. smartctl -i <device>: displays the device's identity information and shows whether SMART support is turned on.
"SMART support is: Enabled" means the disk supports SMART and it is switched on.
If it says Disabled, run smartctl --smart=on --offlineauto=on --saveauto=on <device> to enable SMART.

8. smartctl -s on <device>: if SMART is not turned on, this shorter command turns it on.
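The check in items 7 and 8 can be scripted. The following is a minimal sketch, not a definitive implementation: smart_is_enabled is a hypothetical helper name, and it assumes the "SMART support is: Enabled" line appears exactly as in the sample output of smartctl -i.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -i` output on stdin and returns 0
# when the "SMART support is: Enabled" line is present, non-zero otherwise.
smart_is_enabled() {
    grep -q '^SMART support is:[[:space:]]*Enabled'
}
```

On a real machine it could be used as: smartctl -i /dev/sda | smart_is_enabled || smartctl -s on /dev/sda (root privileges are required, and /dev/sda is only an example device).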

Procedure

First check the disk's health status with smartctl -H /dev/sda, then view its details with smartctl -a /dev/sda. Next, run a short test with smartctl -t short /dev/sda and, once it has finished, view the test results with smartctl -l selftest /dev/sda. Finally, check the disk's error log with smartctl -l error /dev/sda. Together these steps are enough to establish the basic health of the disk.
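The sequence above can be wrapped in a small function. This is only a sketch under stated assumptions: check_disk is a hypothetical name, the default wait of 120 seconds is an assumed value (the "recommended polling time" reported by smartctl -a is the better guide), and smartctl's exit status is deliberately not interpreted here.

```shell
#!/bin/sh
# Hypothetical helper: runs the five checks in the order described above.
# $1 = device (for example /dev/sda), $2 = seconds to wait for the short test.
check_disk() {
    dev=$1
    wait_seconds=${2:-120}   # assumed default, adjust to the disk's polling time

    echo "== health ==";        smartctl -H "$dev"
    echo "== details ==";       smartctl -a "$dev"
    echo "== short test ==";    smartctl -t short "$dev"
    sleep "$wait_seconds"       # give the background self-test time to finish
    echo "== self-test log =="; smartctl -l selftest "$dev"
    echo "== error log ==";     smartctl -l error "$dev"
}
```

It would be called as root, for example check_disk /dev/sda, and the output read from top to bottom in the same order as the manual procedure.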

View test results

# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-358.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SV300S37A60G
Serial Number:    50026B724A01E182
LU WWN Device Id: 5 0026b7 24a01e182
Firmware Version: 580ABBF0
User Capacity:    60,022,480,896 bytes [60.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 revision 3
Local Time is:    Wed Oct 11 15:41:49 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x7d) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Abort Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    (  48) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x0025) SCT Status supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       8601004262
  5 Reallocated_Sector_Ct   0x0033   099   099   003    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       145066815403100
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       120
171 Unknown_Attribute       0x000a   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       97
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       96
181 Program_Fail_Cnt_Total  0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0000   029   041   000    Old_age   Offline      -       77312098333
194 Temperature_Celsius     0x0022   029   041   000    Old_age   Always       -       29 (Min/Max 18/41)
195 Hardware_ECC_Recovered  0x001c   102   102   000    Old_age   Offline      -       8601004262
196 Reallocated_Event_Count 0x0033   099   099   003    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x001c   102   102   000    Old_age   Offline      -       8601004262
204 Soft_ECC_Correction     0x001c   102   102   000    Old_age   Offline      -       8601004262
230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   091   091   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   000   000   000    Old_age   Always       -       9261
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       14820
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       14820
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       6033

SMART Error Log not supported
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13404         -
# 2  Selective offline   Completed without error       00%     13404         -
# 3  Selective offline   Completed without error       00%     13404         -
# 4  Selective offline   Completed without error       00%     13404         -
# 5  Short offline       Completed without error       00%     13403         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0       10  Not_testing
    2       10       20  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

FLAG holds the attribute's flag bits, and WHEN_FAILED reports whether the attribute has failed. In the output above the WHEN_FAILED column is empty (shown as -), which means the hard disk has not reported a fault. If WHEN_FAILED is not empty (smartctl prints FAILING_NOW or In_the_past there), the disk may have a large number of bad sectors.
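Reading the WHEN_FAILED column by eye gets tedious on many disks. The sketch below filters the attribute table for suspicious rows. flag_failing_attributes is a hypothetical name, and the rule it applies (WHEN_FAILED is not "-", or VALUE has dropped to THRESH while THRESH is above zero) is an assumption about how to read the table, not smartctl's own logic.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -A` (or -a) output on stdin and prints
# one line per attribute that looks failed.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
flag_failing_attributes() {
    awk '
        /^ID#/ { in_table = 1; next }          # header row opens the table
        in_table && NF == 0 { in_table = 0 }   # blank line closes it
        in_table && $1 ~ /^[0-9]+$/ {
            value = $4 + 0; thresh = $6 + 0
            if ($9 != "-" || (thresh > 0 && value <= thresh))
                printf "%s %s value=%s thresh=%s when_failed=%s\n", $1, $2, $4, $6, $9
        }
    '
}
```

For example, smartctl -A /dev/sda | flag_failing_attributes prints nothing for the sample output above, because every WHEN_FAILED entry is "-" and no VALUE has reached a non-zero THRESH.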
1. read error rate   Read error rate: records the cumulative number of data read errors. A non-zero value indicates that the disk has, or may soon have, bad sectors;

2. reallocated sectors count (Reallocated_Sector_Ct)   Reallocated sector count: when a disk is manufactured, some sectors are held in reserve. When a normal sector has a read / write / verify error, it is remapped to a reserved sector, the faulty sector is retired, and the count goes up. As the count increases, I/O performance drops sharply. If the value is not 0, watch the health of the disk closely; if it keeps climbing, the disk is damaged; if the number of reallocated sectors exceeds the number of reserved sectors, the disk can no longer be repaired. For bad blocks produced after the disk leaves the factory, the initial value is 100; once there are bad blocks, the count starts from 1 and increases by 1 for every 4 bad blocks;

3. power-on time   Cumulative power-on time: the total time the disk has been powered on. (The unit may be days, hours, minutes, or seconds. It is unclear whether sleep / suspend time is included. A newly purchased disk should show less than 100 hours);

4. power cycle count   Power cycle count: increases by one each time the disk is powered on. A new disk should show fewer than 10;

5. temperature   Temperature: in theory, the disk runs only a few degrees warmer than its working environment. (It can also be read with sudo hddtemp /dev/sda);

6. reallocation event count   Number of sector remapping operations: this counts the remapping operations described in item 2. Both successful and failed operations are counted. If the remapping succeeds, the disk may still be usable; if it fails, the disk may have to be scrapped;

7. throughput performance   Disk throughput: the average throughput performance (generally there is no value until a manual offline S.M.A.R.T. test has been run);

spinup time   Time the spindle motor takes to reach the required speed (in ms or s);

start/stop count   Motor start / stop count (each power-on, shutdown, or resume from hibernation increases the count by one. A new disk should show fewer than 10);

seek error rate   Seek error rate: each head positioning error increases the count by one. If it keeps climbing, the mechanical parts may be about to fail;

seek timer performance   Seek time: the shorter the seek time, the faster data can be read. If the time increases, the mechanical parts may fail soon;

spinup retry count   Motor spin-up retry count: the cumulative number of times the motor failed to reach the specified speed on start. A non-zero value suggests the power or drive system may be failing;

g-sensor error rate   Shock count: counts abnormal acceleration events (such as the disk being dropped or thrown). Each time, the head immediately returns to the landing zone and the count increases by one;

power-off retract count   Abnormal power-off count: the number of times the head had not fully returned to the landing zone before power was cut. Each abnormal power-off increases the count by one;

load/unload cycle count   Head parking count: the number of times the head has returned to the landing zone during operation. (PS: it is rumored that a certain Linux distribution, not named here, keeps forcing the head to park when running on battery, and that a head can survive only about 600k parking cycles, which led people to believe that Linux damages hard disks. In fact this is not the case);

current pending sector count   Number of sectors waiting to be remapped: the number of sectors that have had errors and are waiting to be remapped. If a suspect sector is later read or written successfully, the count goes down and the sector is not remapped. Read errors do not trigger remapping; only write errors do;

uncorrectable sector count   Number of uncorrectable sectors: counts all read / write errors that could not be corrected. If it is not 0, the disk has bad sectors and should be replaced;
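Several of the counters above matter mainly when their raw value is not 0. The sketch below prints only those. nonzero_sector_counters is a hypothetical name, and the attribute IDs it watches (5, 196, 197, 198 for reallocated sectors, reallocation events, pending sectors, and uncorrectable sectors) are the IDs commonly used for these counters; a given vendor may number them differently.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# NAME=RAW_VALUE for the sector-health counters whose raw value is non-zero.
nonzero_sector_counters() {
    awk '
        $1 ~ /^(5|196|197|198)$/ && NF >= 10 && $10 + 0 != 0 {
            print $2 "=" $10
        }
    '
}
```

A typical call would be smartctl -A /dev/sda | nonzero_sector_counters; empty output means none of the watched counters has moved off zero.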

Explanation of the additional attributes reported by SSDs:

Of these, the following four deserve the most attention:

1. Media_Wearout_Indicator: wear level, where 100 means no wear. It reflects how many erase cycles the NAND on the SSD has been through. The value starts at 100 and decreases linearly as erase cycles accumulate, in proportion to how much of the maximum erase count has been used. Once the value drops to 1 it stops decreasing, which means the NAND has reached its maximum number of erase cycles. At that point, back up the data and replace the SSD.

The machine above reads 099: out of 100 points of life, only 1 has been used so far.

2. Reallocated_Sector_Ct: the number of bad blocks produced after the disk left the factory. The initial value is 100. Once there are bad blocks, the count starts from 1 and increases by 1 for every 4 bad blocks.

This machine has no bad blocks.

3. Host_Writes_32MiB: the amount of data written, in units of 32 MiB. The raw value increases by 1 for every 65536 sectors written, where a sector is 512 bytes.

For example, this disk has written 1284966 * 65536 * 512 bytes = 40155.1875 GiB.

Note that each machine has one disk with far fewer writes than the others. That disk is the hot spare.

Each of our machines has 7 SSDs: six form a RAID 5 array, and the seventh is the hot spare.

4. Available_Reservd_Space: the remaining reserved space on the SSD. The initial value is 100, meaning 100%, and the threshold is 10. A drop to 10 means the reserved space cannot shrink any further.
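The Host_Writes_32MiB arithmetic from item 3 can be reproduced with a one-line conversion. This is a sketch: host_writes_gib is a hypothetical name, and it assumes the raw value counts units of 65536 sectors of 512 bytes each, as described above.

```shell
#!/bin/sh
# Hypothetical helper: converts a Host_Writes_32MiB raw value to GiB.
# raw * 65536 sectors * 512 bytes, divided by 1024^3 bytes per GiB.
host_writes_gib() {
    awk -v raw="$1" 'BEGIN { printf "%.4f\n", raw * 65536 * 512 / (1024 * 1024 * 1024) }'
}
```

Calling host_writes_gib 1284966 reproduces the 40155.1875 figure from the example.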
-----------------------------------
smartctl output details
https://blog.51cto.com/wenqiang/1434581

Author: goodness is like water_ 001
Link: https://www.jianshu.com/p/da26137065d9
Source: Jianshu
The copyright belongs to the author. For commercial reprint, please contact the author for authorization, and for non-commercial reprint, please indicate the source.

Posted by binarylime on Sat, 06 Nov 2021 06:07:19 -0700