Background
SMART (Self-Monitoring, Analysis and Reporting Technology) is a disk self-analysis and detection technology that was already in widespread use by the late 1990s.
Every hard disk (IDE and SCSI alike) records its own parameters while it runs.
These parameters include model, capacity, temperature, density, sectors, seek time, transfer rate, bit error rate, and so on.
After thousands of hours of operation, many of the disk's internal physical parameters change.
If a parameter exceeds its alarm threshold, the disk is close to failing.
At that point the disk still works, but if the user ignores the alarm and keeps using it,
the disk becomes very unreliable and may fail at any time.
Enable SMART
SMART works together with the corresponding support in the motherboard BIOS.
To use SMART, first enter the BIOS setup and enable the relevant option.
Motherboards from the Pentium II era onward generally support SMART.
Once the BIOS is set up, the rest happens at the operating system level.
Unfortunately, Windows has no full-featured built-in SMART tool, so third-party software has to be installed.
Fortunately, Linux has supported SMART for a long time.
If you install Linux in VMware or another virtual machine, you may see a service fail at boot: smartd.
smartd is the SMART daemon; the error is reported because the VMware virtual disk does not support SMART. The error can be found in the system log:
grep "error" /var/log/messages*
Common commands
1. smartctl -a <device>: displays all SMART information for the disk and shows whether SMART is enabled on the device.
2. smartctl -H <device>: checks the overall health of the disk. It rarely reveals a problem on its own, so it is of limited use.
3. smartctl -l selftest <device>: displays the self-test log.
4. smartctl -l error <device>: displays the disk's error log.
5. smartctl -A <device>: displays the device's vendor-specific SMART attributes and their values.
6. There are four ways to test the disk manually:
smartctl -t short <device>: short test in the background
smartctl -t long <device>: long test in the background
smartctl -C -t short <device>: short test in the foreground
smartctl -C -t long <device>: long test in the foreground
These all run the disk's own SMART self-test routines. A running background test can be aborted with smartctl -X <device>.
7. smartctl -i <device>: displays the device's identity information and shows whether SMART support is enabled.
If the output contains "SMART support is: Enabled", SMART is supported and switched on.
If it says Disabled, enable SMART with: smartctl --smart=on --offlineauto=on --saveauto=on <device>
8. smartctl -s on <device>: turns SMART on if it is not already enabled.
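As a rough illustration, the options listed above could be collected into a small lookup table so that scripts do not have to repeat the flags. The sketch below is in Python; the task names and the smartctl_argv helper are invented for this example, and only the smartctl options themselves come from the list above.

```python
import shlex

# Hypothetical task names mapped to the smartctl options listed above.
SMARTCTL_OPTIONS = {
    "all": ["-a"],
    "health": ["-H"],
    "selftest_log": ["-l", "selftest"],
    "error_log": ["-l", "error"],
    "attributes": ["-A"],
    "short_test": ["-t", "short"],
    "long_test": ["-t", "long"],
    "abort_test": ["-X"],
    "info": ["-i"],
    "enable": ["-s", "on"],
}


def smartctl_argv(task, device):
    """Return the argument list for one smartctl task (illustrative helper)."""
    if task not in SMARTCTL_OPTIONS:
        raise ValueError(f"unknown task: {task}")
    return ["smartctl", *SMARTCTL_OPTIONS[task], device]


if __name__ == "__main__":
    # Print each command line without running it.
    for task in SMARTCTL_OPTIONS:
        print(shlex.join(smartctl_argv(task, "/dev/sda")))
```

The function only builds argument lists, so it can be checked without root privileges or a real disk; passing the result to subprocess.run would execute the command.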
Troubleshooting procedure
1. Check the overall health of the disk: smartctl -H /dev/sda
2. View the full details of the disk: smartctl -a /dev/sda
3. Run a short self-test: smartctl -t short /dev/sda
4. View the self-test results: smartctl -l selftest /dev/sda
5. Check the disk's error log: smartctl -l error /dev/sda
These steps are enough to establish the basic health of the disk.
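The read-only parts of this procedure could be scripted roughly as follows. This is a minimal Python sketch, not a definitive implementation: quick_check and run_smartctl are hypothetical helpers, and the string matching assumes ATA-style output like the sample in this article ("test result: PASSED", "Completed without error"). The runner is passed in as a parameter so the logic can be tried with canned text.

```python
import subprocess


def run_smartctl(args):
    """Run smartctl with the given options and return its standard output."""
    # smartctl reports findings through its exit status, so a non-zero
    # exit code is not treated as an exception here.
    completed = subprocess.run(
        ["smartctl", *args], capture_output=True, text=True, check=False
    )
    return completed.stdout


def quick_check(device, run=run_smartctl):
    """Run the health and self-test-log steps and summarize them."""
    health_out = run(["-H", device])
    selftest_out = run(["-l", "selftest", device])
    # Self-test log entries start with "#"; keep those that did not
    # finish cleanly (failed, aborted, or still in progress).
    unclean = [
        line.strip()
        for line in selftest_out.splitlines()
        if line.startswith("#") and "Completed without error" not in line
    ]
    return {
        "device": device,
        "health_passed": "test result: PASSED" in health_out,
        "unclean_self_tests": unclean,
    }

# Example call (needs root and a SMART-capable disk):
#   print(quick_check("/dev/sda"))
```

Starting the short test itself (smartctl -t short) is left out on purpose, since it runs in the background and the log should only be read after the recommended polling time has passed.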
View test results
# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-358.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SV300S37A60G
Serial Number:    50026B724A01E182
LU WWN Device Id: 5 0026b7 24a01e182
Firmware Version: 580ABBF0
User Capacity:    60,022,480,896 bytes [60.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 revision 3
Local Time is:    Wed Oct 11 15:41:49 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (   0) seconds.
Offline data collection
capabilities:                    (0x7d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0025) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       8601004262
  5 Reallocated_Sector_Ct   0x0033   099   099   003    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       145066815403100
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       120
171 Unknown_Attribute       0x000a   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       97
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       96
181 Program_Fail_Cnt_Total  0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0000   029   041   000    Old_age   Offline      -       77312098333
194 Temperature_Celsius     0x0022   029   041   000    Old_age   Always       -       29 (Min/Max 18/41)
195 Hardware_ECC_Recovered  0x001c   102   102   000    Old_age   Offline      -       8601004262
196 Reallocated_Event_Count 0x0033   099   099   003    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x001c   102   102   000    Old_age   Offline      -       8601004262
204 Soft_ECC_Correction     0x001c   102   102   000    Old_age   Offline      -       8601004262
230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   091   091   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   000   000   000    Old_age   Always       -       9261
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       14820
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       14820
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       6033

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13404         -
# 2  Selective offline   Completed without error       00%     13404         -
# 3  Selective offline   Completed without error       00%     13404         -
# 4  Selective offline   Completed without error       00%     13404         -
# 5  Short offline       Completed without error       00%     13403         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0       10  Not_testing
    2       10       20  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
FLAG is the attribute's flag field, and WHEN_FAILED records whether and when the attribute has failed. In the output above the WHEN_FAILED column is empty (shown as -), which indicates that the disk has not reported a failure. If WHEN_FAILED shows a value such as FAILING_NOW or In_the_past, the attribute has crossed its threshold and the disk may have extensive bad sectors.
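To make the WHEN_FAILED check concrete, here is a small Python sketch that parses attribute rows in the ten-column layout shown in the output above and lists the attributes whose WHEN_FAILED column is not "-". The Attribute type and both function names are made up for this illustration, and the parser assumes that exact column layout.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attributes(text):
    """Parse attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW."""
    attrs = []
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value such as
        # "29 (Min/Max 18/41)" stays in one piece.
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            attrs.append(
                Attribute(
                    int(parts[0]), parts[1], int(parts[3]), int(parts[4]),
                    int(parts[5]), parts[8], parts[9].strip(),
                )
            )
    return attrs


def failed_attributes(attrs):
    """Return attributes whose WHEN_FAILED column is anything other than '-'."""
    return [a for a in attrs if a.when_failed != "-"]


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   003    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   029   041   000    Old_age   Always       -       29 (Min/Max 18/41)
"""

if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print([a.name for a in attrs])
    print(failed_attributes(attrs))  # [] for the sample rows above
```

In practice the text would come from smartctl -A <device> rather than from the SAMPLE string.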
1. Read error rate: the cumulative number of errors while reading data. A non-zero value suggests the disk has, or may soon have, bad sectors. Some vendors encode this raw value in their own format, so a large number (as in the output above) is not necessarily a fault.
2. Reallocated sectors count: spare sectors are reserved when the disk is manufactured. When a normal sector suffers a read, write, or verify error, it is remapped to a spare sector, the faulty sector is retired, and the count is incremented. As the count grows, I/O performance drops sharply. A non-zero value means the disk's health needs close attention; a value that keeps climbing means the disk is already damaged; once the number of reallocated sectors exceeds the number of spare sectors, the disk cannot be repaired. For SSDs, Reallocated_Sector_Ct counts the bad blocks produced since the disk left the factory: the initial value is 100, and once bad blocks appear the counter starts at 1 and increases by 1 for every 4 bad blocks.
3. Power-on time: the cumulative time the disk has been powered on. The unit varies (days, hours, minutes, or seconds), and it is unclear whether sleep/suspend time is counted. A newly purchased disk should show less than 100 hours.
4. Power cycle count: incremented each time the disk is powered on. A new disk should show fewer than 10.
5. Temperature: the disk's temperature. In theory it is only a few degrees above the ambient temperature. It can also be read with: sudo hddtemp /dev/sda
6. Reallocation event count: the number of remap operations performed for the reallocated sectors described above. Both successful and failed operations are counted. If the remaps succeed, the disk may still be salvageable; if they fail, the disk is probably finished.
7. Throughput performance: the average disk throughput. There is generally no value until a manual offline SMART test has been run.
Spin-up time: the time the spindle motor takes to reach the required speed (in milliseconds or seconds).
Start/stop count: the number of times the motor has started and stopped. Each power-on, shutdown, or resume from hibernation increments the count. A new disk should show fewer than 10.
Seek error rate: incremented each time the head fails to position correctly. If it keeps climbing, the mechanical parts may be about to fail.
Seek time performance: the shorter the seek time, the faster data can be read. If the seek time increases, the mechanical parts may be about to fail.
Spin-up retry count: the cumulative number of times the motor failed to reach the specified speed on start. Failures here suggest that the power system may be failing.
G-sense error rate: the shock count. Each abnormal acceleration (such as a drop or a throw) makes the head retract to the landing zone immediately and increments the count.
Power-off retract count: the number of abnormal power-offs, that is, times the head had not fully returned to the landing zone before power was cut. Each abnormal power-off increments the count.
Load/unload cycle count: the number of times the head has returned to the landing zone during operation. (Note: it has been rumored that a certain Linux distribution, left unnamed here, constantly forces the head to park when running on battery. Since the maximum number of head load cycles is about 600k, this led to the belief that Linux damages hard disks, which is not actually the case.)
Current pending sector count: the number of abnormal sectors waiting to be remapped. If such a sector is later read or written successfully, the count decreases and the sector is not remapped. Read errors do not trigger a remap; only write errors do.
Uncorrectable sector count: the count of all uncorrectable read/write errors. A non-zero value means the disk has bad sectors and should be treated as failed.
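The sector-related counters described above are the ones most worth watching automatically. The Python sketch below flags any of them whose raw value is non-zero. The attribute IDs and names (5, 196, 197, 198) follow the usual smartctl naming for ATA disks, which is an assumption about the target drive; sector_warnings is a hypothetical helper, and the input is a plain mapping from attribute ID to raw value that would be filled in from smartctl -A output.

```python
# Attribute IDs as commonly reported by smartctl for ATA disks
# (assumed here; vendors may differ).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def sector_warnings(raw_by_id):
    """Return a warning string for each watched attribute with a non-zero raw value."""
    warnings = []
    for attr_id, name in WATCHED.items():
        raw = raw_by_id.get(attr_id)
        # Attributes the disk does not report are simply skipped.
        if raw is not None and raw > 0:
            warnings.append(f"{name} (ID {attr_id}) raw value is {raw}")
    return warnings


if __name__ == "__main__":
    # Raw values for IDs 5 and 196 taken from the sample output in this article.
    print(sector_warnings({5: 0, 196: 0}))  # []
```

A single non-zero reading is a reason to look closer; a value that keeps climbing between checks is the stronger signal.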
Additional SMART attributes on SSDs:
Of these, four deserve the most attention:
1. Media_Wearout_Indicator: the wear level, where 100 means no wear. It reflects how many erase cycles the NAND on the SSD has gone through. The initial value is 100, and it decreases linearly as erase cycles accumulate, in proportion to the fraction of the maximum erase count used so far. Once the value reaches 1 it stops decreasing, which means the NAND has reached its maximum number of erase cycles. At that point, back up the data and replace the SSD.
The machine here reads 099: out of 100 points of life, only 1 has been used so far.
2. Reallocated_Sector_Ct: the number of bad blocks produced since the disk left the factory. The initial value is 100; once bad blocks appear, the counter starts at 1 and increases by 1 for every 4 bad blocks.
The machine here has no bad blocks.
3. Host_Writes_32MiB: the amount of data written, in units of 32 MiB. The raw value increases by 1 for every 65536 sectors written, where a sector is a fixed unit of 512 bytes.
For example, this disk has written 1284966 * 65536 * 512 bytes = 40155.1875 GiB.
Note that each machine has one disk with fewer writes. That disk is the hot spare.
Each of our machines has 7 SSDs: six form a RAID 5 array, and the seventh is the hot spare.
4. Available_Reservd_Space: the reserved space remaining on the SSD. The initial value is 100, meaning 100%, and the threshold is 10. A value of 10 means the reserved space cannot shrink any further.
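The Host_Writes_32MiB arithmetic and the two limits mentioned above (a wear indicator that bottoms out at 1, and a reserved-space threshold of 10) can be written down as a short Python sketch. The function names are invented for this example, and the limits are the ones quoted in this article rather than values read from any particular drive.

```python
SECTOR_BYTES = 512
SECTORS_PER_UNIT = 65536  # 65536 sectors * 512 bytes = 32 MiB per raw unit


def host_writes_gib(raw_value):
    """Convert a Host_Writes_32MiB raw value to GiB."""
    return raw_value * SECTORS_PER_UNIT * SECTOR_BYTES / 2**30


def ssd_wear_notes(media_wearout, available_reserved, reserved_threshold=10):
    """Return notes for the two SSD limits described in the text."""
    notes = []
    if media_wearout <= 1:
        notes.append("Media_Wearout_Indicator has bottomed out: back up and replace the SSD")
    if available_reserved <= reserved_threshold:
        notes.append("Available_Reservd_Space is at or below its threshold")
    return notes


if __name__ == "__main__":
    print(host_writes_gib(1284966))  # 40155.1875
    print(ssd_wear_notes(99, 100))   # []
```

Dividing by 2**30 makes it explicit that the result is in binary gigabytes (GiB), which is the unit the worked example in the text arrives at.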
-----------------------------------
smartctl output details
https://blog.51cto.com/wenqiang/1434581
Author: goodness is like water_ 001
Link: https://www.jianshu.com/p/da26137065d9
Source: Jianshu