Nowadays, losing important data may be more disastrous than losing a tangible asset. In some cases, a business may collapse due to a loss of vital data. A problem like that is usually caused by a sudden failure of storage media. Actually, such failures are not always sudden. Hard disk drives usually fail due to a slow degradation of their mechanical parts and magnetic platters. Just like a person, a hard drive may “feel unwell” for a long time before it eventually fails. System administrators must watch for the signs of a hard drive’s imminent failure, so that they can replace it in good time after copying valuable data to healthy storage media.
Thankfully, hard drives are capable of self-diagnostics. That capability is called S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology). The early hard drive self-diagnostics standards were jointly developed by major hard drive vendors back in 1995. Since then, the S.M.A.R.T. technology has been improved considerably. Each hard drive vendor defines a set of “S.M.A.R.T. attributes” it deems important. When a hard drive is turned on, it automatically starts checking and updating its S.M.A.R.T. attributes. Attribute values are stored in a special disk area that can only be accessed by the hard drive’s firmware. Each vendor also sets threshold values that should not be crossed during normal operation.
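To make the relationship between an attribute and its threshold concrete, here is a minimal sketch of how a single attribute could be modeled. The class and field names are our own illustration, not part of any vendor's firmware or of Network Monitor. The sketch assumes the common convention that an attribute is considered failed when its normalized value drops to or below the vendor threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmartAttribute:
    """One S.M.A.R.T. attribute as reported by a drive (illustrative field names)."""
    attr_id: int    # attribute number, for example 5
    name: str       # vendor-supplied attribute name
    value: int      # current normalized value (higher is usually better)
    worst: int      # lowest normalized value recorded so far
    threshold: int  # vendor-defined failure threshold
    raw: int        # raw counter or measurement behind the normalized value

    def has_failed(self) -> bool:
        # Assumed convention: the attribute trips when the normalized
        # value is less than or equal to the vendor threshold.
        return self.value <= self.threshold


healthy = SmartAttribute(5, "Reallocated Sectors Count",
                         value=100, worst=100, threshold=36, raw=0)
worn = SmartAttribute(5, "Reallocated Sectors Count",
                      value=30, worst=30, threshold=36, raw=2000)
print(healthy.has_failed(), worn.has_failed())
```

Note that the normalized value and the raw value are different things: the raw value is the underlying counter, while the normalized value is what the vendor compares with the threshold.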
Among other things, each system administrator must regularly check critical S.M.A.R.T. attributes and make sure that they stay within safe limits. There are special applications that can read S.M.A.R.T. attributes and display them in a user-friendly format. But when you are overwhelmed by routine work, it’s easy to forget about checking the hard drive health attributes and miss the onset of a critical situation, such as an exponential increase in read/write errors, seek errors, or reallocated sectors. That’s why we recommend that you use a monitoring system that can automatically check S.M.A.R.T. attributes around the clock and immediately notify you about any abnormal situation.
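A rapidly growing error counter is easier to spot in code than by eye. The following sketch is only an illustration of the idea and is not how any particular monitoring product works. It assumes that you already have raw counter samples taken at equal intervals; the function names and the `min_points` parameter are hypothetical.

```python
from typing import List, Sequence


def increments(samples: Sequence[int]) -> List[int]:
    """Differences between consecutive samples of a raw counter."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def is_accelerating(samples: Sequence[int], min_points: int = 4) -> bool:
    """Return True if the counter grew in every interval and each
    increase was larger than the previous one."""
    if len(samples) < min_points:
        return False  # too few samples to judge a trend
    steps = increments(samples)
    grew_every_time = all(step > 0 for step in steps)
    each_step_larger = all(b > a for a, b in zip(steps, steps[1:]))
    return grew_every_time and each_step_larger


# Reallocated sector counts sampled once a day (made-up numbers).
print(is_accelerating([0, 1, 3, 7, 15]))   # increases of 1, 2, 4, 8
print(is_accelerating([0, 5, 10, 15]))     # steady growth, not accelerating
```

This test is deliberately strict and simple. A counter that grows steadily still deserves attention, so a check like this complements a fixed limit rather than replacing it.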
10-Strike Network Monitor Pro is a multi-purpose software product that can monitor many different things, including S.M.A.R.T. attributes. You need to install it on one of your servers, and then install its agents on the computers whose hard drives you want to monitor. After that, Network Monitor will scan the network, find any available hosts, and add them to the monitoring list. Then you need to create a check called “S.M.A.R.T.” for each host that has the agent installed, and it will run at regular intervals. The interval can range from a few seconds to several hours. The monitoring core will analyze the values received from each agent and compare them with the threshold values. If a threshold is crossed, Network Monitor can notify you by sending an SMS text message, an email message, etc.
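The poll, compare, and notify cycle described above can be sketched in a few lines. This is a generic illustration under our own assumptions and does not show Network Monitor's internals or its API. The `monitor` function and all of its parameters are hypothetical; `read_value` stands in for an agent, and `notify` stands in for an SMS or email sender.

```python
import time
from typing import Callable, List


def monitor(
    read_value: Callable[[], int],
    is_ok: Callable[[int], bool],
    notify: Callable[[str], None],
    interval_seconds: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Poll a value `cycles` times and send an alert whenever the check fails."""
    history: List[int] = []
    for cycle in range(cycles):
        value = read_value()          # ask the (hypothetical) agent for a reading
        history.append(value)
        if not is_ok(value):          # compare the reading with the limit
            notify(f"S.M.A.R.T. check failed: value {value}")
        if cycle < cycles - 1:
            sleep(interval_seconds)   # wait until the next scheduled check
    return history


# Simulated temperature readings instead of a real agent.
readings = iter([40, 45, 52])
alerts: List[str] = []
history = monitor(
    read_value=lambda: next(readings),
    is_ok=lambda value: value < 50,
    notify=alerts.append,
    interval_seconds=60,
    cycles=3,
    sleep=lambda seconds: None,       # skip real waiting in this demo
)
print(history, alerts)
```

Passing `sleep` as a parameter keeps the sketch easy to try out: the demo replaces it with a no-op so that nothing actually waits for a minute between readings.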
S.M.A.R.T. Parameter Monitoring Setup
It’s easy to create a “S.M.A.R.T.” check to monitor hard drive health attributes. You need to do the following:
Install 10-Strike Network Monitor Pro.
Install Network Monitor’s agent on each host whose hard drive you want to monitor. The agent is a Windows service that will read S.M.A.R.T. attributes and send them to the monitoring core.
Launch Network Monitor and run a network scan, or manually add the hosts.
Select the host in the tree (on the left), and then select “Add check” in its context menu.
In the “Check parameters” dialog box, select the check type (“S.M.A.R.T.”). After that, click the “…” button to the right of the “Hard drive” drop-down list, and then select the hard drive in the list.
Select the S.M.A.R.T. attribute that you want to monitor (for example, “Temperature” or “Seek Error Rate”). To do that, click the “…” button to the right of the “Attribute value (RAW)” drop-down list, and then select its name in the list.
Set the alarm triggering conditions (for example, “The check is successful if the attribute value is less than 50”).
Set the check start parameters or keep the default ones. Set the notification options, and save the changes.
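An alarm condition such as “The check is successful if the attribute value is less than 50” boils down to an attribute name, a comparison, and a limit. Here is a minimal sketch of that idea. The `CheckCondition` class and the comparison labels are hypothetical and do not reflect Network Monitor's configuration format.

```python
import operator
from dataclasses import dataclass

# Comparison labels supported by this sketch (illustrative, not exhaustive).
_COMPARATORS = {
    "less than": operator.lt,
    "greater than": operator.gt,
    "equal to": operator.eq,
}


@dataclass(frozen=True)
class CheckCondition:
    """A success condition for one monitored attribute."""
    attribute: str   # attribute name, for example "Temperature"
    comparison: str  # one of the keys of _COMPARATORS
    limit: int       # value the raw attribute is compared with

    def is_successful(self, raw_value: int) -> bool:
        compare = _COMPARATORS[self.comparison]
        return compare(raw_value, self.limit)


# "The check is successful if the attribute value is less than 50."
temperature_check = CheckCondition("Temperature", "less than", 50)
print(temperature_check.is_successful(42))
print(temperature_check.is_successful(55))
```

Keeping the condition as plain data makes it easy to store several checks per host and to evaluate them in the same way.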
As soon as you add a new check, it will start collecting data. You can watch the data collection process by switching to the “Monitored parameter” tab (click the tab at the bottom of the window). Network Monitor will display a graph of the S.M.A.R.T. attribute you selected.
Below you can find a list of the most important S.M.A.R.T. attributes that you should monitor. Note that the choice of attributes depends on the storage media. A solid state drive (SSD) vendor will not provide the same S.M.A.R.T. attributes as a hard drive vendor does.
· #01 Raw Read Error Rate — The rate of hardware read errors that occurred when reading data from the magnetic platter.
· #03 Spin-Up Time — The average spindle spin-up time. An increase of the value indicates mechanical wear (for example, an increased friction in the bearing) or inadequate power supply (for example, the voltage drops each time the spindle starts to spin up).
· #05 Reallocated Sectors Count — This attribute represents the number of bad sectors that have been automatically reallocated by the hard drive’s firmware. Each time a read/write error is detected, the firmware marks the sector as remapped and transfers data from it to the reserved area. The raw value is the total number of reallocated sectors. The higher the value, the worse the condition of the platter surface.
· #07 Seek Error Rate — The rate of seek errors of the magnetic heads. An increase of the value indicates a poor condition of the platter surface or a partial failure in the mechanical positioning system. It may also be caused by overheating or external vibration (for example, from other hard drives in that computer).
· #10 Spin-Up Retry Count — The number of repeated spin-up attempts. An increase of the value indicates a problem in the hard drive’s mechanical subsystem.
· #196 Reallocation Event Count — The number of remap operations. This attribute represents the total number of attempts to transfer data from bad sectors to the reserved area (which is usually small, just several thousand sectors). Both successful and unsuccessful attempts are counted.
· #197 Current Pending Sector Count — The number of “unstable” sectors that have to be remapped due to unrecoverable read errors. These sectors have not been marked as “bad” yet, but reading data from them is unreliable (for example, it may take a few attempts). Later on, if an unstable sector is read successfully, it will be marked as “good”; otherwise, the hard drive’s firmware will try to remap it.
· #198 Uncorrectable Sector Count — The total number of sectors that cannot be read from or written to. An increase of the value indicates major defects of the magnetic platter surface or problems in the mechanical subsystem.
· #220 Disk Shift — The distance the magnetic platters have shifted relative to the spindle. This problem is mostly caused by a mechanical shock. A major increase of the value means the hard drive is unusable.
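If you collect raw values yourself, the list above can be turned into a simple lookup table. The sketch below uses the attribute numbers and names from the list. It assumes that the raw values have already been read into a dictionary keyed by attribute number, and it flags only the count-type attributes, because the raw encoding of rate-type attributes can differ between vendors. The `COUNTERS` set and the `suspicious` function are our own illustration.

```python
from typing import Dict, List

# Attribute numbers and names from the list above.
WATCHED = {
    1: "Raw Read Error Rate",
    3: "Spin-Up Time",
    5: "Reallocated Sectors Count",
    7: "Seek Error Rate",
    10: "Spin-Up Retry Count",
    196: "Reallocation Event Count",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
    220: "Disk Shift",
}

# Assumption of this sketch: for these counters, any nonzero raw value
# is worth a closer look.
COUNTERS = {5, 10, 196, 197, 198}


def suspicious(raw_values: Dict[int, int]) -> List[str]:
    """Return a report line for every watched counter with a nonzero raw value."""
    return [
        f"#{attr_id:02d} {WATCHED[attr_id]}: {raw}"
        for attr_id, raw in sorted(raw_values.items())
        if attr_id in COUNTERS and raw > 0
    ]


# Made-up raw values for one drive.
sample = {1: 123456, 5: 8, 197: 0, 198: 2}
for line in suspicious(sample):
    print(line)
```

For the sample values, the function reports attributes #05 and #198 and ignores #01, whose large raw value is not necessarily a sign of trouble.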
S.M.A.R.T. attribute monitoring is not a silver bullet. The S.M.A.R.T. technology is mostly intended to help you notice a deterioration of critical hard drive parameters in good time, so that you can replace the failing storage media before a data loss occurs. It cannot predict every failure. However, if you back up your data on a regular basis and monitor your hard drives’ S.M.A.R.T. attributes, you can minimize your data loss risks and avoid data recovery costs. Keep this rule in mind: Good S.M.A.R.T. attributes do not guarantee that the hard drive is absolutely healthy, but poor ones are usually a telltale sign of a problem.
To see how the program performs in your specific environment, we recommend that you download and try our fully functional free 30-day trial version.