Using smartctl for hard drive diagnostics

Diagnose drive health with `smartctl`. This guide covers installation, S.M.A.R.T. checks, self-tests, RAID considerations & cPanel alerts.

Understanding the health of your server’s hard drives is crucial for preventing data loss and ensuring optimal performance. S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system built into most modern hard disk drives (HDDs) and solid-state drives (SSDs). The smartctl command-line utility allows you to interact with the SMART system to check drive health, run self-tests, and retrieve valuable diagnostic information.

This guide will walk you through installing (if necessary) and using smartctl on your Linux or Windows server.

Prerequisites

Linux

smartmontools, the package containing smartctl, is typically pre-installed on Liquid Web managed servers. To check if it’s installed, you can try running:

smartctl -V

If the command is not found, install the package with your distribution’s package manager:

  • For RHEL-based systems (e.g., AlmaLinux, CentOS):
    sudo yum install smartmontools
    # or
    sudo dnf install smartmontools
  • For Debian-based systems (e.g., Ubuntu):
    sudo apt update
    sudo apt install smartmontools
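
The installation check can also be wrapped in a small guard, so that a script reports a missing utility instead of failing with an unclear error. This is a minimal sketch: the helper name have_cmd is hypothetical, and it relies only on the POSIX `command -v` builtin.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd smartctl; then
  # Print only the first line of the version banner.
  smartctl -V | head -n 1
else
  echo "smartctl not found; install the smartmontools package" >&2
fi
```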

Windows (Dedicated Servers)

For Windows dedicated servers:

  1. Download the smartmontools package. A version is mirrored on Liquid Web’s SysRes files: http://files.sysres.liquidweb.com/smartmontools/smartmontools-6.6-1.win32-setup.exe (November 2017 release). For the latest version, visit the official site: https://www.smartmontools.org/wiki/Download#InstalltheWindowspackage
  2. Run the installer and follow the on-screen prompts.

Identifying drives

It’s important to run smartctl commands on the whole disk device (e.g., /dev/sda, /dev/sdb) rather than a partition (e.g., /dev/sda1).

Linux

Here are several commands to help you identify the available disk devices:

  • lsblk: Lists block devices in a tree-like format, showing disks and their partitions.
    lsblk
  • df -h: Shows disk space usage for mounted filesystems and their corresponding devices.
    df -h
  • sudo fdisk -l: Lists disk partitions. You may need sudo to run this command.
    sudo fdisk -l
  • mount: Shows all mounted filesystems.
    mount | sort
  • Identifying disks in software RAID: If your server uses software RAID (md arrays), you can see which physical disks make up the array with:
    cat /proc/mdstat
    Run smartctl on the member disks (e.g., /dev/sda, /dev/sdb), not on the /dev/mdX device itself.

Windows

  • To list the devices smartctl can scan, run the following in PowerShell:
    smartctl --scan
    This outputs device names like /dev/sda, /dev/sdb, etc.
  • To correlate these with Windows device IDs, you can use wmic:
    wmic diskdrive get deviceid,model,serialnumber,size
    Typically, /dev/sda corresponds to \\.\PHYSICALDRIVE0, /dev/sdb to \\.\PHYSICALDRIVE1, and so on.

Basic smartctl commands

Replace /dev/sdX (Linux) or /dev/sda (Windows example) with the actual device name you identified.

  • Display All SMART Information: This command shows all SMART information for a drive, including its model, serial number, firmware version, power-on hours, SMART attributes, error logs, and self-test logs.
    sudo smartctl -a /dev/sdX
    For Windows (using Command Prompt or PowerShell):
    smartctl -a /dev/sda
    If you are using PowerShell and the device path includes a comma (e.g., for some RAID controllers), you might need to quote the path and specify the device type:
    smartctl -a -d ata "/dev/csmi0,0"
  • Check SMART Health Status Only: This provides a quick pass/fail status.
    sudo smartctl -H /dev/sdX
    Note: A “PASSED” status doesn’t always mean the drive is perfectly healthy, as some issues might not have crossed the failure threshold yet.
  • Enable SMART (if disabled): SMART is usually enabled by default. If it’s not:
    sudo smartctl -s on /dev/sdX
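
Because a “PASSED” line alone can hide problems, it can help to look at smartctl’s exit status as well. The smartctl(8) man page documents it as a bitmask, with bits 3 to 7 describing health findings. The sketch below decodes that bitmask; the function name decode_smartctl_status is hypothetical, and the messages paraphrase the man page.

```shell
# decode_smartctl_status is a hypothetical helper. It prints one line per bit
# set in smartctl's exit status (bit meanings paraphrased from smartctl(8)).
decode_smartctl_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "no problems flagged"
    return 0
  fi
  bit=0
  while [ "$bit" -le 7 ]; do
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      case "$bit" in
        0) echo "bit 0: command line did not parse" ;;
        1) echo "bit 1: device open failed or device is in a low-power mode" ;;
        2) echo "bit 2: a SMART or other ATA command failed, or checksum error" ;;
        3) echo "bit 3: SMART status check returned DISK FAILING" ;;
        4) echo "bit 4: prefail attributes at or below threshold" ;;
        5) echo "bit 5: some attributes were at or below threshold in the past" ;;
        6) echo "bit 6: device error log contains errors" ;;
        7) echo "bit 7: device self-test log contains errors" ;;
      esac
    fi
    bit=$((bit + 1))
  done
}
```

A possible use is to run `sudo smartctl -H /dev/sdX` and then call `decode_smartctl_status $?` on the next line.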

Running self-tests

SMART self-tests can help diagnose drive issues. It’s advisable to run these during periods of low server activity, as they can add some load to the disk.

  • Start a Short Self-Test: Checks the electrical and mechanical performance as well as the read performance of the disk. Typically takes 1-5 minutes.
    sudo smartctl -t short /dev/sdX
  • Start a Long Self-Test: A more thorough version of the short test, scanning the entire disk surface. This can take several hours depending on the drive size and speed.
    sudo smartctl -t long /dev/sdX
  • View Self-Test Progress/Log: After starting a test, you can check its progress or view the results of past tests. For a test that is still running, the log shows the percentage remaining.
    sudo smartctl -l selftest /dev/sdX
  • Cancel/Abort a Running Self-Test:
    sudo smartctl -X /dev/sdX

Understanding smartctl output

The output of smartctl -a is divided into several sections:

Information section

Provides general information about the drive:

  • Device Model: The model number of the drive.
  • Serial Number: The unique serial number of the drive.
  • Firmware Version: The firmware version currently running on the drive.
  • User Capacity: The usable size of the drive.
  • Power_On_Hours (in Attributes section): How long the drive has been powered on in total.

SMART overall-health self-assessment test result

SMART overall-health self-assessment test result: PASSED

This line gives a quick summary. While PASSED is good, it doesn’t guarantee the drive is free of issues. A FAILED result almost certainly means the drive needs replacement.

SMART attributes table

This is a crucial section listing various drive health attributes. Here’s how to read the columns:

  • ID#: The attribute ID number.
  • ATTRIBUTE_NAME: A descriptive name for the attribute.
  • FLAG: Internal flags (not typically used for direct diagnosis).
  • VALUE: The current normalized value of the attribute, usually from 1 to 253. Higher is generally better.
  • WORST: The lowest VALUE ever recorded for this attribute.
  • THRESH: The threshold value. If the VALUE drops to or below THRESH, the drive is considered to have failed for that attribute (a pre-failure condition for Pre-fail attributes).
  • TYPE: Indicates how the attribute behaves (e.g., Pre-fail, Old_age).
  • UPDATED: When the attribute is updated (e.g., Always, Offline).
  • WHEN_FAILED: If this attribute has ever crossed its threshold, this column will indicate when (e.g., FAILING_NOW or In_the_past). If it’s -, it has not failed.
  • RAW_VALUE: The raw, unnormalized data for the attribute. The meaning of this value varies widely between attributes and manufacturers. Some are direct counts (like Power_On_Hours), while others are complex or proprietary values.
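
The column definitions above can be applied mechanically. The following is a minimal sketch, assuming the ten-column ATA attribute table printed by `smartctl -A`. The function name flag_attrs is hypothetical. It follows the smartctl documentation in treating an attribute as failed when the normalized value is less than or equal to the threshold, and it reads from standard input so it can be tried against saved output.

```shell
# flag_attrs (hypothetical helper) reads `smartctl -A` output on stdin and
# reports attributes whose normalized VALUE or WORST is at or below THRESH.
# Rows with THRESH 000 are skipped because they can never trip.
flag_attrs() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 {
    if ($4 + 0 <= $6 + 0) {
      print $2 ": failing now (VALUE " $4 " <= THRESH " $6 ")"
    } else if ($5 + 0 <= $6 + 0) {
      print $2 ": failed in the past (WORST " $5 " <= THRESH " $6 ")"
    }
  }'
}
```

For example, `sudo smartctl -A /dev/sdX | flag_attrs` prints nothing when no attribute has ever reached its threshold.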

Key Attributes to Monitor:

  • 5 Reallocated_Sector_Ct: Raw value indicates the count of remapped sectors. When the drive encounters a bad sector, it attempts to transfer the data to a special reserved spare area and mark the original sector as unusable. Any value greater than 0 is a concern and indicates physical media degradation.
  • 9 Power_On_Hours: Raw value is the number of hours the drive has been powered on. Useful for assessing drive age. (1 year = 8760 hours).
  • 187 Reported_Uncorrect: Raw value is the count of uncorrectable errors reported to the operating system. Any non-zero value that is increasing is a serious concern.
  • 190 Airflow_Temperature_Cel / 194 Temperature_Celsius: Raw value is the current drive temperature in Celsius. Consistently high temperatures can reduce drive lifespan.
  • 196 Reallocation_Event_Count: Raw value is the total number of attempts (both successful and unsuccessful) to transfer data from reallocated sectors to spare sectors.
  • 197 Current_Pending_Sector_Ct: Raw value indicates the count of “unstable” sectors that are waiting to be remapped. If these sectors are successfully read later, the count decreases. If they cannot be read, the drive will attempt to reallocate them, and Reallocated_Sector_Ct may increase. A non-zero value here is a warning sign.
  • 198 Offline_Uncorrectable / Uncorrectable_Sector_Ct: Raw value is the count of uncorrectable errors found during offline scans or normal operations. Any non-zero value is a serious concern.
  • 199 UDMA_CRC_Error_Count: Raw value indicates the number of CRC errors during data transfer between the host and the drive over the interface cable. This often points to a faulty SATA cable, bad connection, or issues with the host controller or drive electronics, rather than the disk media itself.

Note on RAW_VALUE Interpretation: Some RAW_VALUE fields, like Raw_Read_Error_Rate or Seek_Error_Rate, can show very large numbers even on healthy drives. Manufacturers use these fields differently, and they may not be direct error counts. Focus on the key attributes listed above and look for changes in VALUE relative to THRESH.
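
As a rough illustration of focusing on the key attributes, the sketch below filters `smartctl -A` output down to the media-error counters (IDs 5, 187, 197, and 198) and prints only those with a non-zero raw value. The function name key_attr_report is hypothetical, and the sketch assumes RAW_VALUE is the tenth column, as in the ATA attribute table.

```shell
# key_attr_report (hypothetical helper) reads `smartctl -A` output on stdin and
# prints the key media-error attributes whose RAW_VALUE is greater than zero.
key_attr_report() {
  awk '$1 ~ /^(5|187|197|198)$/ && NF >= 10 {
    if ($10 + 0 > 0) { print $2 " raw=" $10; found = 1 }
  }
  END { if (!found) print "no key attributes above zero" }'
}
```

Matching on the ID rather than the name is deliberate, since attribute names can differ slightly between drives and smartmontools versions.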

SMART error log

This section shows a log of the most recent errors detected by the drive. Each error entry typically includes:

  • Error number and power-on lifetime (in hours) when the error occurred.
  • The command that led to the error.
  • Register status indicating the type of error (e.g., UNC for Uncorrectable Error).

If errors are logged recently (relative to the drive’s current Power_On_Hours), it’s a strong indicator of an active problem.
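
To judge how recent a logged error is, subtract the error’s power-on lifetime from the drive’s current Power_On_Hours. The sketch below does this for each entry in `smartctl -l error` output. The function name hours_since_errors is hypothetical, and the sketch assumes the ATA error log wording "Error N occurred at disk power-on lifetime: H hours".

```shell
# hours_since_errors (hypothetical helper): $1 is the current Power_On_Hours
# raw value; `smartctl -l error` output arrives on stdin. For each logged
# error it prints how many power-on hours ago the error occurred.
hours_since_errors() {
  awk -v now="$1" '/^Error [0-9]+ occurred at disk power-on lifetime:/ {
    print "Error " $2 ": " (now - $8) " power-on hours ago"
  }'
}
```

A small difference, such as a few hours or days, points to an active problem, while a difference of thousands of hours points to an old event.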

SMART self-test log

This section lists the results of self-tests performed on the drive. Each entry shows:

  • Test type (e.g., Short offline, Extended offline).
  • Status (e.g., Completed without error, Failed..., Aborted by host).
  • Remaining percentage, the power-on lifetime (in hours) at which the test ran, and the LBA of the first error, if applicable.

A failed self-test is a clear indication that the drive should be replaced.
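
Failed entries can be picked out of a long self-test log with a simple filter. This is a sketch only: the function name failed_selftests is hypothetical, and it assumes failure statuses contain the word "failure" or the phrase "Fatal or unknown error", as in the ATA self-test log.

```shell
# failed_selftests (hypothetical helper) reads `smartctl -l selftest` output on
# stdin and prints log entries whose status reports a failure.
# Its exit status is non-zero when no failed entry is found.
failed_selftests() {
  grep -E '^# *[0-9]+ ' | grep -Ei 'failure|fatal or unknown error'
}
```

For example, `sudo smartctl -l selftest /dev/sdX | failed_selftests` prints only the failed tests, if any.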

smartctl with software RAID (Linux)

If your server uses Linux software RAID (e.g., /dev/md0, /dev/md1):

  1. Identify the individual physical disks that make up the RAID array: cat /proc/mdstat Example output: Personalities : [raid1] md0 : active raid1 sda1[0] sdb1[1] 1048512 blocks super 1.2 [2/2] [UU] In this example, sda and sdb are the member disks.
  2. Run smartctl commands on each physical member disk (e.g., /dev/sda and /dev/sdb), not on the RAID array device (e.g., /dev/md0). sudo smartctl -a /dev/sda sudo smartctl -a /dev/sdb
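
The two steps above can be sketched as a small filter that turns /proc/mdstat text into a list of whole-disk devices. The function name md_member_disks is hypothetical. The sketch assumes member names in the sdX1 style shown in the example, plus the nvme0n1p1 and mmcblk0p1 partition naming, and it may need adjusting for other layouts.

```shell
# md_member_disks (hypothetical helper) reads /proc/mdstat-style text on stdin
# and prints the whole-disk device behind each array member, one per line.
md_member_disks() {
  awk '/^md[0-9]+ :/ {
    for (i = 3; i <= NF; i++) {
      if ($i !~ /\[[0-9]+\]/) continue   # skip words such as "active" or "raid1"
      dev = $i
      sub(/\[.*/, "", dev)               # drop the "[0]" role suffix and any flags
      if (dev ~ /^(nvme|mmcblk)/) {
        sub(/p[0-9]+$/, "", dev)         # nvme0n1p1 -> nvme0n1
      } else {
        sub(/[0-9]+$/, "", dev)          # sda1 -> sda
      }
      print "/dev/" dev
    }
  }' | sort -u
}
```

One way to use it is `md_member_disks < /proc/mdstat`, then run smartctl against each device it prints.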

smartctl with hardware RAID (Linux & Windows)

smartctl often cannot query SMART data directly through a hardware RAID controller. The controller presents a logical drive to the OS, not the individual physical drives.

  • Identifying the Controller: If you try to run smartctl on a device that is part of a hardware RAID array, you might see an error indicating the controller type, for example: Smartctl open device: /dev/sdb [megaraid_disk_00] failed: INQUIRY failed This indicates a MegaRAID controller.
  • Using Controller-Specific Utilities: It’s best to use the RAID controller’s own utilities to check the health of the array and its individual drives (e.g., MegaCli, perccli, storcli, tw_cli, arcconf, sas2ircu, sas3ircu). Please refer to documentation specific to your RAID controller. Liquid Web support can assist with this on managed servers.
  • Limited SMART Access (Pass-through): Some hardware RAID controllers support “pass-through” commands that allow smartctl to query individual drives behind the controller. This often requires specific syntax:
    • For LSI MegaRAID controllers:
      sudo smartctl -a -d megaraid,N /dev/sdX
      Where N is the disk ID as seen by the controller (e.g., 0, 1, 2…). /dev/sdX is often /dev/sg0, /dev/sg1, etc. (SCSI generic devices).
    • For Adaptec AACRAID controllers (example for Windows PowerShell):
      smartctl -a /dev/sda -d "aacraid,0,0,0"
      The exact parameters depend on the controller and driver.
    Checking the RAID array’s overall health status via the controller’s utility is typically the most reliable method for hardware RAID.
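
Where pass-through is supported, `smartctl --scan` may already list the device path and `-d` type for each drive. Whether disks behind a controller appear depends on the controller, driver, and smartmontools version. The sketch below turns scan output into ready-to-run commands without executing anything. The function name scan_to_commands is hypothetical, and the sketch assumes the "DEVICE -d TYPE # comment" line layout.

```shell
# scan_to_commands (hypothetical helper) reads `smartctl --scan` output on stdin
# and prints one `smartctl -a` command per detected device.
# Nothing is executed; review the list before running any of it.
scan_to_commands() {
  awk '$2 == "-d" && $1 ~ /^\/dev\// { print "smartctl -a -d " $3 " " $1 }'
}
```

For example, `sudo smartctl --scan | scan_to_commands` prints the candidate commands for review.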

cPanel SMART email warnings

cPanel & WHM can monitor SMART status and send email notifications if potential issues are detected.

Airflow_Temperature_Cel warning

You might receive an email from cPanel with a subject like [cPanel smartd] Device: /dev/sda, SMART Attribute: Airflow_Temperature_Cel Changed or similar, indicating a temperature concern:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   060   045   045    Old_age   Always   In_the_past 40 (Lifetime Min/Max 34/42)

This means the drive’s temperature exceeded a predefined threshold at some point (WHEN_FAILED shows In_the_past). The current VALUE (060) has since returned to normal, but SMART keeps the WORST value (045), which still sits at the THRESH value (045). cPanel might continue to send notifications for that reason, or simply because a WHEN_FAILED state was logged.

  • Action:
    1. Check current drive temperatures using smartctl -A /dev/sdX (look for Temperature_Celsius or Airflow_Temperature_Cel).
    2. Ensure server cooling is adequate (fans are operational, vents are clear).
    3. If temperatures are currently normal and the warning relates to a past event, you can monitor the drive. If the warnings are persistent or concerning, please contact Liquid Web support. We can help assess the situation. In some cases, cPanel’s smartd notifications for past temperature events can be managed, or if the drive is indeed problematic, replacement can be arranged.
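
For the first step, the current temperature can be pulled out of the attribute table directly. This is a minimal sketch: the function name drive_temps is hypothetical, and it assumes the first token of RAW_VALUE for attributes 190 and 194 is the current temperature in Celsius, as in the sample row above.

```shell
# drive_temps (hypothetical helper) reads `smartctl -A` output on stdin and
# prints the current temperature from attributes 190 and 194, if present.
drive_temps() {
  awk '$1 ~ /^(190|194)$/ && NF >= 10 { print $2 ": " $10 " C" }'
}
```

For example, `sudo smartctl -A /dev/sdX | drive_temps` prints one line per temperature attribute the drive reports.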

Other cPanel SMART warnings

Important: If you receive any other SMART warning email from cPanel (e.g., for Reallocated_Sector_Ct, Current_Pending_Sector_Ct, etc.), it likely indicates a more serious underlying issue with the drive.

  • Action: Contact Liquid Web support immediately.
  • Please do not run self-tests (smartctl -t short /dev/sdX or smartctl -t long /dev/sdX) on drives that are already reporting errors through cPanel unless specifically instructed by our support team. Running tests can put additional stress on a failing drive and potentially accelerate failure. Our team will assess the smartctl -a output and advise on the best course of action.

Interpreting results & taking action

Interpreting SMART data requires careful consideration, as not every non-zero raw value or warning indicates imminent failure. However, certain patterns and attributes are strong indicators of trouble.

  • Focus on Key Failure Indicators: Pay close attention to Reallocated_Sector_Ct, Current_Pending_Sector_Ct, Offline_Uncorrectable, and Reported_Uncorrect. Non-zero and/or increasing values for these attributes are serious.
  • Look for Trends: Are error counts increasing over time? Are new errors appearing in the SMART Error Log?
  • Recent Errors in Log: Errors in the SMART Error Log that occurred recently (relative to the drive’s total Power_On_Hours) are more concerning than very old errors.
  • Failed Self-Tests: A failed self-test (SMART Self-test log) is a strong reason to replace the drive.
  • Drive Age (Power_On_Hours):
    • Traditional HDDs over 3-5 years of age (approx. 26,280 – 43,800 Power_On_Hours) have a statistically higher chance of failure. Consider proactive replacement for critical drives in this age range, especially if they show any other warning signs.
    • SSDs have different wear mechanisms (primarily write endurance, measured by attributes like Media_Wearout_Indicator or Percentage_Used), but age can still be a factor.
  • Example of a Failing Drive: A drive might show Reported_Uncorrect with a raw value of 18. The SMART Error Log might show recent Error: UNC at LBA... (Uncorrectable Read Error). If the drive’s current Power_On_Hours is very close to the hours logged for these errors, it indicates an ongoing problem. Combined with symptoms like server sluggishness or I/O errors in system logs, this would strongly suggest the drive needs replacement.
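
The drive-age figures above follow from dividing Power_On_Hours by 8760 hours per year. A small sketch of that conversion, with the hypothetical function name poh_to_years:

```shell
# poh_to_years (hypothetical helper) converts a Power_On_Hours raw value to
# years, using 8760 hours per year.
poh_to_years() {
  awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 8760 }'
}

poh_to_years 26280   # prints 3.0
poh_to_years 43800   # prints 5.0
```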

When to contact Liquid Web support

It’s always best to contact Liquid Web support if you are unsure about interpreting smartctl data or suspect a drive issue. Please be prepared to provide the output of sudo smartctl -a /dev/sdX.

Contact us if you observe:

  • SMART overall-health self-assessment test result: FAILED.
  • Non-zero and/or increasing values for key attributes: Reallocated_Sector_Ct, Current_Pending_Sector_Ct, Offline_Uncorrectable, Reported_Uncorrect.
  • Failed SMART self-tests.
  • Recent errors in the SMART Error Log.
  • A critical SMART warning from cPanel.
  • Symptoms of drive failure on the server (slow I/O, file system errors, unexpected hangs) combined with suspicious smartctl data.

Gathering comprehensive drive information (Linux)

If requested by support, or for your own detailed diagnostics, the following commands can provide a comprehensive overview of your disk subsystem. It’s often helpful to save this output to a file.

echo "== [ Hostname ] =="
hostname

# Get SMART data for all SATA/SCSI drives
# Run these one by one for each relevant drive if the loop doesn't suit your needs
for i in /dev/[hs]d?; do
  [ -e "$i" ] || continue  # skip the unexpanded pattern when nothing matches
  echo -e "\n== [ SMART data for $i ] =="
  sudo smartctl -a "$i"
done

echo -e "\n== [ Mount Points ] =="
mount | sort

echo -e "\n== [ Filesystem Disk Space Usage ] =="
df -h

echo -e "\n== [ Disk IDs ] =="
ls -l /dev/disk/by-id/

# If using software RAID
if [ -f /proc/mdstat ]; then
  echo -e "\n== [ Software RAID Status (mdstat) ] =="
  cat /proc/mdstat
fi

echo -e "\n== [ Block Devices (lsblk) ] =="
lsblk

You can redirect the output of these commands to a file:

{ # Start of command group
echo "== [ Hostname ] =="
hostname
for i in /dev/[hs]d?; do
  [ -e "$i" ] || continue  # skip the unexpanded pattern when nothing matches
  echo -e "\n== [ SMART data for $i ] =="
  sudo smartctl -a "$i"
done
echo -e "\n== [ Mount Points ] =="
mount | sort
echo -e "\n== [ Filesystem Disk Space Usage ] =="
df -h
echo -e "\n== [ Disk IDs ] =="
ls -l /dev/disk/by-id/
if [ -f /proc/mdstat ]; then
  echo -e "\n== [ Software RAID Status (mdstat) ] =="
  cat /proc/mdstat
fi
echo -e "\n== [ Block Devices (lsblk) ] =="
lsblk
} > ~/drive_diagnostics.txt # End of command group, output redirected

Then, you can view ~/drive_diagnostics.txt or provide it to support.

Conclusion

smartctl is an invaluable tool for monitoring the health of your server’s storage devices. By understanding how to use it and interpret its output, you can proactively identify potential drive issues and take steps to prevent data loss. When in doubt, especially with managed servers, always reach out to Liquid Web support for assistance with drive diagnostics and replacement.