Metrics and Alert rules provided by Hardware Observer for S.M.A.R.T.

Metrics

Hardware Observer exposes the following S.M.A.R.T. metrics through smartctl_exporter:

| Metric Name | Description | Labels |
| --- | --- | --- |
| smartctl_device | Device Info | ata_additional_product_id, device, ata_version, firmware_version, form_factor, interface, model_family, model_name, protocol, sata_version, scsi_vendor, scsi_product, scsi serial_number, scsi_revision |
| smartctl_devices | Number of devices configured or dynamically discovered | |
| smartctl_device_attribute | Device attributes | attribute_flags_long, attribute_flags_short, attribute_id, attribute_name, attribute_value_type, device |
| smartctl_device_available_spare | Normalized percentage (0 to 100%) of the remaining spare capacity available | device |
| smartctl_device_available_spare_threshold | When the Available Spare falls below the threshold indicated in this field, an asynchronous event completion may occur. The value is indicated as a normalized percentage (0 to 100%) | device |
| smartctl_device_block_size | Device block size | blocks_type, device |
| smartctl_device_bytes_read | | device |
| smartctl_device_bytes_written | | device |
| smartctl_device_capacity_blocks | Device capacity in blocks | device |
| smartctl_device_capacity_bytes | Device capacity in bytes | device |
| smartctl_device_nvme_capacity_bytes | NVMe device total capacity bytes | device |
| smartctl_device_critical_warning | This field indicates critical warnings for the state of the controller | device |
| smartctl_device_interface_speed | Device interface speed, bits per second | device, speed_type |
| smartctl_device_media_errors | Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field | device |
| smartctl_device_num_err_log_entries | Contains the number of Error Information log entries over the life of the controller | device |
| smartctl_device_error_log_count | Device S.M.A.R.T. error log count | device, error_log_type |
| smartctl_device_percentage_used | Contains a vendor specific estimate of the percentage of NVM subsystem life used. A value of 100 indicates that the estimated endurance of the NVM in the NVM subsystem has been consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed 100. Percentages greater than 254 shall be represented as 255. This value shall be updated once per power-on hour (when the controller is not in a sleep state). | device |
| smartctl_device_power_cycle_count | Device power cycle count | device |
| smartctl_device_power_on_seconds | Device power on seconds | device |
| smartctl_device_rotation_rate | Device rotation rate | device |
| smartctl_device_smart_status | General S.M.A.R.T. status | device |
| smartctl_device_smartctl_exit_status | Exit status of smartctl on device | device |
| smartctl_device_statistics | Device statistics | device, statistic_table, statistic_name, statistic_flags_short, statistic_flags_long |
| smartctl_device_temperature | Device temperature celsius | device, temperature_type |
| smartctl_version | smartctl version | build_info, json_format_version, smartctl_version, svn_revision |
| smartctl_device_self_test_log_count | Device S.M.A.R.T. self test log count | device, self_test_log_type |
| smartctl_device_self_test_log_error_count | Device S.M.A.R.T. self test log error count | device, self_test_log_type |
| smartctl_device_erc_seconds | Device S.M.A.R.T. Error Recovery Control Seconds | device, op_type |
| smartctl_scsi_grown_defect_list | Device SCSI grown defect list counter | device |
| smartctl_read_errors_corrected_by_rereads_rewrites | Read Errors Corrected by ReReads/ReWrites | device |
| smartctl_read_errors_corrected_by_eccfast | Read Errors Corrected by ECC Fast | device |
| smartctl_write_errors_corrected_by_eccdelayed | Write Errors Corrected by ECC Delayed | device |
| smartctl_write_total_uncorrected_errors | Write Total Uncorrected Errors | device |
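
As a rough illustration of how these metrics and labels look to a consumer, the Python sketch below parses a small sample of Prometheus text-format output and lists the devices whose smartctl_device_smart_status is 0. The sample text, the device names, the temperature_type value, and the helper names parse_samples and failing_devices are all hypothetical and chosen for illustration only. The parser handles only simple sample lines and is not a full implementation of the exposition format.

```python
import re

# Hypothetical exposition-format sample. The device names and values are
# invented for illustration and do not come from a real host.
SAMPLE = """\
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="nvme0"} 1
smartctl_device_smart_status{device="sda"} 0
smartctl_device_temperature{device="nvme0",temperature_type="current"} 38
smartctl_devices 2
"""

# metric_name, optional {labels}, whitespace, value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# key="value" pairs inside the braces
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def failing_devices(samples):
    """Return sorted device names whose general S.M.A.R.T. status is 0."""
    return sorted(
        labels["device"]
        for name, labels, value in samples
        if name == "smartctl_device_smart_status" and value == 0
    )


if __name__ == "__main__":
    print(failing_devices(parse_samples(SAMPLE)))
```

In practice a maintained parser, or a query against Prometheus itself, would be a better choice than hand-rolled regular expressions; the sketch only shows how the device label ties a sample to a disk.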

Alerts

Hardware Observer provides the following alert rules for S.M.A.R.T.:

| Alert Rule Name | Description | Severity |
| --- | --- | --- |
| SmartctlCriticalWarning | Critical warnings present for the state of the controller | Critical |
| SmartctlDeviceSmartStatusFail | S.M.A.R.T. status for device is 0 | Critical |
| SmartctlExitStatusFail | Non-zero exit status for smartctl command | Warning |
| SmartclDeviceAttributeFailureWarning | S.M.A.R.T. attributes correlating strongly with failure have been detected | Warning |
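
To make the conditions in the table more concrete, the following Python sketch evaluates the first three alerts against a list of samples. The predicates are paraphrased from the descriptions above and are assumptions: the actual rule expressions shipped with Hardware Observer may differ, for example in thresholds or evaluation windows. The attribute-based alert is left out because the table does not say which attributes it inspects. The names RULES and evaluate, and the sample data in the usage part, are hypothetical.

```python
# (alert name, metric it is assumed to watch, firing condition, severity)
# The conditions are inferred from the alert descriptions, not copied from
# the real rule files.
RULES = [
    ("SmartctlCriticalWarning",
     "smartctl_device_critical_warning", lambda v: v > 0, "Critical"),
    ("SmartctlDeviceSmartStatusFail",
     "smartctl_device_smart_status", lambda v: v == 0, "Critical"),
    ("SmartctlExitStatusFail",
     "smartctl_device_smartctl_exit_status", lambda v: v != 0, "Warning"),
]


def evaluate(samples):
    """Return (alert, severity, device) for every sample that trips a rule.

    samples is an iterable of (metric_name, device, value) tuples.
    """
    samples = list(samples)
    firing = []
    for alert, metric, predicate, severity in RULES:
        for name, device, value in samples:
            if name == metric and predicate(value):
                firing.append((alert, severity, device))
    return firing


if __name__ == "__main__":
    # Invented readings for two hypothetical devices.
    readings = [
        ("smartctl_device_critical_warning", "nvme0", 0),
        ("smartctl_device_smart_status", "nvme0", 1),
        ("smartctl_device_smart_status", "sda", 0),
        ("smartctl_device_smartctl_exit_status", "sda", 4),
    ]
    for alert, severity, device in evaluate(readings):
        print(f"{severity}: {alert} on {device}")
```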