hardware-observer docs - index

Hardware Observer

Hardware Observer is a subordinate machine charm that monitors hardware resources on bare-metal infrastructure and raises alerts about them. It uses the following exporters to provide detailed metrics:

  • Hardware Exporter: For collecting metrics from BMCs and RAID controllers.

  • Smartctl Exporter: For collecting SMART metrics from storage devices.

  • DCGM Exporter: For collecting metrics from NVIDIA GPUs (if present).

This charm is ideal for monitoring hardware resources when used in conjunction with the Canonical Observability Stack.

Hardware Exporter

Hardware Observer collects and exports Prometheus metrics from BMCs (over IPMI and the newer Redfish protocol) and from various SAS and RAID controllers, using the prometheus-hardware-exporter project. It also configures Prometheus alert rules that fire when a metric reports a suboptimal status.

Appropriate collectors and alert rules are installed depending on which of the following RAID/SAS controllers are present on the machine:

  • Broadcom MegaRAID controller

  • Dell PowerEdge RAID Controller

  • LSI SAS-2 controller

  • LSI SAS-3 controller

  • HPE Smart Array controller
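
The availability-based selection described above can be pictured with a minimal sketch. It is not the charm's actual implementation: the collector identifiers and the `select_collectors` helper are hypothetical names invented for illustration, and only the controller families come from the list above.

```python
from enum import Enum


class Controller(Enum):
    """RAID/SAS controller families named in the list above."""

    MEGARAID = "Broadcom MegaRAID controller"
    POWEREDGE = "Dell PowerEdge RAID Controller"
    LSI_SAS_2 = "LSI SAS-2 controller"
    LSI_SAS_3 = "LSI SAS-3 controller"
    HPE_SMART_ARRAY = "HPE Smart Array controller"


# Hypothetical collector identifiers; the real exporter may name them differently.
COLLECTORS = {
    Controller.MEGARAID: "collector.megaraid",
    Controller.POWEREDGE: "collector.poweredge_raid",
    Controller.LSI_SAS_2: "collector.lsi_sas_2",
    Controller.LSI_SAS_3: "collector.lsi_sas_3",
    Controller.HPE_SMART_ARRAY: "collector.hpe_smart_array",
}


def select_collectors(detected):
    """Return the collectors for the detected controllers, in declaration order."""
    return [COLLECTORS[c] for c in Controller if c in detected]


if __name__ == "__main__":
    # A machine with two controller families gets two collectors; one with none gets none.
    print(select_collectors({Controller.LSI_SAS_3, Controller.MEGARAID}))
    print(select_collectors(set()))
```

Keeping the mapping in one table means that supporting a new controller family only requires a new entry, while machines without that hardware are unaffected.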

Smartctl Exporter

The Smartctl Exporter integrates with Hardware Observer to monitor storage device health through SMART data. Metrics are collected and exported to Prometheus using the smartctl-exporter-snap.

DCGM Exporter

NOTE: requires revision ≥ 113

The DCGM exporter integrates with Hardware Observer to monitor NVIDIA GPUs by collecting various metrics. These metrics are then exported to Prometheus using the DCGM snap, enabling GPU performance tracking and monitoring. The snap is only installed if the charm detects the presence of NVIDIA GPUs.
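
As a rough illustration of the conditional install described above, the sketch below gates a hypothetical install decision on the minimum revision from the note and on whether an NVIDIA GPU appears in `lspci`-style text. How the charm really detects GPUs is not stated here, so the `lspci` parsing, the function names, and the sample text are all assumptions.

```python
MIN_DCGM_REVISION = 113  # from the note above: DCGM support requires revision >= 113

# PCI class names that lspci prints for GPU-like devices.
GPU_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")


def has_nvidia_gpu(lspci_output: str) -> bool:
    """Return True if any GPU-class line in the lspci text names NVIDIA."""
    for line in lspci_output.splitlines():
        is_gpu_class = any(cls in line for cls in GPU_CLASSES)
        if is_gpu_class and "nvidia" in line.lower():
            return True
    return False


def should_install_dcgm(charm_revision: int, lspci_output: str) -> bool:
    """Hypothetical gate: the revision is new enough and an NVIDIA GPU is present."""
    return charm_revision >= MIN_DCGM_REVISION and has_nvidia_gpu(lspci_output)


# Illustrative lspci-style text, not captured from a real machine.
SAMPLE = """\
00:02.0 VGA compatible controller: Intel Corporation Device 3e92
01:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
"""

if __name__ == "__main__":
    print(should_install_dcgm(113, SAMPLE))
```

Passing the `lspci` text in as a string keeps the check easy to test without touching real hardware.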

Security, bugs and feature requests

If you find a bug in this application or want to request a specific feature, use the following channels:

  • Raise issues or feature requests on GitHub.
  • Report security issues in Hardware Observer through Launchpad. Please do not file GitHub issues about security issues.

Contributing

Please see the Juju SDK docs for best-practice guidelines on enhancing this charm, and CONTRIBUTING.md for developer guidance.

License

Hardware Observer is free software, distributed under the Apache Software License, version 2.0. See LICENSE for more information.

Navigation

Mapping table
| Level | Path | Navlink |
|-------|------|---------|
| 1 | tutorial | Tutorial |
| 1 | how-to | How to |
| 2 | integrate-with-cos | Integrate with COS |
| 2 | monitor-hw-raid-controller | Monitor hardware RAID controllers |
| 2 | migrate-from-hw-health | Migrate from hw-health |
| 1 | explanation | Explanation |
| 2 | hw-support-detection | Hardware support detection |
| 2 | charm-lifecycle | Charm lifecycle |
| 2 | exporters | Exporters |
| 2 | cryptography | Cryptography |
| 1 | reference | Reference |
| 2 | resources | Resources |
| 2 | configurations | Configurations |
| 2 | integrations | Integrations |
| 2 | logs | Logs |
| 2 | dashboards | Dashboards |
| 2 | metrics-and-alerts | Metrics and alerts |
| 3 | metrics-and-alerts-common | Common |
| 3 | metrics-and-alerts-ipmi | IPMI |
| 3 | metrics-and-alerts-redfish | Redfish |
| 3 | metrics-and-alerts-megaraid | MegaRAID |
| 3 | metrics-and-alerts-poweredge | PowerEdge RAID |
| 3 | metrics-and-alerts-sas | LSI SAS |
| 3 | metrics-and-alerts-hpe | HPE Smart Array |
| 3 | metrics-and-alerts-smart | S.M.A.R.T. |
| 3 | metrics-and-alerts-gpu | GPU |
