Hardware Observer
Hardware-observer is a subordinate machine charm that provides monitoring and alerting of hardware resources on bare-metal infrastructure. This charm leverages the following exporters to provide detailed metrics:
- Hardware Exporter: For collecting metrics from BMCs and RAID controllers.
- Smartctl Exporter: For collecting SMART metrics from storage devices.
- DCGM Exporter: For collecting metrics from NVIDIA GPUs (if present).
This charm is ideal for monitoring hardware resources when used in conjunction with the Canonical Observability Stack.
Hardware Exporter
Hardware-observer collects and exports Prometheus metrics from BMCs (using the IPMI and newer Redfish protocols) and from various SAS and RAID controllers, using the prometheus-hardware-exporter project. It also configures Prometheus alert rules that fire when the status of any metric is suboptimal.
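As a rough illustration of what "suboptimal status" can mean for an exported metric, the sketch below parses a few lines of Prometheus text exposition output and lists the series whose value differs from a healthy value. The metric name `example_controller_ok`, the convention that `1` means healthy, and the helper names are assumptions made for this example; they are not taken from prometheus-hardware-exporter or from the charm's alert rules. The parser also assumes samples carry no trailing timestamp.

```python
from typing import List, Tuple

# Hypothetical sample of exporter output; the metric name is invented.
EXAMPLE_OUTPUT = """\
# HELP example_controller_ok Whether the controller reports a healthy state
# TYPE example_controller_ok gauge
example_controller_ok{controller="0"} 1
example_controller_ok{controller="1"} 0
"""


def parse_samples(text: str) -> List[Tuple[str, float]]:
    """Parse 'series value' lines, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


def suboptimal(samples: List[Tuple[str, float]], healthy: float = 1.0) -> List[str]:
    """Return the series whose value is not the assumed healthy value."""
    return [series for series, value in samples if value != healthy]


if __name__ == "__main__":
    for series in suboptimal(parse_samples(EXAMPLE_OUTPUT)):
        print(f"would alert on: {series}")
```

In a real deployment this comparison is expressed as a Prometheus alert rule and evaluated by Prometheus, not by a script; the sketch only shows the kind of condition such a rule encodes.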
Collectors and alert rules are installed according to which of the following RAID/SAS controllers are present:
- Broadcom MegaRAID controller
- Dell PowerEdge RAID Controller
- LSI SAS-2 controller
- LSI SAS-3 controller
- HPE Smart Array controller
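The selection described above can be pictured as a lookup from detected controllers to the collectors that should be enabled. The following is a minimal sketch under that assumption; the collector identifiers and the `select_collectors` function are hypothetical and do not reflect the charm's actual internals or detection logic.

```python
from typing import Dict, Iterable, List

# Controller names come from the list above; the collector identifiers
# on the right-hand side are invented for illustration.
COLLECTOR_FOR_CONTROLLER: Dict[str, str] = {
    "Broadcom MegaRAID controller": "megaraid",
    "Dell PowerEdge RAID Controller": "poweredge_raid",
    "LSI SAS-2 controller": "lsi_sas_2",
    "LSI SAS-3 controller": "lsi_sas_3",
    "HPE Smart Array controller": "hpe_smart_array",
}


def select_collectors(detected: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated collectors for the detected controllers.

    Controllers that are not in the mapping are ignored.
    """
    return sorted(
        {
            COLLECTOR_FOR_CONTROLLER[name]
            for name in detected
            if name in COLLECTOR_FOR_CONTROLLER
        }
    )


if __name__ == "__main__":
    found = ["LSI SAS-3 controller", "Broadcom MegaRAID controller"]
    print(select_collectors(found))
```

A machine with none of the listed controllers yields an empty list in this sketch, which corresponds to no controller-specific collectors being enabled.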
Smartctl Exporter
The Smartctl Exporter integrates with Hardware Observer to monitor storage device health using SMART data. Metrics are collected and exported to Prometheus using the smartctl-exporter-snap.
DCGM Exporter
NOTE: requires revision ≥ 113
The DCGM exporter integrates with Hardware Observer to monitor NVIDIA GPUs by collecting various metrics. These metrics are then exported to Prometheus using the DCGM snap, enabling GPU performance tracking and monitoring. The snap is only installed if the charm detects the presence of NVIDIA GPUs.
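How the charm detects NVIDIA GPUs is not described here. One plausible approach, shown purely as a sketch, is to scan PCI device listings for NVIDIA display devices. The `has_nvidia_gpu` function and the sample text are assumptions for this example; the sample mimics the general shape of `lspci` output and is not real output from any machine.

```python
# PCI device classes that typically denote a GPU in lspci-style listings.
GPU_CLASSES = ("VGA compatible controller", "3D controller")

# Invented sample listing used only to exercise the function below.
SAMPLE_LISTING = """\
00:1f.3 Audio device: Example Vendor Audio Controller
01:00.0 3D controller: NVIDIA Corporation Example GPU (rev a1)
01:00.1 Audio device: NVIDIA Corporation Example Audio Controller
"""


def has_nvidia_gpu(pci_listing: str) -> bool:
    """Return True if any line describes an NVIDIA display-class device."""
    for line in pci_listing.splitlines():
        if "NVIDIA" in line and any(cls in line for cls in GPU_CLASSES):
            return True
    return False


if __name__ == "__main__":
    if has_nvidia_gpu(SAMPLE_LISTING):
        print("NVIDIA GPU found: the GPU exporter would be installed")
    else:
        print("no NVIDIA GPU found: the GPU exporter would be skipped")
```

Checking the device class as well as the vendor avoids treating an NVIDIA audio function as a GPU in this sketch.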
Security, bugs and feature requests
If you find a bug in this application or want to request a feature, use the following links:
- Raise issues or feature requests on GitHub.
- Security issues in Hardware Observer can be reported through Launchpad. Please do not file GitHub issues about security issues.
Contributing
Please see the Juju SDK docs for best-practice guidelines on enhancing this charm, and CONTRIBUTING.md for developer guidance.
License
Hardware Observer is free software, distributed under the Apache Software License, version 2.0. See LICENSE for more information.