Metrics and alert rules provided by Hardware Observer for GPUs

Metrics

The GPU metrics exposed by Hardware Observer through dcgm-exporter and node_exporter are as follows:

| Metric Name | Description | Labels |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (in C) | DCGM_FI_DEV_BAR1_TOTAL, DCGM_FI_DEV_BRAND, DCGM_FI_DEV_CC_MODE, DCGM_FI_DEV_COMPUTE_MODE, DCGM_FI_DEV_COUNT, DCGM_FI_DEV_CUDA_COMPUTE_CAPABILITY, DCGM_FI_DEV_ECC_CURRENT, DCGM_FI_DEV_ECC_INFOROM_VER, DCGM_FI_DEV_ENFORCED_POWER_LIMIT, DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_GPU_MAX_OP_TEMP, DCGM_FI_DEV_INFOROM_IMAGE_VER, DCGM_FI_DEV_MAX_MEM_CLOCK, DCGM_FI_DEV_MAX_SM_CLOCK, DCGM_FI_DEV_MINOR_NUMBER, DCGM_FI_DEV_NAME, DCGM_FI_DEV_OEM_INFOROM_VER, DCGM_FI_DEV_PERSISTENCE_MODE, DCGM_FI_DEV_POWER_MGMT_LIMIT, DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN, DCGM_FI_DEV_SERIAL, DCGM_FI_DEV_SHUTDOWN_TEMP, DCGM_FI_DEV_SLOWDOWN_TEMP, DCGM_FI_DEV_VBIOS_VERSION, DCGM_FI_DEV_VIRTUAL_MODE, DCGM_FI_DRIVER_VERSION, DCGM_FI_NVML_VERSION, Hostname, UUID, device, gpu, modelName, pci_bus_id |
| DCGM_FI_DEV_POWER_USAGE | Power draw (in W) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_FAN_SPEED | Fan speed (in 0-100%) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency (in MHz) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Throttling reasons bitmask | Same as DCGM_FI_DEV_GPU_TEMP |
| node_hwmon_chip_names | Annotation metric for human-readable chip names | chip, chip_name |
| node_hwmon_temp_celsius | Hardware monitor for temperature (input) | chip, sensor |
| node_hwmon_power_average_watt | Hardware monitor for power usage in watts (average) | chip, sensor |
| node_hwmon_freq_freq_mhz | Hardware monitor for GPU frequency in MHz | sensor, chip |
| node_hwmon_fan_rpm | Hardware monitor for fan revolutions per minute (input) | sensor, chip |
| node_hwmon_fan_max_rpm | Hardware monitor for fan revolutions per minute (max) | sensor, chip |
| node_drm_card_info | Card information | card, chip, memory_vendor, power_performance_level, unique_id |
| node_drm_gpu_busy_percent | How busy the GPU is as a percentage | card, chip |
| node_drm_memory_vram_used_bytes | The used amount of VRAM in bytes | card, chip |
| node_drm_memory_vram_size_bytes | The size of VRAM in bytes | card, chip |

NOTE: This is the subset of metrics used for alerts and the GPU dashboard. Please see this file to learn about other DCGM metrics.

NOTE: Metrics prefixed with node_ are provided by the node_exporter DRM and hwmon collectors for any GPU that uses open-source drivers. node_exporter is deployed by the grafana-agent charm, not by hardware-observer; these metrics are listed here for convenience.
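As an illustration of how the node_drm_* metrics can be combined, the following minimal sketch computes VRAM usage per card from Prometheus text-format samples. The sample lines, label values, and helper names (`parse_samples`, `vram_used_percent`) are hypothetical and exist only for this example; they are not part of Hardware Observer or node_exporter.

```python
import re

# Hypothetical sample of node_exporter output. The label values and
# numbers are made up for illustration only.
SAMPLE = """\
node_drm_memory_vram_size_bytes{card="card0",chip="0000:03:00.0"} 8589934592
node_drm_memory_vram_used_bytes{card="card0",chip="0000:03:00.0"} 2147483648
"""

# Matches a labelled sample line: name{labels} value
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return {(metric_name, card): value} for labelled sample lines."""
    samples = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and unlabelled samples
        labels = dict(LABEL_RE.findall(match["labels"]))
        samples[(match["name"], labels.get("card"))] = float(match["value"])
    return samples


def vram_used_percent(samples):
    """Return {card: percentage of VRAM currently in use}."""
    result = {}
    for (name, card), size in samples.items():
        if name != "node_drm_memory_vram_size_bytes" or size == 0:
            continue
        used = samples.get(("node_drm_memory_vram_used_bytes", card))
        if used is not None:
            result[card] = 100.0 * used / size
    return result


usage = vram_used_percent(parse_samples(SAMPLE))
print(usage)
```

In a real deployment the same ratio would more likely be computed as a query in the monitoring stack; the sketch only shows how the two metrics relate through their shared card label.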

Alerts

Hardware Observer provides the following alerts for NVIDIA GPUs:

| Alert Rule Name | Description | Severity |
| --- | --- | --- |
| GPUPowerBrakeThrottle | NVIDIA GPU Hardware Power Brake Slowdown throttling detected | Warning |
| GPUThermalHWThrottle | NVIDIA GPU Hardware Thermal throttling detected | Warning |
| GPUThermalSWThrottle | NVIDIA GPU Software Thermal throttling detected | Warning |
| GPUSyncBoostThrottle | NVIDIA GPU Sync Boost throttling detected | Warning |
| GPUSlowdownThrottle | GPU Hardware Slowdown throttling detected | Warning |
| GPUPowerThrottle | GPU Software Power throttling detected | Warning |

For more details, please see NVIDIA Clocks Throttle reasons.
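Because DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is a bitmask, several throttle reasons can be active at once. The following minimal sketch decodes a metric value into the alert names listed above. The bit values follow the NVML clocks throttle reason constants; the pairing of each bit with an alert rule name is an assumption inferred from the alert descriptions, and the helper `decode_throttle_reasons` is hypothetical rather than part of Hardware Observer.

```python
# Bit values follow the NVML nvmlClocksThrottleReason* constants.
# ASSUMPTION: the pairing of each bit with an alert rule name is inferred
# from the alert descriptions and is not taken from the actual rule files.
THROTTLE_BITS = {
    0x04: ("software power cap", "GPUPowerThrottle"),
    0x08: ("hardware slowdown", "GPUSlowdownThrottle"),
    0x10: ("sync boost", "GPUSyncBoostThrottle"),
    0x20: ("software thermal slowdown", "GPUThermalSWThrottle"),
    0x40: ("hardware thermal slowdown", "GPUThermalHWThrottle"),
    0x80: ("hardware power brake slowdown", "GPUPowerBrakeThrottle"),
}


def decode_throttle_reasons(mask):
    """Return the alert names whose throttle bit is set in the bitmask."""
    mask = int(mask)  # exported metric values arrive as floats
    return [alert for bit, (_, alert) in THROTTLE_BITS.items() if mask & bit]


# 0x44 has the software power cap bit and the hardware thermal bit set.
print(decode_throttle_reasons(0x44))
```

Bits that are not listed, such as the GPU idle bit (0x01), do not correspond to any alert in the table and are ignored by this sketch.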

Throttling detection is currently only available for NVIDIA GPUs.