Metrics
The details of the GPU metrics exposed by Hardware Observer using dcgm-exporter and node-exporter are as follows:
Metric Name | Description | Labels |
---|---|---|
DCGM_FI_DEV_GPU_TEMP | GPU temperature (in C) | DCGM_FI_DEV_BAR1_TOTAL, DCGM_FI_DEV_BRAND, DCGM_FI_DEV_CC_MODE, DCGM_FI_DEV_COMPUTE_MODE, DCGM_FI_DEV_COUNT, DCGM_FI_DEV_CUDA_COMPUTE_CAPABILITY, DCGM_FI_DEV_ECC_CURRENT, DCGM_FI_DEV_ECC_INFOROM_VER, DCGM_FI_DEV_ENFORCED_POWER_LIMIT, DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_GPU_MAX_OP_TEMP, DCGM_FI_DEV_INFOROM_IMAGE_VER, DCGM_FI_DEV_MAX_MEM_CLOCK, DCGM_FI_DEV_MAX_SM_CLOCK, DCGM_FI_DEV_MINOR_NUMBER, DCGM_FI_DEV_NAME, DCGM_FI_DEV_OEM_INFOROM_VER, DCGM_FI_DEV_PERSISTENCE_MODE, DCGM_FI_DEV_POWER_MGMT_LIMIT, DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN, DCGM_FI_DEV_SERIAL, DCGM_FI_DEV_SHUTDOWN_TEMP, DCGM_FI_DEV_SLOWDOWN_TEMP, DCGM_FI_DEV_VBIOS_VERSION, DCGM_FI_DEV_VIRTUAL_MODE, DCGM_FI_DRIVER_VERSION, DCGM_FI_NVML_VERSION, Hostname, UUID, device, gpu, modelName, pci_bus_id |
DCGM_FI_DEV_POWER_USAGE | Power draw (in W) | Same as DCGM_FI_DEV_GPU_TEMP |
DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
DCGM_FI_DEV_FAN_SPEED | Fan speed (in %, 0-100) | Same as DCGM_FI_DEV_GPU_TEMP |
DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency (in MHz) | Same as DCGM_FI_DEV_GPU_TEMP |
DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Throttling reasons bitmask | Same as DCGM_FI_DEV_GPU_TEMP |
node_hwmon_chip_names | Annotation metric for human-readable chip names | chip, chip_name |
node_hwmon_temp_celsius | Hardware monitor for temperature (input) | chip, sensor |
node_hwmon_power_average_watt | Hardware monitor for power usage in watts (average) | chip, sensor |
node_hwmon_freq_freq_mhz | Hardware monitor for GPU frequency in MHz | sensor, chip |
node_hwmon_fan_rpm | Hardware monitor for fan revolutions per minute (input) | sensor, chip |
node_hwmon_fan_max_rpm | Hardware monitor for fan revolutions per minute (max) | sensor, chip |
node_drm_card_info | Card information | card, chip, memory_vendor, power_performance_level, unique_id |
node_drm_gpu_busy_percent | How busy the GPU is as a percentage | card, chip |
node_drm_memory_vram_used_bytes | The used amount of VRAM in bytes | card, chip |
node_drm_memory_vram_size_bytes | The size of VRAM in bytes | card, chip |
NOTE: This is the subset of metrics used for alerts and the GPU dashboard. Please see this file to learn about other DCGM metrics.
NOTE: Metrics prefixed with node_ are provided by the node_exporter DRM and HWmon collectors for any GPU using open-source drivers. node_exporter is deployed by the grafana-agent charm, not hardware-observer. The metrics are reported here for convenience.
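As an illustration of how the two VRAM metrics in the table relate, the following minimal sketch derives a usage percentage from a pair of sample values. The function name `vram_used_percent` is hypothetical and not part of Hardware Observer; the inputs are assumed to be values already scraped from node_drm_memory_vram_used_bytes and node_drm_memory_vram_size_bytes for the same card and chip labels.

```python
def vram_used_percent(used_bytes: float, size_bytes: float) -> float:
    """Return VRAM usage as a percentage of total VRAM.

    used_bytes: a sample of node_drm_memory_vram_used_bytes
    size_bytes: a sample of node_drm_memory_vram_size_bytes
    (both assumed to come from the same card/chip label set)
    """
    if size_bytes <= 0:
        # A zero or negative size would make the ratio meaningless.
        raise ValueError("size_bytes must be positive")
    return 100.0 * used_bytes / size_bytes


if __name__ == "__main__":
    # Example: 4 GiB used out of 16 GiB total.
    print(vram_used_percent(4 * 1024**3, 16 * 1024**3))
```

The same ratio can be computed directly in a dashboard query; the Python form is only meant to make the arithmetic explicit.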
Alerts
The details of the alerts that Hardware Observer provides for NVIDIA GPUs are as follows:
Alert Rule Name | Description | Severity |
---|---|---|
GPUPowerBrakeThrottle | NVIDIA GPU Hardware Power Brake Slowdown throttling detected | Warning |
GPUThermalHWThrottle | NVIDIA GPU Hardware Thermal throttling detected | Warning |
GPUThermalSWThrottle | NVIDIA GPU Software Thermal throttling detected | Warning |
GPUSyncBoostThrottle | NVIDIA GPU Sync Boost throttling detected | Warning |
GPUSlowdownThrottle | NVIDIA GPU Hardware Slowdown throttling detected | Warning |
GPUPowerThrottle | NVIDIA GPU Software Power throttling detected | Warning |
For more details, please see NVIDIA Clocks Throttle reasons.
Throttling detection is currently only available for NVIDIA GPUs.
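Because DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is a bitmask, several throttle reasons can be active at once. The sketch below decodes a sample value into the alert names from the table above. The bit values follow the NVML clocks throttle reason constants; the pairing of each bit with an alert rule name is an assumption made for illustration, and the helper `decode_throttle_reasons` is hypothetical rather than part of Hardware Observer.

```python
# Bit values from the NVML clocks throttle reason constants.
# The mapping to alert rule names is an assumption for illustration.
THROTTLE_BITS = {
    0x04: "GPUPowerThrottle",       # software power cap
    0x08: "GPUSlowdownThrottle",    # hardware slowdown
    0x10: "GPUSyncBoostThrottle",   # sync boost
    0x20: "GPUThermalSWThrottle",   # software thermal slowdown
    0x40: "GPUThermalHWThrottle",   # hardware thermal slowdown
    0x80: "GPUPowerBrakeThrottle",  # hardware power brake slowdown
}


def decode_throttle_reasons(bitmask: int) -> list[str]:
    """Return the alert names whose bit is set, in ascending bit order."""
    return [name for bit, name in sorted(THROTTLE_BITS.items()) if bitmask & bit]


if __name__ == "__main__":
    # 0x44 has the software power cap and hardware thermal bits set.
    print(decode_throttle_reasons(0x44))
```

Bits not listed in the mapping, such as the GPU idle reason (0x01), are ignored, so a value with only those bits set decodes to an empty list.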