hardware-observer docs - DCGM

DCGM snap tracks

The DCGM snap is split into different tracks depending on the CUDA version. Each NVIDIA driver version corresponds to a specific CUDA version. Check the upstream documentation for more information.

Track        CUDA versions
v3           10, 11, 12
v4-cuda-11   11
v4-cuda-12   12
v4-cuda-13   13
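
The track listing above can be read as a simple lookup. The following is a minimal sketch, not charm code: the names TRACK_CUDA_VERSIONS and compatible_tracks are hypothetical and exist only for illustration. It returns every track that supports a given CUDA major version.

```python
# Track compatibility, transcribed from the track listing above.
# The dictionary and function names are illustrative, not part of the charm.
TRACK_CUDA_VERSIONS = {
    "v3": {10, 11, 12},
    "v4-cuda-11": {11},
    "v4-cuda-12": {12},
    "v4-cuda-13": {13},
}


def compatible_tracks(cuda_major: int) -> list[str]:
    """Return every track whose supported CUDA versions include cuda_major."""
    return [
        track
        for track, versions in TRACK_CUDA_VERSIONS.items()
        if cuda_major in versions
    ]


print(compatible_tracks(12))  # ['v3', 'v4-cuda-12']
```

CUDA 11 and 12 are each covered by two tracks, which is why the charm needs a rule for choosing between v3 and v4.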

DCGM v3

From revision 113 to 570, the charm config option dcgm-snap-channel defaults to latest/stable, which installs DCGM v3 and is compatible with CUDA versions 10, 11, and 12.

DCGM v4

DCGM v4 is no longer shipped as a single monolithic package. Instead, the installation assets are split among several packages, allowing clients to opt out of installing assets that do not apply to their use case.

dcgm-snap-channel options

auto

Starting with revision 571, the default value of dcgm-snap-channel is auto. With this option, the charm automatically detects the installed NVIDIA driver version and selects the appropriate snap track.

v3

To stay strictly on v3, set dcgm-snap-channel to v3/stable, v3/candidate, or v3/edge. This forces units to use DCGM v3. If the snap is already running v4 or is incompatible with v3, the unit enters blocked status.

v4

dcgm-snap-channel also accepts v4/stable, v4/candidate, or v4/edge. This forces units to use DCGM v4. If the snap is already running v3 or is incompatible with v4, the unit enters blocked status.
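
The blocked-status rule for the forced v3 and v4 options can be sketched as a small check. This is only an illustration under stated assumptions: should_block and SUPPORTED_CUDA are hypothetical names, and "incompatible" is interpreted here as the CUDA major version falling outside the versions listed for that DCGM major version in the track listing. The charm's actual check may differ.

```python
from typing import Optional

# CUDA major versions covered by each DCGM major version, per the track listing.
# Assumption: "incompatible" means the CUDA version is outside this set.
SUPPORTED_CUDA = {"v3": {10, 11, 12}, "v4": {11, 12, 13}}


def should_block(channel: str, installed_major: Optional[str], cuda_major: int) -> bool:
    """Return True if a forced channel cannot be honoured (hypothetical helper).

    channel: configured dcgm-snap-channel value, e.g. "v3/stable".
    installed_major: "v3" or "v4" for an already-installed snap, None if absent.
    cuda_major: CUDA major version matching the installed driver.
    """
    forced_major = channel.split("/", 1)[0]
    if forced_major not in SUPPORTED_CUDA:
        raise ValueError(f"not a forced channel: {channel!r}")
    if installed_major is not None and installed_major != forced_major:
        return True  # the snap already runs the other major version
    return cuda_major not in SUPPORTED_CUDA[forced_major]


print(should_block("v3/stable", "v4", 12))  # True
print(should_block("v4/stable", None, 13))  # False
```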

Upgrade Behavior

When upgrading to revision 571 or later, if dcgm-snap-channel was left at its default value, the charm switches to auto.

This means the DCGM snap automatically upgrades to v4, on the track that corresponds to the detected driver version.

If the installed driver is CUDA 10 compatible, the charm will continue using v3/stable instead of upgrading to v4.
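
The resolution performed by auto can be sketched as follows. This is a hedged illustration rather than the charm's implementation: resolve_auto_channel is a hypothetical name, the function takes the CUDA major version as input instead of detecting the driver, and the "/stable" risk level on the v4 tracks is an assumption (the text only states it for v3/stable).

```python
def resolve_auto_channel(cuda_major: int) -> str:
    """Pick a snap channel for dcgm-snap-channel=auto (hypothetical helper)."""
    if cuda_major == 10:
        # Stated in the text: CUDA 10 drivers stay on v3/stable.
        return "v3/stable"
    if cuda_major in (11, 12, 13):
        # Track names come from the track listing; "/stable" is an assumption.
        return f"v4-cuda-{cuda_major}/stable"
    raise ValueError(f"no DCGM track known for CUDA {cuda_major}")


for version in (10, 11, 12, 13):
    print(version, "->", resolve_auto_channel(version))
```

Raising an error for unknown CUDA versions is a choice made for this sketch; the source does not say how the charm handles that case.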