Kubernetes-worker charm #788. NVidia driver install fails

juvvi · 27 August 2021 16:09

I had to tear down my k8s install on a local MAAS cloud and redo it yesterday. MAAS sets up an Ubuntu 20.04 cloud.

When I deploy Charmed Kubernetes, “containerd/0” on the GPU node ends up in an error state because the NVIDIA driver installation fails. The containerd/0 logs on that machine show the following:

2021-08-27 15:39:53 INFO unit.containerd/0.juju-log server.go:314 status-set: maintenance: Installing Nvidia drivers.
2021-08-27 15:39:56 WARNING unit.containerd/0.install logger.go:60 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/751939d95516afc289908a19e447f0acc1506367f72ed356431a2b1a469cc8ca  404  Not Found [IP: 192.168.23.1 8000]
2021-08-27 15:39:56 WARNING unit.containerd/0.install logger.go:60 E: Some index files failed to download. They have been ignored, or old ones used instead.
2021-08-27 15:39:56 INFO unit.containerd/0.juju-log server.go:314 Installing ['cuda-drivers', 'nvidia-container-runtime'] with options: ['--option=Dpkg::Options::=--force-confold']
2021-08-27 15:39:57 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:39:57 INFO unit.containerd/0.juju-log server.go:314 Couldn't acquire DPKG lock. Will retry in 10 seconds
2021-08-27 15:40:07 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:40:07 INFO unit.containerd/0.juju-log server.go:314 Couldn't acquire DPKG lock. Will retry in 10 seconds
2021-08-27 15:40:17 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:40:17 INFO unit.containerd/0.juju-log server.go:314 Couldn't acquire DPKG lock. Will retry in 10 seconds
2021-08-27 15:40:27 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:40:27 ERROR unit.containerd/0.juju-log server.go:314 Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-containerd-0/charm/reactive/containerd.py", line 389, in configure_nvidia
    apt_install(NVIDIA_PACKAGES, fatal=True)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charmhelpers/fetch/ubuntu.py", line 303, in apt_install
    _run_apt_command(cmd, fatal, quiet=quiet)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charmhelpers/fetch/ubuntu.py", line 804, in _run_apt_command
    _run_with_retries(
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charmhelpers/fetch/ubuntu.py", line 781, in _run_with_retries
    result = subprocess.check_call(cmd, env=env, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['apt-get', '--assume-yes', '--option=Dpkg::Options::=--force-confold', 'install', 'cuda-drivers', 'nvidia-container-runtime']' returned non-zero exit status 100.
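
A note on the "Couldn't acquire DPKG lock" lines in the log above: they are probably misleading. apt-get exits with status 100 on most errors, including "Unable to locate package", so a retry wrapper keyed on that exit code cannot tell a held lock from a missing package. Below is a minimal sketch of that pattern. It is not the actual charmhelpers code; the function name, parameters, and defaults are assumptions chosen to mirror the timing in the log (three retries, 10 seconds apart).

```python
import subprocess
import time


def run_with_retries(cmd, max_retries=3, retry_exitcodes=(1, 100),
                     delay=10, sleep=time.sleep):
    """Run cmd, retrying when it exits with one of retry_exitcodes.

    Hypothetical helper for illustration only. Every retry logs the same
    lock message, whatever the real reason for the non-zero exit was.
    """
    retries = 0
    while True:
        try:
            return subprocess.check_call(cmd)
        except subprocess.CalledProcessError as err:
            retries += 1
            # Give up once the retry budget is spent, or at once if the
            # exit code is not one we consider retryable.
            if retries > max_retries or err.returncode not in retry_exitcodes:
                raise
            print("Couldn't acquire DPKG lock. Will retry in {} seconds".format(delay))
            sleep(delay)
```

With a command that always exits 100, this makes four attempts, prints the lock message three times, and then raises CalledProcessError, which matches the shape of the log and traceback.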

If I check that 404 URL manually (https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/751939d95516afc289908a19e447f0acc1506367f72ed356431a2b1a469cc8ca), it is indeed a 404.
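
For anyone who wants to repeat that check from a script, here is a minimal sketch using only the Python standard library. The function name and the injectable `opener` parameter are my own assumptions (the latter just makes it testable without network access); it sends a HEAD request and returns the status code.

```python
import urllib.error
import urllib.request


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for a HEAD request to url.

    Hypothetical helper: error statuses such as 404 arrive as an
    HTTPError exception, so the code is read from the exception.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling `http_status(...)` with the by-hash URL from the log should return 404 for as long as the index is missing upstream.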

The error looks to be in the apt ecosystem. I am looking at the reactive hook for containerd to see if I can do something manually. In the meantime, I would appreciate any advice on how to move past this.

juvvi · 27 August 2021 16:34

OK, manual intervention worked and juju status shows all green. Since the charm was working as of Aug 24, I am assuming this is a temporary apt ecosystem issue. This is what I did:

juju ssh containerd/0 (or whichever node fails for you)
Follow the instructions for manual NVIDIA driver installation, except that instead of installing cuda, install what the reactive hook installs: cuda-drivers and nvidia-container-runtime
Reboot the node

The Ubuntu 20.04 instructions for a local deb, in case a charm developer can compare them with what the reactive code is doing:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.4.1/local_installers/cuda-repo-ubuntu2004-11-4-local_11.4.1-470.57.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-4-local_11.4.1-470.57.02-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-4-local/7fa2af80.pub
sudo apt-get update
-sudo apt-get -y install cuda
+sudo apt-get -y install cuda-drivers
+sudo apt-get -y install nvidia-container-runtime
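
Before or after the steps above, it can help to confirm that apt actually sees an install candidate for each package. The following is a rough sketch, assuming the usual `apt-cache policy` output layout (a `Candidate:` line that reads `(none)` when nothing is installable); both function names are hypothetical.

```python
import subprocess


def candidate_version(policy_output):
    """Return the Candidate version from `apt-cache policy` text, or None."""
    for line in policy_output.splitlines():
        # Split on the first colon only, so versions with an epoch
        # (for example "1:2.3-4") stay intact in the value.
        key, _, value = line.strip().partition(":")
        if key == "Candidate":
            value = value.strip()
            return None if value == "(none)" else value
    return None


def apt_can_install(package):
    """Report whether apt currently lists an install candidate for package."""
    result = subprocess.run(["apt-cache", "policy", package],
                            capture_output=True, text=True, check=False)
    return candidate_version(result.stdout) is not None
```

On the affected node, `apt_can_install("cuda-drivers")` returning False would be consistent with the "Unable to locate package cuda-drivers" error in the log.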