Yesterday I had to tear down my k8s install on a local MAAS cloud and redo it. MAAS sets up an Ubuntu 20.04 cloud.
When I deploy Charmed Kubernetes, "containerd/0" on the GPU node ends up in an error state because the NVIDIA driver installation fails. The containerd/0 logs on that machine show the following:
2021-08-27 15:39:53 INFO unit.containerd/0.juju-log server.go:314 status-set: maintenance: Installing Nvidia drivers.
2021-08-27 15:39:56 WARNING unit.containerd/0.install logger.go:60 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/751939d95516afc289908a19e447f0acc1506367f72ed356431a2b1a469cc8ca 404 Not Found [IP: 192.168.23.1 8000]
2021-08-27 15:39:56 WARNING unit.containerd/0.install logger.go:60 E: Some index files failed to download. They have been ignored, or old ones used instead.
2021-08-27 15:39:56 INFO unit.containerd/0.juju-log server.go:314 Installing ['cuda-drivers', 'nvidia-container-runtime'] with options: ['--option=Dpkg::Options::=--force-confold']
2021-08-27 15:39:57 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:39:57 INFO unit.containerd/0.juju-log server.go:314 Couldn't acquire DPKG lock. Will retry in 10 seconds
2021-08-27 15:40:07 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:40:07 INFO unit.containerd/0.juju-log server.go:314 Couldn't acquire DPKG lock. Will retry in 10 seconds
2021-08-27 15:40:17 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:40:17 INFO unit.containerd/0.juju-log server.go:314 Couldn't acquire DPKG lock. Will retry in 10 seconds
2021-08-27 15:40:27 WARNING unit.containerd/0.install logger.go:60 E: Unable to locate package cuda-drivers
2021-08-27 15:40:27 ERROR unit.containerd/0.juju-log server.go:314 Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-containerd-0/charm/reactive/containerd.py", line 389, in configure_nvidia
    apt_install(NVIDIA_PACKAGES, fatal=True)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charmhelpers/fetch/ubuntu.py", line 303, in apt_install
    _run_apt_command(cmd, fatal, quiet=quiet)
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charmhelpers/fetch/ubuntu.py", line 804, in _run_apt_command
    _run_with_retries(
  File "/var/lib/juju/agents/unit-containerd-0/.venv/lib/python3.8/site-packages/charmhelpers/fetch/ubuntu.py", line 781, in _run_with_retries
    result = subprocess.check_call(cmd, env=env, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['apt-get', '--assume-yes', '--option=Dpkg::Options::=--force-confold', 'install', 'cuda-drivers', 'nvidia-container-runtime']' returned non-zero exit status 100.
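The unit log above is noisy, so here is a minimal sketch of how the apt failures could be pulled out of it. The function name `summarize_apt_errors` and the two regular expressions are my own assumptions, written against the two `E:` message shapes visible in this log; they are not part of Juju or charmhelpers.

```python
import re

# Patterns modeled on the "E:" lines in the log above (assumed shapes).
FETCH_RE = re.compile(r"E: Failed to fetch (\S+)\s+(\d{3}) ")
MISSING_RE = re.compile(r"E: Unable to locate package (\S+)")


def summarize_apt_errors(log_text):
    """Return (failed_fetches, missing_packages) found in a unit log.

    failed_fetches is a list of (url, http_status) tuples in log order;
    missing_packages is a sorted list of unique package names.
    """
    failed = []
    missing = set()
    for line in log_text.splitlines():
        match = FETCH_RE.search(line)
        if match:
            failed.append((match.group(1), int(match.group(2))))
            continue
        match = MISSING_RE.search(line)
        if match:
            missing.add(match.group(1))
    return failed, sorted(missing)
```

Run against this log, it should collapse the repeated retries into one failed index fetch and one missing package, which makes it clearer that `cuda-drivers` cannot be located because the repository index was never downloaded.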
If I request the URL from the 404 manually, it does indeed return a 404:
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/751939d95516afc289908a19e447f0acc1506367f72ed356431a2b1a469cc8ca
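For anyone who wants to repeat the manual check, a rough sketch using only the Python standard library follows. The helper name `http_status` and its `opener` parameter are hypothetical, added so the function can be exercised without network access. Note that the `[IP: 192.168.23.1 8000]` in the log suggests apt on the node may be fetching through a local proxy, so a check from another machine may not take the same path.

```python
import urllib.error
import urllib.request


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Send a HEAD request and return the HTTP status code.

    `opener` defaults to urllib.request.urlopen; it is a parameter only
    so that a stand-in can be supplied when testing offline.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return err.code
```

Calling `http_status(...)` with the by-hash URL above would be expected to return 404 if the result matches what I saw.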
The error looks to be in the apt ecosystem. I am looking at the reactive hook for containerd to see if I can do something manually. In the meantime, I would appreciate any advice on how to move past this.