My Canonical k8s cluster broke on every reboot: here's why

I wanted to share this debugging story in case anyone else runs into the same issue, but I didn’t really feel like writing a post about it manually, since this thing took me a loooong time to debug. Because I worked through the problem together with AI (GitHub Copilot, in the terminal), I just asked it to write this post for me. I definitely don’t like AI-written blog posts, but please accept this exception in the name of “it could be useful to someone with the same problem”.

TL;DR: Canonical k8s and Docker don’t play nicely together: both ship a containerd, and both default to the same /var/lib/containerd directory.

I hit this problem on Ubuntu 26.04, with k8s from the 1.32-classic/stable channel.

Enjoy :duck:


The problem

After every reboot, sudo k8s status returned:

dial tcp 127.0.0.1:6443: connect: connection refused

The Kubernetes API server was not running.
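
A quick way to confirm that nothing was listening on the API port at all (ss comes with iproute2, which Ubuntu ships by default; empty output means no listener):

$ sudo ss -tlnp | grep 6443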

Initial diagnosis

I checked the snap services:

$ sudo snap services k8s
k8s.containerd               enabled   inactive
k8s.kube-apiserver           enabled   inactive
k8s.kubelet                  enabled   inactive

All components were inactive. The common dependency is containerd. If containerd doesn’t start, nothing else does.
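
The snap’s containerd logs live in the journal under the unit name snap.k8s.containerd.service (snap services follow the snap.<name>.<app> pattern), so something like this pulls them for the current boot:

$ sudo journalctl -u snap.k8s.containerd.service -b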

The containerd journal showed:

level=warning msg="waiting for response from boltdb open" plugin=bolt
snap.k8s.containerd.service: start operation timed out. Terminating.

Containerd was hanging while trying to open its BoltDB metadata database, and systemd eventually killed it when the unit’s 5-minute start timeout expired.
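
That 5-minute limit is just systemd’s start timeout for the unit; you can read it back directly if you want to check yours:

$ systemctl show snap.k8s.containerd.service -p TimeoutStartUSec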

Root cause

The database file (meta.db) was only 4 MB, far too small for opening it to legitimately take minutes. BoltDB acquires an exclusive file lock (flock) on open, so a hang at that point means another process already holds the lock.
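
You can reproduce this kind of silent blocking with the flock(1) utility, which uses the same kernel primitive BoltDB relies on (a minimal sketch; /tmp/demo.lock is just a scratch file):

# Terminal 1: take an exclusive lock and sit on it
flock --exclusive /tmp/demo.lock -c 'sleep 300'

# Terminal 2: blocks with no output until terminal 1 releases the lock
flock --exclusive /tmp/demo.lock -c 'echo acquired'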

I checked:

$ sudo fuser -v /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
  root   2724 F...m containerd    ← /usr/bin/containerd (system, used by Docker)
  root 634950 F.... containerd    ← /snap/k8s/.../containerd (k8s snap)

Two containerd instances, from two separate installations (Docker’s and the k8s snap’s), were both configured with --root=/var/lib/containerd. The system containerd started first on boot and acquired the exclusive lock. The snap’s containerd started second and blocked on flock, waiting forever.

This is why the issue occurred on every reboot: Docker’s containerd always won the race.
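
Given two PIDs like the ones above, /proc is a quick way to confirm which binary each daemon actually is (the PIDs here are taken from the fuser output; yours will differ):

$ sudo ls -l /proc/2724/exe /proc/634950/exe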

Fix

Separate the two containerd instances by giving Docker’s containerd its own directories:

# Create a separate config for Docker's containerd
sudo mkdir -p /etc/containerd-docker
sudo /usr/bin/containerd config default | sudo tee /etc/containerd-docker/config.toml > /dev/null
sudo sed -i "s|root = '/var/lib/containerd'|root = '/var/lib/containerd-docker'|" /etc/containerd-docker/config.toml
sudo sed -i "s|state = '/run/containerd'|state = '/run/containerd-docker'|" /etc/containerd-docker/config.toml

# Override system containerd to use the new config
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/containerd.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/containerd --config /etc/containerd-docker/config.toml
EOF

# Override Docker to use the new containerd socket
sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd-docker/containerd.sock
EOF

sudo systemctl daemon-reload
sudo systemctl restart containerd docker
sudo snap start k8s

After this, the k8s snap’s containerd owns /var/lib/containerd, Docker’s containerd uses /var/lib/containerd-docker, and there is no lock contention. The cluster survives reboots.
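
To verify, rerun the earlier checks; fuser should now report exactly one owner per database (the second path assumes Docker’s containerd has recreated its metadata under the new root):

$ sudo k8s status
$ sudo fuser -v /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
$ sudo fuser -v /var/lib/containerd-docker/io.containerd.metadata.v1.bolt/meta.db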


Depending on your preference, you can also tell Canonical k8s to use a different directory at bootstrap time :slightly_smiling_face:

https://documentation.ubuntu.com/canonical-kubernetes/latest/snap/howto/install/dev-env/#conflicts
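
For a fresh install, the bootstrap-time route from those docs looks roughly like this; containerd-base-dir is the relevant config key, but double-check the exact spelling against the linked page for your version:

# Sketch: point Canonical k8s's containerd away from Docker's default paths
cat <<'EOF' > bootstrap-config.yaml
containerd-base-dir: /opt/k8s-containerd
EOF
sudo k8s bootstrap --file bootstrap-config.yaml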