I have a 6-node MicroK8s cluster with Charmed Kubeflow deployed.
I performed some network maintenance, which meant the VMs running MicroK8s were disconnected from the network for around 5 minutes. After this they had some connectivity issues, so I restarted the nodes. After restarting all 6 nodes, MicroK8s comes up successfully and forms a quorum, yet everything deployed in the kubeflow namespace is in an Unknown state. I've seen some old articles from 2019 where users were having a similar issue, but it was patched, so I'm keen to see if anyone has any suggestions.
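For reference, these are the kinds of checks that show the symptom on each node (standard MicroK8s commands; adjust if you call kubectl differently):

microk8s kubectl get nodes -o wide
microk8s kubectl get pods -n kubeflow -o wide        # STATUS shows Unknown for the Kubeflow workloads
journalctl -u snap.microk8s.daemon-kubelite -e       # kubelite logs on the node
microk8s inspect                                     # generates the inspection report tarball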
In the log report I see that disk performance is poor, and this causes the k8s services to give up. See for example:
Sep 23 09:25:51 gpuworker-1 microk8s.daemon-kubelite[1475]: Trace[562495236]: ---"Object stored in database" 8120ms (09:25:51.360)
Sep 23 09:25:51 gpuworker-1 microk8s.daemon-kubelite[1475]: Trace[562495236]: [8.120782622s] [8.120782622s] END
Sep 23 09:25:51 gpuworker-1 microk8s.daemon-kubelite[1475]: F0923 09:25:51.363899 1475 server.go:213] leaderelection lost
Sep 23 09:25:51 gpuworker-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 09:25:51 gpuworker-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Sep 23 09:25:51 gpuworker-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 1.
It took k8s over 8 seconds to persist an object to the datastore. If the performance of the underlying storage is acceptable, is it possible that I/O-heavy workloads are present on these nodes?
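A quick way to check that while the cluster is settling, assuming the sysstat and iotop packages are available on the nodes:

iostat -x 2        # watch %util and w_await for the disk backing the datastore
sudo iotop -o      # show only processes currently doing I/O

If the disk shows high utilisation or long write waits while kubelite is timing out, something else is competing for it.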
There appear to be no performance issues on the VM’s disk:
ubuntu@gpuworker-1:~$ dd if=/dev/zero of=./test.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.48401 s, 432 MB/s
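For completeness, a 1 GiB sequential dd write mostly measures streaming throughput, while the datastore does small synchronous commits, so an fsync-latency check is closer to what kubelite is waiting on. A rough sketch, assuming fio is installed and is run from a directory on the same disk (the size and block values simply mirror the commonly used etcd-style test):

fio --name=fsync-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300

The fsync/fdatasync latency percentiles in the output are the numbers to compare, rather than MB/s.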