Juju status is taking several minutes

I have a Juju k8s deployment of COS Lite that is very slow, even though it has only just been deployed and is handling no data.

Even calling juju status takes several minutes to return.

$ juju --debug status
17:34:53 INFO  juju.cmd supercommand.go:56 running juju [3.5.5 1d21840563a55809580976df0b4880e03323bbb7 gc go1.23.3]
17:34:53 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/29058/bin/juju", "--debug", "status"}
17:34:53 INFO  juju.juju api.go:86 connecting to API addresses: [10.152.183.149:17070]
17:35:07 INFO  juju.kubernetes.klog klog.go:113 Use tokens from the TokenRequest API or manually created secret-based tokens instead of auto-generated secret-based tokens.%!(EXTRA []interface {}=[])
17:37:52 DEBUG juju.api apiclient.go:509 starting proxier for connection
17:37:53 DEBUG juju.api apiclient.go:513 tunnel proxy in use at localhost on port 44243
17:37:53 DEBUG juju.api apiclient.go:689 looked up localhost -> [::1 127.0.0.1]
17:38:24 DEBUG juju.api apiclient.go:1036 successfully dialed "wss://localhost:44243/model/c46c85e4-92f2-460e-83df-67b4798d9f37/api"
17:38:24 DEBUG juju.api apiclient.go:1036 successfully dialed "wss://localhost:44243/model/c46c85e4-92f2-460e-83df-67b4798d9f37/api"
17:38:24 INFO  juju.api apiclient.go:571 connection established to "wss://localhost:44243/model/c46c85e4-92f2-460e-83df-67b4798d9f37/api"
Model               Controller          Cloud/Region        Version  SLA          Timestamp
cos-robotics-model  rob-cos-controller  microk8s/localhost  3.5.4    unsupported  17:38:26+01:00

App           Version  Status       Scale  Charm             Channel        Rev  Address         Exposed  Message
alertmanager  0.27.0   waiting          1  alertmanager-k8s  latest/stable  138  10.152.183.176  no       waiting for container
catalogue              blocked          1  catalogue-k8s     latest/stable   75  10.152.183.186  no       ERROR cannot ensure service account "unit-catalogue-0": Internal error occurred: resource quota evaluation timed out
grafana       9.5.3    maintenance    0/1  grafana-k8s       latest/stable  126  10.152.183.92   no
loki          2.9.6    waiting          1  loki-k8s          latest/stable  181  10.152.183.222  no       waiting for container
prometheus    2.52.0   maintenance    0/1  prometheus-k8s    latest/stable  221  10.152.183.142  no
traefik                waiting          1  traefik-k8s       latest/stable  223  10.152.183.199  no       waiting for units to settle down

Unit             Workload  Agent      Address      Ports  Message
alertmanager/0*  active    executing  10.1.161.50
catalogue/0*     blocked   executing  10.1.161.12         ERROR cannot ensure service account "unit-catalogue-0": Internal error occurred: resource quota evaluation timed out
grafana/0*       unknown   lost       10.1.161.7          agent lost, see 'juju show-status-log grafana/0'
loki/0*          active    executing  10.1.161.19         (start)
prometheus/0*    unknown   lost       10.1.161.23         agent lost, see 'juju show-status-log prometheus/0'
traefik/0*       active    executing  10.1.161.61         (start) Serving at 10.157.61.68
17:38:27 DEBUG juju.api monitor.go:35 RPC connection died
17:38:27 INFO  cmd supercommand.go:556 command finished

Is there anything I could check in my deployment to speed things up?

The machine has 4 cores and 8 GB of RAM, and nothing seems to be running at 100%.
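
For reference, a quick way to double-check whether it is disk I/O rather than CPU or RAM that is the bottleneck while juju status runs (a rough sketch; it assumes vmstat and top are available in the VM, which they normally are on Ubuntu):

$ vmstat 1 10                   # watch the "wa" (I/O wait) and "id" (idle) CPU columns
$ top -b -n 1 | head -n 20      # one batch-mode snapshot of the busiest processes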

juju version → 3.5.5-genericlinux-amd64

MicroK8s v1.31.5 revision 7593

In parallel, I started the exact same setup in a fresh VM, and that fresh VM is much more responsive.

$ juju --debug status
12:19:50 INFO  juju.cmd supercommand.go:56 running juju [3.6.2 87cae7505aee356eda90d98ae345e1c11eb26c72 gc go1.23.4]
12:19:50 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/29493/bin/juju", "--debug", "status"}
12:19:50 INFO  juju.juju api.go:86 connecting to API addresses: [10.152.183.92:17070]
12:19:50 INFO  juju.kubernetes.klog klog.go:113 Use tokens from the TokenRequest API or manually created secret-based tokens instead of auto-generated secret-based tokens.%!(EXTRA []interface {}=[])
12:19:50 DEBUG juju.api apiclient.go:508 starting proxier for connection
12:19:50 DEBUG juju.api apiclient.go:512 tunnel proxy in use at localhost on port 44069
12:19:50 DEBUG juju.api apiclient.go:1035 successfully dialed "wss://localhost:44069/model/672b5425-1660-435f-8d33-646f863f7f69/api"
12:19:50 INFO  juju.api apiclient.go:570 connection established to "wss://localhost:44069/model/672b5425-1660-435f-8d33-646f863f7f69/api"
Model           Controller           Cloud/Region        Version  SLA          Timestamp
microk8s-model  microk8s-controller  microk8s/localhost  3.6.2    unsupported  12:19:50+01:00

App           Version  Status  Scale  Charm             Channel        Rev  Address         Exposed  Message
alertmanager  0.27.0   active      1  alertmanager-k8s  latest/stable  144  10.152.183.148  no
catalogue              active      1  catalogue-k8s     latest/stable   79  10.152.183.124  no
grafana       9.5.3    active      1  grafana-k8s       latest/stable  128  10.152.183.230  no
loki          2.9.6    active      1  loki-k8s          latest/stable  184  10.152.183.218  no
prometheus    2.52.0   active      1  prometheus-k8s    latest/stable  226  10.152.183.233  no
traefik       2.11.0   active      1  traefik-k8s       latest/stable  226  10.152.183.211  no       Serving at 10.157.61.149

Unit             Workload  Agent      Address      Ports  Message
alertmanager/0*  active    idle       10.1.32.130
catalogue/0*     active    idle       10.1.32.181
grafana/0*       active    executing  10.1.32.132
loki/0*          active    executing  10.1.32.131
prometheus/0*    active    executing  10.1.32.134
traefik/0*       active    idle       10.1.32.129         Serving at 10.157.61.149
12:19:50 DEBUG juju.api monitor.go:35 RPC connection died
12:19:50 INFO  cmd supercommand.go:556 command finished

It appears that my previous VM slowed down over time. Is there anything that accumulates or could slow a Juju k8s deployment down over time?

Hey @gbeuzeboc, I’m not aware of anything in a deployment like this that would cause a slowdown such as the one you’re seeing. 8 GB of RAM seems quite low to me for a deployment of this scale; do you see the same issues in a VM with 16 GB? The same applies to the processor, which is also fairly low-powered for a deployment of this size.

While juju status is running, what is at the top of htop?
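
One way to capture that without an interactive session is to run juju status in the background and take a few batch-mode top snapshots while it is busy; a rough sketch, nothing here is specific to Juju:

$ juju status > /dev/null 2>&1 &     # let status run in the background
$ top -b -d 5 -n 3 | head -n 60      # three snapshots, 5 seconds apart, busiest processes first
$ wait                               # block until the backgrounded juju status finishes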

A fresh VM with the exact same deployment runs fine and Juju is responsive. I suspect rather that something is accumulating (I am repeatedly removing an application and redeploying it for tests).

When running juju status (or even when not running it), the main process is kubelite, averaging around 40% CPU and 1% MEM.
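
On MicroK8s, kubelite bundles the control-plane services (API server, scheduler, controller manager, kubelet, kube-proxy) into a single process, so its logs and the dqlite datastore logs are usually where the churn shows up. A rough sketch, assuming the standard MicroK8s service names:

$ sudo journalctl -u snap.microk8s.daemon-kubelite --since "10 minutes ago" | tail -n 50
$ sudo journalctl -u snap.microk8s.daemon-k8s-dqlite --since "10 minutes ago" | tail -n 50
$ pgrep -a kubelite                                  # note the kubelite PID
$ top -H -b -n 1 -p <kubelite-pid> | head -n 25      # per-thread view of what it is busy with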

I’ve had a similar issue to this in the past after suspending a VM. I don’t think I ever tracked down exactly what the cause was, but if I remember correctly a qemu process was very busy.

How are you removing the application? With --force? If that’s the case, it may be that things are not really being cleaned up in the background. It’s perhaps worth digging through lxc list, kubectl, and ps aux to look for zombie VMs, pods, and processes.
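
For example, something along these lines (using the MicroK8s-bundled kubectl, as elsewhere in this thread):

$ lxc list                              # any stray LXD instances
$ microk8s.kubectl get namespaces       # namespaces left behind by removed models
$ microk8s.kubectl get pods -A          # pods that should already have been cleaned up
$ ps aux --sort=-%cpu | head -n 15      # busiest processes on the host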

I am removing with juju remove-application app-name and not using --force.

I have been suspending my computer a lot with the VM running, and also suspending the VM independently. The VM itself is responsive; it’s just the Juju-related commands that are all slowed down.

It seems that the issue resides in MicroK8s. Any microk8s command seems to time out most of the time:

$ time microk8s.kubectl get pods -A
E0214 17:20:13.798554 1425426 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:16443/api?timeout=32s\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
E0214 17:21:16.890549 1425426 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:16443/api?timeout=32s\": context deadline exceeded"
I0214 17:21:28.176278 1425426 request.go:700] Waited for 1.803346428s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:16443/api?timeout=32s
E0214 17:21:53.324949 1425426 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:16443/api?timeout=32s\": net/http: TLS handshake timeout"
E0214 17:22:04.425500 1425426 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:16443/api?timeout=32s\": net/http: TLS handshake timeout"
E0214 17:22:21.702491 1425426 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:16443/api?timeout=32s\": net/http: TLS handshake timeout"
Unable to connect to the server: net/http: TLS handshake timeout

real    9m36.305s
user    0m1.663s
sys     2m12.600s
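
When the API server is timing out like this, microk8s inspect and a look at the dqlite datastore are reasonable next checks; a rough sketch (the datastore path is the usual MicroK8s location, adjust if yours differs):

$ microk8s inspect                                                # collects logs and runs basic health checks
$ sudo du -sh /var/snap/microk8s/current/var/kubernetes/backend   # size of the dqlite datastore
$ microk8s.kubectl get --raw='/readyz?verbose'                    # API server readiness checks, if it responds at all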

The juju models command lists far fewer models than there are entries in the microk8s.kubectl get pods -A output. I have a lot of hostpath-provisioner-cos-* pods in kube-system marked Completed but still present. I am not sure what they are or whether I should delete them.
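
For reference, those Completed pods can at least be listed separately for review; a sketch using the field selector for succeeded pods (only run the delete after double-checking the list):

$ microk8s.kubectl get pods -n kube-system --field-selector=status.phase==Succeeded
$ microk8s.kubectl delete pods -n kube-system --field-selector=status.phase==Succeeded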

This all sounds very similar to what I was experiencing: the same issue of the VM itself being fine, but MicroK8s taking forever to respond.

By any chance, are you using Multipass for your VM? This all seemed to disappear when I started to use LXD instead of Multipass.

I am indeed using Multipass.

Do you mean an LXD VM, or LXD containers only? Does Juju with k8s work well inside an LXD container?

LXD VM. I’m not positive that is what made it go away, but I haven’t noticed it since.
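
For anyone wanting to try the same route, launching an LXD VM sized like the machines discussed above looks roughly like this (image alias and resource limits are placeholders, not necessarily what was used here):

$ lxc launch ubuntu:24.04 cos-vm --vm -c limits.cpu=4 -c limits.memory=16GiB
$ lxc exec cos-vm -- snap install microk8s --classic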