Juju is stuck after restarting AKS

I’m deploying Kubeflow with Juju. Here’s how I set it up:

az aks create --name kubeflow \
        --resource-group xxx \
        --location northeurope \
        --nodepool-name ctrlpool --node-vm-size Standard_E2ds_v5 \
        --enable-cluster-autoscaler --max-count 6 --min-count 1

az aks get-credentials --resource-group xxx --name kubeflow --overwrite-existing

juju add-k8s myk8s
juju bootstrap myk8s kubeflow-controller
juju add-model kubeflow
juju deploy kubeflow --channel 1.8/stable --trust

IP=""

while [ -z "$IP" ]; do
    IP=$(kubectl -n kubeflow get svc istio-ingressgateway-workload -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    sleep 5
done

juju config dex-auth public-url="http://${IP}.nip.io"
juju config oidc-gatekeeper public-url="http://${IP}.nip.io"

juju config dex-auth static-username="admin"
juju config dex-auth static-password="admin"
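As a side note, the wait loop above polls forever if the load balancer never gets an IP. A minimal sketch of a bounded wait follows; `wait_for_value` is a hypothetical helper name, not part of kubectl or Juju, and the 600-second timeout in the usage comment is an arbitrary assumption.

```shell
#!/bin/sh
# wait_for_value TIMEOUT INTERVAL CMD...
# Runs CMD repeatedly until it prints a non-empty value or TIMEOUT seconds pass.
# Prints the value and returns 0 on success; returns 1 on timeout.
# (Hypothetical helper, not part of kubectl or Juju.)
wait_for_value() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        value=$("$@" 2>/dev/null)
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Usage with the service from the post (timeout value is an assumption):
# IP=$(wait_for_value 600 5 kubectl -n kubeflow get svc istio-ingressgateway-workload \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}') || exit 1
```

If the helper returns non-zero, the script can stop before running the `juju config` commands with an empty IP.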

Everything works well until I restart my AKS cluster (or leave it idle for about 20 minutes); after that, I can’t do anything except run juju controllers.

When I look at the controller logs, I get:

In the last logs screenshot, you appear to have defaulted to the charm container. It shows that you can’t connect to the controller API.

Check what the api-server container is logging to find out why the API isn’t up.

I have a similar issue after just running microk8s stop/start.

$ k logs -n controller-microk8s-localhost pod/controller-0 -c api-server

 [jujud] 2024-02-05 14:55:42 ERROR juju.worker.dependency engine.go:695 "lease-manager" manifold worker returned unexpected error: catacomb 0xc0006f3da8 is dying
 [jujud] 2024-02-05 14:55:42 ERROR juju.worker.dependency engine.go:695 "db-accessor" manifold worker returned unexpected error: listen to 10.1.125.135:17666: listen tcp 10.1.125.135:17666: bind: cannot assign requested address
 [jujud] 2024-02-05 14:55:42 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [0f6f54] "controller-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
....
 [jujud] 2024-02-05 15:03:28 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [0f6f54] "controller-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
 [jujud] 2024-02-05 15:03:28 INFO juju.worker.dbaccessor worker.go:737 shutting down Dqlite node
 [jujud] 2024-02-05 15:03:28 ERROR juju.worker.dependency engine.go:695 "lease-manager" manifold worker returned unexpected error: catacomb 0xc000978728 is dying
 [jujud] 2024-02-05 15:03:28 ERROR juju.worker.dependency engine.go:695 "lease-expiry" manifold worker returned unexpected error: catacomb 0xc000978728 is dying
 [jujud] 2024-02-05 15:03:28 ERROR juju.worker.dependency engine.go:695 "db-accessor" manifold worker returned unexpected error: listen to **10.1.125.135:17666:** listen tcp 10.1.125.135:17666: bind: cannot assign requested address
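For anyone debugging the same thing: "bind: cannot assign requested address" means the address is not assigned to any interface in the container. One plausible reading (an assumption on my part, not something confirmed in this thread) is that the pod received a new IP after the restart while the db-accessor worker still tries to listen on the old one. Here is a rough sketch for comparing the two; `bind_addr_from_logs` and `check_bind_addr` are hypothetical helper names.

```shell
#!/bin/sh
# Print the IP from the first "bind: cannot assign requested address"
# log line read on stdin. (Hypothetical helper.)
bind_addr_from_logs() {
    sed -n 's/.*listen tcp \([0-9.]*\):[0-9]*: bind: cannot assign requested address.*/\1/p' \
        | head -n 1
}

# Compare the address the controller tries to bind with the pod's current IP.
# (Hypothetical helper.)
check_bind_addr() {
    wanted=$1
    current=$2
    if [ "$wanted" = "$current" ]; then
        echo "match"
    else
        echo "stale: controller wants $wanted but pod has $current"
    fi
}

# Usage against the controller pod from the post above:
# wanted=$(kubectl logs -n controller-microk8s-localhost pod/controller-0 -c api-server \
#     | bind_addr_from_logs)
# current=$(kubectl get pod -n controller-microk8s-localhost controller-0 \
#     -o jsonpath='{.status.podIP}')
# check_bind_addr "$wanted" "$current"
```

A "stale" result would be consistent with the reading above; a "match" would point to some other cause.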

Update: this should be addressed by this fix: https://github.com/juju/juju/pull/16863
