I have a multi-cloud controller (dwellir-sodertalje) running 4 remote lxd clouds with multiple users.
It was working perfectly fine until yesterday. However, now we’ve started having issues:
- Controller dwellir-sodertalje 2.9.35 seems unable to deploy or remove machines.
- The following sequence reproduces the error:
- juju add-model foobar dwellir
- juju deploy tiny-bash
tail -f /var/log/juju/models/admin-foobar-d84e4b.log
2023-01-13 12:55:21 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (10 more attempts)
2023-01-13 12:55:31 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (9 more attempts)
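For anyone who wants to check what the log is complaining about, here is a minimal sketch that lists the subjectAltName entries of a PEM certificate and tests whether a given IP address is among them. It assumes `openssl` 1.1.1 or newer is installed (for the `-ext` option). The helper names `cert_sans` and `cert_covers_ip` are made up for illustration and are not part of Juju or LXD.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Juju or LXD); assumes OpenSSL >= 1.1.1.

# Print the subjectAltName extension of the PEM certificate read from stdin.
cert_sans() {
    openssl x509 -noout -ext subjectAltName
}

# Exit 0 if the certificate on stdin lists the IP address $1 as a SAN.
# Entries are split one per line and compared as whole fixed strings, so
# 192.168.111.2 does not accidentally match 192.168.111.20.
cert_covers_ip() {
    cert_sans \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -qxF "IP Address:$1"
}
```

Against the endpoint from the log, this could be used as `openssl s_client -connect 192.168.111.2:8443 </dev/null 2>/dev/null | cert_covers_ip 192.168.111.2 && echo covered || echo "not covered"`, assuming the LXD endpoint completes a TLS handshake without a client certificate. That only shows the certificate LXD currently serves; it says nothing about which certificate the controller has stored for the cloud.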
If I bootstrap a new controller, everything works through that new controller, so LXD itself seems fine and I can use it directly without any issues.
I have had this issue before, and @tlm helped us look into it. It seemed to be resolved by adding and removing certificates, but we don’t know exactly which step fixed it in the end. The problem keeps coming back, and we really need help resolving it because we currently cannot work with the controller.
We are prepared to migrate to a new controller if needed, but we would need assistance with such a move, as resolving this is over our heads.
@tmihoc can you help us ping someone who could pull us loose?