Can't deploy - controller seems not to like certificates

I have a multi-cloud controller (dwellir-sodertalje) running 4 remote lxd clouds with multiple users.

It was working perfectly fine until yesterday. However, now we’ve started having issues:

  • Controller dwellir-sodertalje 2.9.35 seems unable to deploy or remove machines.

  • The following sequence renders the error:

  1. juju add-model foobar dwellir
  2. juju deploy tiny-bash

tail -f /var/log/juju/models/admin-foobar-d84e4b.log

2023-01-13 12:55:21 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (10 more attempts)
2023-01-13 12:55:31 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (9 more attempts)
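To see at a glance which address the controller is dialing and which names the presented certificate covers, the warnings above can be picked apart with a small script. This is only a sketch in Python: the regular expression is written against the message format visible in these log lines, and `parse_mismatch` is a hypothetical helper name, not something Juju provides.

```python
import re

# Pattern for the hostname-mismatch text seen in the provisioner warnings.
# Assumption: the message always has the shape
#   <Get|Post> "<url>": x509: certificate is valid for <a, b, ...>, not <host>
X509_MISMATCH = re.compile(
    r'(?P<method>Get|Post) "(?P<url>[^"]+)": '
    r"x509: certificate is valid for (?P<valid>.+?), not (?P<host>[^\s),]+)"
)


def parse_mismatch(line):
    """Return the URL, the names the cert covers, and the host that was dialed.

    Returns None when the line carries no x509 mismatch message.
    """
    match = X509_MISMATCH.search(line)
    if match is None:
        return None
    return {
        "url": match.group("url"),
        "valid_for": [name.strip() for name in match.group("valid").split(",")],
        "requested": match.group("host"),
    }


if __name__ == "__main__":
    import sys

    # Usage sketch: pipe the model log into the script on stdin.
    for log_line in sys.stdin:
        info = parse_mismatch(log_line)
        if info is not None:
            print(info["requested"], "not in", info["valid_for"])
```

Run against the lines above, this would report that 192.168.111.2 was requested while the certificate only lists 127.0.0.1 and ::1.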

If I bootstrap a new controller - it all works using that new controller, so it seems that LXD is all fine and I can use lxd without any issues.

I have had this issue before and @tlm helped us look at this, and it seemed to be resolved by us adding and removing certificates. But we don’t know exactly what we did in the end to resolve it then. This issue has been an ongoing situation and we really need help resolving this as we are now suffering from not being able to work with the controller.

We are prepared to migrate into a new controller if needed, but we need assistance doing such a move as this is over our heads to resolve.

@tmihoc can you help us ping someone who could pull us loose?

@tlm, @hmlanigan , @manadart ?

When I inspect the server.crt for the LXD host, I can see that the certificate is indeed valid for:

root@dwellir5:~# openssl x509 -noout -text -in /var/snap/lxd/common/lxd/server.crt

X509v3 Subject Alternative Name: DNS:dwellir5, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
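As a quick sanity check, the SAN entries printed by openssl can be tested against the address a client would dial. The following is a minimal Python sketch, assuming the SAN values are passed in exactly as shown above (without the `X509v3 Subject Alternative Name:` label). The function names are hypothetical, and DNS names are compared by exact match only, so wildcard entries are not handled.

```python
import ipaddress


def parse_san(san_values):
    """Split an openssl-style SAN value list into DNS names and IP addresses."""
    dns_names, ips = [], []
    for entry in san_values.split(","):
        # Split on the first colon only, so IPv6 addresses stay intact.
        kind, _, value = entry.strip().partition(":")
        if kind == "DNS":
            dns_names.append(value.lower())
        elif kind == "IP Address":
            ips.append(ipaddress.ip_address(value))
    return dns_names, ips


def covers(san_values, host):
    """Return True if host (an IP literal or a DNS name) appears in the SAN list."""
    dns_names, ips = parse_san(san_values)
    try:
        return ipaddress.ip_address(host) in ips
    except ValueError:
        # Not an IP literal, so treat it as a DNS name (exact match only).
        return host.lower() in dns_names


san = "DNS:dwellir5, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1"
for candidate in ("dwellir5", "127.0.0.1", "::1", "192.168.111.4"):
    print(candidate, covers(san, candidate))
```

With the SAN list above, only the hostname and the two loopback addresses are covered; the 192.168.111.x addresses are not.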

Perhaps this is a lead for what goes on here? @stgraber

Running locally the command from my remote client towards 192.168.111.4 (dwellir5) works just fine:

lxc list dwellir5:

So it seems the lxc client on my local client is all good with the server.crt whereas the juju controller doesn’t like the same?

I have read here https://github.com/lxc/lxd/issues/10286 and compared my server.crt with that of

root@dwellir5:~# lxc query /1.0 | jq -r .environment.certificate

… and they are the same.
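For anyone wanting to repeat that comparison without eyeballing two PEM blobs, here is a small Python sketch that compares SHA-256 fingerprints of the decoded certificates. It assumes each input holds exactly one PEM certificate (for example the contents of server.crt and the string returned by the `lxc query` command above); `cert_fingerprint` and `same_certificate` are hypothetical helper names.

```python
import hashlib
import ssl


def cert_fingerprint(pem_text):
    """Return the SHA-256 hex digest of a single PEM-encoded certificate.

    Decoding to DER first means differences in line endings or surrounding
    whitespace do not affect the result.
    """
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der).hexdigest()


def same_certificate(pem_a, pem_b):
    """Return True when both PEM texts encode the same certificate bytes."""
    return cert_fingerprint(pem_a) == cert_fingerprint(pem_b)
```

A usage sketch: read server.crt into one string, save the `jq` output into another, and call `same_certificate` on the pair.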

Would I need to re-create the server certs to reflect the lxd-remote IPs?

Hi @erik-lonroth and happy new year.

From memory, the lxd client code does not use the IP addresses baked into the certificate when validating the client-to-server connection. However, when it receives a certificate that it wasn’t expecting, it does perform this check.

I think the best place to start with this issue is to look at how you made the credentials for the cloud that Juju is using. Would you please provide as much detail as possible about how you packaged the lxd credentials and uploaded them to Juju for the model? Information on where you obtained the user and server certificate files from, and which key values they were plugged into, will be very helpful.

This is the best starting point to figure out the problem and we can arrange a phone call from there if need be.

Ta Tom


The credentials were added in 3 different ways.

  1. Extracted the lxd server.crt
  2. Extracted client.crt and client.key
  3. Manually added to credentials.yaml
  4. Ran juju add-credential
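To make clear which key each file ends up under in this first method, here is a rough Python sketch that assembles a credentials.yaml from the three PEM files. It assumes the `certificate` auth-type with the `client-cert`, `client-key` and `server-cert` attributes that Juju documents for LXD clouds. The credential name `admin` and the helper name are hypothetical, and the cloud name `dwellir` is taken from the add-model command earlier in the thread.

```python
import textwrap


def lxd_credentials_yaml(cloud, name, client_cert, client_key, server_cert):
    """Build the text of a credentials.yaml for one LXD certificate credential."""

    def block(pem):
        # YAML literal block: indent every PEM line below its key.
        return textwrap.indent(pem.strip() + "\n", " " * 8)

    return (
        "credentials:\n"
        f"  {cloud}:\n"
        f"    {name}:\n"
        "      auth-type: certificate\n"
        "      client-cert: |\n" + block(client_cert)
        + "      client-key: |\n" + block(client_key)
        + "      server-cert: |\n" + block(server_cert)
    )


if __name__ == "__main__":
    # Hypothetical file names in the current directory.
    with open("client.crt") as crt, open("client.key") as key, open("server.crt") as srv:
        print(lxd_credentials_yaml("dwellir", "admin", crt.read(), key.read(), srv.read()))
```

The resulting file would then be passed to `juju add-credential dwellir -f credentials.yaml`. The `server-cert` value is the LXD server certificate.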

Or

  1. Added the lxd remote with “lxc remote add…”
  2. juju add-cloud
  3. Ran juju autoload-credentials

Or

  1. juju add-credential

A combination of the methods above has been used during the two years we have had these clouds, since we don’t fully know which is better and we don’t have a good strategy for managing credentials. Also, different users might use different methods.

We don’t even know WHICH certificate is referenced. LXD cert? Controller certs? User certs? There are so many and juju doesn’t give any clues.

Hey @erik-lonroth,

I have filed the following bug to track this better: Bug #2003135 “x509 Certificate Validation For LXD Clouds and Cre...” (Bugs : juju).

As stated in our call last night, I will investigate the lxd client code and figure out under what circumstances it does not skip the IP address checks in the certificate.

tlm


I’m glad there is a bug, and just let us know what we can do to support the debugging session.

@tlm, I have updated the bug with more information, as the problem also remains after upgrading from 2.9.37 to 2.9.38.