Juju controller complains about lxd certificates

I’m having issues with a juju controller that suddenly has problems connecting to 2 separate lxd hosts.

It complains a lot about certificates (/var/log/juju/logsink.log):

11ca14fd-f66a-43aa-8644-06f7f70de9a0: machine-0 2022-03-16 22:27:18 ERROR juju.provider.lxd environ_instance.go:35 failed to get instances from LXD: Get "https://192.168.111.4:8443/1.0/containers?project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4 
11ca14fd-f66a-43aa-8644-06f7f70de9a0: machine-0 2022-03-16 22:27:18 ERROR juju.worker.dependency engine.go:693 "instance-poller" manifold worker returned unexpected error: Get "https://192.168.111.4:8443/1.0/containers?project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4 
11ca14fd-f66a-43aa-8644-06f7f70de9a0: machine-0 2022-03-16 22:27:20 ERROR juju.provider.lxd environ_instance.go:35 failed to get instances from LXD: Get "https://192.168.111.4:8443/1.0/containers?project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4 
11ca14fd-f66a-43aa-8644-06f7f70de9a0: machine-0 2022-03-16 22:27:20 ERROR juju.worker.dependency engine.go:693 "instance-poller" manifold worker returned unexpected error: Get "https://192.168.111.4:8443/1.0/containers?project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4 
409902da-6d5f-4649-83ea-61bc27b7ad5b: machine-0 2022-03-16 22:27:24 ERROR juju.worker.dependency engine.go:693 "broker-tracker" manifold worker returned unexpected error: cannot load machine machine-0 from state: not authorized 

The machine-0 above is the controller machine.

The 192.168.111.4 is the first lxd host and the second is 192.168.111.2. They are 2 separate clouds on the same controller.
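To see which endpoints are being rejected and which addresses the presented certificate actually covers, the x509 lines can be tallied. The following is a minimal Python sketch, assuming the message format shown in the log excerpt above; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the Go TLS message seen in the log excerpt, e.g.
# "x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4"
X509_RE = re.compile(
    r"x509: certificate is valid for (?P<sans>.+?), not (?P<target>[0-9A-Fa-f.:]+)"
)


def parse_x509_mismatch(line):
    """Return (addresses the cert is valid for, address that was dialled), or None."""
    match = X509_RE.search(line)
    if match is None:
        return None
    sans = [s.strip() for s in match.group("sans").split(",")]
    return sans, match.group("target")


def count_mismatches(lines):
    """Count x509 mismatch errors per dialled address."""
    counts = Counter()
    for line in lines:
        parsed = parse_x509_mismatch(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts
```

Feeding it the lines of /var/log/juju/logsink.log would give a per-address count, which shows at a glance whether both lxd hosts are affected or only one.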

I can add models, but when I add machines, this error shows up in juju status:

failed to start machine 0 (Get "https://192.168.111.4:8443/1.0/images/aliases/juju%2Ffocal%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4), retrying in 10s (10 more attempts)

I fear that the lxd snap on the lxd hosts has replaced its server certificate, but I really don’t know.

Could this also be related to credential issues from “juju update-credential” at some stage?

Really need some help here.

Here are some more details:

$ juju update-credential dwellir5
This operation can be applied to both a copy on this client and to the one on a controller.
Do you want to update credential "" on cloud "dwellir5" on:
    1. client only (--client)
    2. controller "dwellir-sodertalje" only (--controller dwellir-sodertalje)
    3. both (--client --controller dwellir-sodertalje)
Enter your choice, or type Q|q to quit: 2
Credential valid for:
  foo1
Credential invalid for:
  rpc-5:
    Get "https://192.168.111.4:8443/1.0": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4
Failed models may require a different credential.
Use ‘juju set-credential’ to change credential for these models before repeating this update.

… so I ran “juju update-credential” a few more times and got rid of the above message (for a while?).

I managed to do “juju add-model foo3 dwellir3/sodertalje” and even managed to deploy tiny-bash, but now I’m back at this after trying to “juju add-machine”.

So I re-ran “juju update-credential”, at which point the model came alive.

… and then the credential became invalid again. (Suspended since cloud credential is not valid)

It’s just like the controller “forgets” about my credential?

Hey @erik-lonroth,

Thanks for bringing this to our attention. I have done a fair amount of investigation on our side, and it does appear to be an issue with a recent lxd upgrade. One of my local testing lxd VMs looks to be having a similar problem.

I can see with openssl s_client that the certificate doesn’t look correct.
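For anyone who wants to repeat this kind of check without openssl, here is a minimal Python sketch that fetches whatever certificate an endpoint presents (without verifying it) and computes its SHA-256 fingerprint. The function names are hypothetical, and port 8443 is simply taken from the URLs in the logs above. The result could then be compared with the fingerprint the lxd host reports for itself, for example in the output of lxc info (an assumption about where to look, not something confirmed in this thread).

```python
import hashlib
import ssl


def fingerprint_from_pem(pem):
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host, port=8443):
    """Fetch the certificate the server presents, without verifying it,
    and return its SHA-256 fingerprint."""
    pem = ssl.get_server_certificate((host, port))
    return fingerprint_from_pem(pem)
```

Running served_fingerprint("192.168.111.4") from the controller machine and again from a client where lxc still works would show whether both see the same certificate.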

I am going to get some more information and talk to the LXD folks. In the meantime would you mind confirming for me the version of lxd in both clouds?

I will be back soon with some further information for you. Please feel free to ping me on chat.charmhub.io if you want to have a quicker conversation.

Ta tlm

Yes, please. @joakimnyman is also hit by this and our operations are down.

How do I test this in our controller?

@stgraber This appears to be correlated with the Snap update to LXD 4.24.

Any insights as to what might explain this behaviour?

I checked the LXD versions while I was with Erik and he is not on the latest LXD.

I’m not aware of any LXD 4.24 change that would explain the error.

This kind of error normally suggests that you’re not passing a server certificate on connection, or at least not the correct one. It can also happen if you’re somehow talking to the wrong LXD (IP changed). Unfortunately, Go TLS isn’t particularly great at showing a relevant error.
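To make the distinction concrete, here is a simplified Python model of the two ways a client can decide to trust a server: by a pinned copy of the server certificate, or by checking the dialled address against the addresses listed in the certificate. It is only a conceptual sketch with made-up function names, and it does not reproduce how Go, LXD, or Juju actually implement the check.

```python
import hashlib
import ipaddress


def name_check(cert_sans, target):
    """True if the dialled IP address appears among the certificate's addresses."""
    want = ipaddress.ip_address(target)
    for san in cert_sans:
        try:
            if ipaddress.ip_address(san) == want:
                return True
        except ValueError:
            continue  # non-IP entries are ignored in this IP-only sketch
    return False


def pinned_check(served_der, pinned_der):
    """True if the served certificate is byte-for-byte the pinned one."""
    return hashlib.sha256(served_der).digest() == hashlib.sha256(pinned_der).digest()


def verify(served_der, cert_sans, target, pinned_der=None):
    """Use the pinned certificate when one is supplied, else fall back to names."""
    if pinned_der is not None:
        return pinned_check(served_der, pinned_der)
    return name_check(cert_sans, target)
```

In this model, a certificate that lists only 127.0.0.1 and ::1 is accepted when the client holds a matching pinned copy, but is rejected for 192.168.111.4 as soon as the pinned copy is missing and the client falls back to the address check.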

I had a shared session with @tlm and supplied as much information as I could.

It does seem to me that the lxd host responds correctly, since I can use the lxc client with my original certs and the lxd host happily complies.

It makes no sense whatsoever to me why the juju controller is barfing about the remote lxd. Also, I’m not competent enough to debug it, so I’m super grateful for the assistance from @tlm and @stgraber.

Also @joakimnyman is capable of assisting as he is hit by the same issue.