From memory, the LXD client code does not use the IP addresses baked into the certificate as part of client-to-server connection validation. However, when it receives a certificate that it wasn't expecting, it does perform this check.
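If it helps while gathering information, you can check which addresses a certificate actually covers with something like the following (the path is the LXD snap's default server certificate location; adjust for your setup):

```
# List the Subject Alternative Names baked into a server certificate
openssl x509 -in /var/snap/lxd/common/lxd/server.crt -noout -text | grep -A1 "Subject Alternative Name"
```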
I think the best place to start with this issue is to look at how you made the credentials for the cloud that Juju is using. Would you please provide as much detail as possible about how you packaged the LXD credentials and uploaded them to Juju for the model? Information on where you obtained the certificate files for the user and the server, and which keys they were plugged into, will be very helpful.
This is the best starting point to figure out the problem and we can arrange a phone call from there if need be.
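For reference, a certificate-type LXD credential in ~/.local/share/juju/credentials.yaml is usually structured roughly like the sketch below (the cloud and credential names are made up and the PEM bodies are placeholders). Knowing which of your files ended up in client-cert, client-key and server-cert is the key piece of information:

```
credentials:
  my-lxd-cloud:              # cloud name (example)
    my-credential:           # credential name (example)
      auth-type: certificate
      client-cert: |
        -----BEGIN CERTIFICATE-----
        ... client certificate PEM ...
        -----END CERTIFICATE-----
      client-key: |
        -----BEGIN PRIVATE KEY-----
        ... client private key PEM ...
        -----END PRIVATE KEY-----
      server-cert: |
        -----BEGIN CERTIFICATE-----
        ... LXD server/cluster certificate PEM ...
        -----END CERTIFICATE-----
```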
A combination of the methods above has been used over the two years we have had them, since we don't fully know which is better and we don't have a good strategy for managing credentials. Different users might also have used different methods.
We don't even know WHICH certificate is being referenced. The LXD cert? The controller certs? The user certs? There are so many, and Juju doesn't give any clues.
As stated on our call last night, I will do some investigation into the LXD client code and figure out under what circumstances it does not skip the IP address checks in the certificate.
I will add my two cents for Juju 3.2. The only approach that worked for me was to add an LXD remote (lxc remote add) and then run juju autoload-credentials.
I observed some weird behaviour (in my case I am using a 3-node LXD cluster). When I added my LXD cloud using the traditional juju add-cloud, I observed that for some reason Juju was adding the /var/snap/lxd/common/lxd/server.crt from the VM that the juju snap runs on (for the sake of explanation, let us call it juju-client), in spite of being given different IP addresses during initialization. That machine also has LXD installed, so Juju found a localhost LXD cloud and uses its server.crt for every new LXD cloud added with juju add-cloud.
I tried to manually remove the credentials using juju remove-cloud, but it did not resolve the problem, as Juju kept adding the server cert from the local LXD snap instead of using the address provided during the add process.
When I instead used lxc remote add to add my cluster (the second approach described above) and then ran juju autoload-credentials, it took /var/snap/lxd/common/lxd/cluster.crt and correctly copied the client certs from /var/snap/lxd/common/lxc of the local LXD snap, plus the cluster.crt from one of my cluster members, probably the one whose address I specified during lxc remote add.
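For reference, the sequence that ended up working for me was roughly the following (the remote name and address are examples, not my real ones):

```
# Add the cluster as an LXD remote on the client machine
lxc remote add my-cluster https://<cluster-member-ip>:8443

# Let Juju pick up the remote and its certificates
juju autoload-credentials
```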
Of course, one could argue that you can just copy the certs from one section of ~/.local/share/juju/credentials.yaml to another, but IMHO it should not have to work like that.
I will have a quick dig today based on the information provided. @hypeitnow I may need some more information from you as not everything above makes a lot of sense but I will see how I go first.
I have had a look at this today from both the snap and a fresh build of Juju and I can’t replicate what is being talked about from your end.
Adding a remote LXD cloud to Juju never pulls in the local LXD server certificate, for either the interactive or the certificate method.
Would you be able to provide more information on what you are doing and seeing? If you would like to share your credentials.yaml file as well, we can take a look. But more importantly, the steps to reproduce from your end will help a lot.
You can use the unix script command to record your Juju session. Please wipe any private keys from the files, or you can send the data to me directly on Mattermost instead.
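Something like the following captures the whole session to a file you can attach (the file name is just an example):

```
# Record the terminal session, run the reproduction steps, then exit
script juju-lxd-repro.log
juju add-cloud ...        # your reproduction steps go here
exit
# juju-lxd-repro.log now contains everything typed and printed
```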
We have the same behavior as @hypeitnow, where Juju would upload the client's local LXD server.crt instead of the given server cert. However, this problem occurs only for some clouds, so I looked for any differences.
@tlm - this is causing much pain for us as it stops us from adding/removing instances without using lxc explicitly etc. We also don’t know how to get out of the situation. We are really in need of getting this resolved…
But we still have this in the controller juju debug-log for all clouds:
machine-0: 11:27:13 ERROR juju.provider.lxd failed to get instances from LXD: Get "https://192.168.111.4:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4
machine-0: 11:27:13 ERROR juju.worker.dependency "instance-poller" manifold worker returned unexpected error: Get "https://192.168.111.4:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4
Also immediately after restarting the juju service in the controller, juju status shows:
Machine State Address Inst id Series AZ Message
0 error 192.168.111.64 juju-0de9a0-0 focal cannot upgrade machine's lxd profile: 0: Get "https://192.168.111.2:8443/1.0/instances/juju-0de9a0-0": x509: certific...
But after a while it resolves automatically.
Going through all the credentials in the DB with db.cloudCredentials.find(), everything looks correct.
So where is this certificate that the controller is trying to use??
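One way to check is to compare the fingerprint of the certificate the LXD endpoint actually presents with the fingerprint of the server-cert stored for the credential (the address and file name below are just examples from our setup):

```
# Fingerprint of the certificate the LXD server actually presents
openssl s_client -connect 192.168.111.4:8443 </dev/null 2>/dev/null \
  | openssl x509 -noout -fingerprint -sha256

# Fingerprint of the server-cert stored for the credential
# (after pasting the PEM from credentials.yaml or the DB into a file)
openssl x509 -in stored-server-cert.pem -noout -fingerprint -sha256
```

If the two differ, the credential is holding a certificate for a different LXD server than the one being contacted.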
@tlm I am sorry it took so long, but I was buried with work in another project.
I will send you the file in mattermost.
Can you give me the workspace address?
I have a controller running locally in a container.
It has a credential to access my local LXD host.
This is my typical development setup and it works without any problems.
Now I launch an LXC container and configure it as an LXD host.
I add the new container as a Cloud to the controller.
I add a credential for it.
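Roughly, the steps look like this (container name, image and controller name are examples, and trust setup is omitted since it depends on the LXD version):

```
# Launch a container and turn it into a standalone LXD host
lxc launch ubuntu:22.04 lxd-remote -c security.nesting=true
lxc exec lxd-remote -- lxd init --auto --network-address '[::]' --network-port 8443
# (plus whatever trust setup your LXD version needs: trust password or token)

# Register it with the existing controller
juju add-cloud --controller <controller-name>        # interactive; endpoint = the container's IP
juju add-credential lxd-remote --controller <controller-name>
```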
Now I start to get this ERROR message in the controller:
machine-0: 11:18:51 ERROR juju.provider.lxd failed to get instances from LXD: Get "https://10.207.153.1:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 10.207.153.1
machine-0: 11:18:51 ERROR juju.worker.dependency "instance-poller" manifold worker returned unexpected error: Get "https://10.207.153.1:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 10.207.153.1
The IP 10.207.153.1 is the address of my local LXD host, which worked without any problem before.
The IP of the new LXC container set up as a LXD host is 10.207.153.95.
So it seems like as soon as I add a new Cloud, the others start to fail. I think I have seen this behavior in our production environment as well, but since we have quite a lot of clouds it has been difficult to verify just by looking at the debug-log.
And with existing clouds, running update-credential on a cloud fixes that cloud but breaks all the other clouds.
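The command I use for that is basically (names are placeholders):

```
juju update-credential <cloud-name> <credential-name> --controller <controller-name>
```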
@tlm I’m happy to give you a demo of this if you have the time.
For those following along with this bug: I have been able to reproduce it with the help of @joakimnyman.
We will keep looking into the why. For the moment it looks like Juju is sending the correct credentials to each individual cloud, and the error is happening further down in the LXD client code.