Can't deploy - controller seems not to like certificates

Hi @erik-lonroth and happy new year.

From memory, the LXD client code does not use the IP addresses baked into the certificate as part of the client-to-server connection validation. However, when it receives a certificate it wasn't expecting, it does perform this check.

I think the best place to start with this issue is to look at how you made the credentials for the cloud that Juju is using. Would you please provide as much detail as possible about how you packaged the LXD credentials and uploaded them to Juju for the model? Information on where you obtained the certificate files for the user and the server, and which key values they were plugged into, will be very helpful.

This is the best starting point to figure out the problem and we can arrange a phone call from there if need be.

Ta Tom


The credentials were added in three different ways.

  1. Extracted the LXD server.crt
  2. Extracted client.crt and client.key
  3. Manually added to credentials.yaml
  4. Ran juju add-credential

Or

  1. Added the lxd remote with “lxc remote add…”
  2. juju add-cloud
  3. Ran juju autoload-credentials

Or

  1. juju add-credential

A combination of the methods above has been used during the two years we have had them, since we don't fully know which is better and we don't have a good strategy for managing credentials. Also, different users might use different methods.
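For reference, the manual variant (steps 1-4 in the first list) ends up looking roughly like this in ~/.local/share/juju/credentials.yaml; the cloud and credential names are only illustrative and the exact keys should be double-checked against your Juju version:

credentials:
  my-lxd-cloud:            # illustrative cloud name
    my-credential:         # illustrative credential name
      auth-type: certificate
      server-cert: |       # contents of the remote LXD server.crt
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      client-cert: |       # contents of your client.crt
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      client-key: |        # contents of your client.key
        -----BEGIN PRIVATE KEY-----
        ...
        -----END PRIVATE KEY-----

As far as I understand, the same structure can also be passed to juju add-credential with -f <file> instead of editing credentials.yaml by hand.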

We don’t even know WHICH certificate is referenced. LXD cert? Controller certs? User certs? There are so many and juju doesn’t give any clues.

Hey @erik-lonroth,

I have filed the following bug to track this better: Bug #2003135 "x509 Certificate Validation For LXD Clouds and Cre..." (Bugs : juju).

As stated in our call last night, I will investigate the LXD client code and figure out under what circumstances it does not skip IP address checks in the certificate.

tlm


I'm glad there is a bug filed; just let us know what we can do to support the debugging session.

I have updated the bug with information, @tlm, as the problem remains even after upgrading from 2.9.37 -> 2.9.38.

Hi all

I will add my two cents for Juju 3.2. The only way that was successful for me was to add the LXD remote and then run juju autoload-credentials.

I observed a weird behaviour (in my case I am using a 3-node LXD cluster). When I added my LXD cloud using the traditional juju add-cloud, I noticed that for some interesting reason Juju was adding the /var/snap/lxd/common/lxd/server.crt from the VM that the Juju snap is running on (for the sake of explanation let us call it juju-client), in spite of being given different IP addresses during initialization. That machine also has LXD installed, so Juju found a localhost LXD cloud and used its server.crt for every new LXD cloud added by juju add-cloud.

I tried to manually remove the credentials using juju remove-cloud, but it did not resolve the problem, as Juju kept adding the server cert from the local LXD snap instead of using the address provided during the add process.
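One way to confirm which server certificate Juju actually stored is to compare fingerprints; a rough sketch (paths and the cluster address are illustrative):

# fingerprint of the local LXD snap's own server cert
openssl x509 -in /var/snap/lxd/common/lxd/server.crt -noout -fingerprint -sha256

# fingerprint of the certificate the remote cluster actually presents
openssl s_client -connect <cluster-address>:8443 </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256

# compare both against the server-cert block stored for that cloud in ~/.local/share/juju/credentials.yaml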

When I, on the other hand, used lxc remote add to add my cluster (the second approach described above) and then ran juju autoload-credentials, it took /var/snap/lxd/common/lxd/cluster.crt and correctly copied the client certs from /var/snap/lxd/common/lxc of the local LXD snap, using the cluster.crt from one of my cluster members, probably the one whose address I specified during lxc remote add.
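For completeness, the sequence that worked was roughly this (remote name and address are illustrative):

lxc remote add mycluster <cluster-address>   # accept the cluster certificate when prompted
juju add-cloud                               # interactive; point it at the cluster address
juju autoload-credentials                    # offers the detected LXD client certs and remote cert for import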

Of course one could argue that you can just copy the certs from one section of ~/.local/share/juju/credentials.yaml but IMHO it should not be like that.

Regards

Mateusz

@tlm this might be what we are seeing when our controller complains about 127.0.0.1, because it's a similar situation. @joakimnyman

Thanks @hypeitnow and @erik-lonroth,

I will have a quick dig today based on the information provided. @hypeitnow, I may need some more information from you, as not everything above makes complete sense to me yet, but I will see how I go first.

Hi @hypeitnow,

I have had a look at this today from both the snap and a fresh build of Juju and I can’t replicate what is being talked about from your end.

Adding a remote LXD cloud to Juju never pulls in the local LXD server certificate, for either the interactive or the certificate method.

Would you be able to provide more information on what you are doing and seeing? If you would like to share your credentials.yaml file as well, we can take a look. But most importantly, the steps to reproduce from your end will help a lot.

You can use the Unix script command to record your Juju session. Please wipe any private keys from the files, and/or you can send the data to me directly on Mattermost.

Ta tlm

We have the same behavior as @hypeitnow, where Juju would upload the client's local LXD server.crt instead of the given server cert. However, this problem occurs only for some clouds, so I looked for differences.

Working cloud:

defined: public
type: lxd
auth-types: [certificate]
endpoint: https://192.168.111.4:8443
credential-count: 1
regions:
  sodertalje: {}

Not working cloud:

defined: public
type: lxd
auth-types: [certificate]
credential-count: 1
regions:
  sodertalje:
    endpoint: https://192.168.111.6:8443

Note the difference in where the endpoint is set. For the "not working" cloud I had to pass --region sodertalje for it to work. Example:

juju update-credential cloud9 --region sodertalje cloud9-credential
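An alternative to passing --region every time would presumably be to fix the cloud definition so the endpoint sits at the cloud level; a sketch, assuming juju update-cloud is available in your Juju version and using an illustrative file name:

# cloud9.yaml (illustrative) - the endpoint moved up to the cloud level
clouds:
  cloud9:
    type: lxd
    auth-types: [certificate]
    endpoint: https://192.168.111.6:8443
    regions:
      sodertalje: {}

juju update-cloud cloud9 -f cloud9.yaml   # may additionally need --client/--controller depending on version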


@tlm - this is causing us much pain, as it stops us from adding/removing instances without using lxc explicitly, etc. We also don't know how to get out of this situation. We really need to get this resolved…

But we still have this in the controller juju debug-log for all clouds:

machine-0: 11:27:13 ERROR juju.provider.lxd failed to get instances from LXD: Get "https://192.168.111.4:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4
machine-0: 11:27:13 ERROR juju.worker.dependency "instance-poller" manifold worker returned unexpected error: Get "https://192.168.111.4:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4
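That error says the certificate being validated only lists the loopback addresses (127.0.0.1 and ::1) in its subject alternative names. One way to check which addresses a given server certificate actually covers (path illustrative):

openssl x509 -in /var/snap/lxd/common/lxd/server.crt -noout -text | grep -A1 'Subject Alternative Name'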

Also immediately after restarting the juju service in the controller, juju status shows:

Machine  State  Address         Inst id        Series  AZ  Message
0        error  192.168.111.64  juju-0de9a0-0  focal       cannot upgrade machine's lxd profile: 0: Get "https://192.168.111.2:8443/1.0/instances/juju-0de9a0-0": x509: certific...

But after a while it resolves automatically.

Going through all the credentials in the DB with db.cloudCredentials.find() everything looks correct.

So where is this certificate that the controller is trying to use??
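If your Juju version has it, juju show-credential should show what the controller actually holds for each cloud (names are placeholders):

juju show-credential <cloud-name> <credential-name> --show-secrets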

@tlm I am sorry it took so long, but I was buried with work in another project. I will send you the file in mattermost. Can you give me the workspace address?

Thank you for your help

I have been able to reproduce this locally.

I have a controller running locally in a container.
It has a credential to access my local LXD host.
This is my typical development setup and it works without any problems.

Now, I launch a LXC container and configure it as a LXD host.
I add the new container as a Cloud to the controller.
I add a credential for it.
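Roughly, the setup looks like this (container name, image, and flags are illustrative; the exact lxd init options may differ):

# new container that will act as a second LXD host
lxc launch ubuntu:22.04 lxd-host
lxc exec lxd-host -- lxd init --auto --network-address='[::]' --network-port=8443

# register the new host with the existing controller
juju add-cloud                    # interactive; type lxd, endpoint https://<container-ip>:8443
juju add-credential <cloud-name>  # certificate auth, using the new host's server cert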

Now I start to get this ERROR message in the controller

machine-0: 11:18:51 ERROR juju.provider.lxd failed to get instances from LXD: Get "https://10.207.153.1:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 10.207.153.1
machine-0: 11:18:51 ERROR juju.worker.dependency "instance-poller" manifold worker returned unexpected error: Get "https://10.207.153.1:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 10.207.153.1

The IP 10.207.153.1 belongs to my local LXD host, which worked without any problem before. The IP of the new LXC container set up as an LXD host is 10.207.153.95.

So it seems like as soon as I add a new cloud, the others start to fail. I think I have seen this behavior in our production environment as well, but since we have quite a few clouds it has been difficult to verify just from looking at the debug-log.

And with existing clouds, running update-credential on a cloud fixes that cloud but breaks all the other clouds.

@tlm I’m happy to give you a demo of this if you have the time.


Hey @hypeitnow

Our public Mattermost is https://chat.charmhub.io/

Ta tlm

Hey @joakimnyman,

This is fantastic news. A demo would be much appreciated to help me get an understanding of the repro case.

Would you like to ping me in the public Juju Mattermost so we can set up a time?

In the meantime I’ll digest the above and see if I can set up an environment on my end.

Ta tlm

For those following along with this bug: I have been able to reproduce it with the help of @joakimnyman.

We will keep looking into the why. For the moment it looks like Juju is sending the correct credentials to each individual cloud and the error is happening further down in the LXD client code.

Will set up some more tests to drill down on this.


@tlm do you have an update on this?

Hi All,

Exciting news. We have found the culprit behind this bug and have proposed a PR to fix the issue here: https://github.com/juju/juju/pull/15416

I will get this rolled out for all Juju releases from 2.9 and above over the coming weeks.


We are looking forward to seeing this land, as it has terrorized us for months.