Can't deploy - controller seems not to like certificates

erik-lonroth · 13 January 2023 13:33

I have a multi-cloud controller (dwellir-sodertalje) running 4 remote lxd clouds with multiple users.

It was working perfectly fine until yesterday. However, now we’ve started having issues:

Controller dwellir-sodertalje 2.9.35 seems unable to deploy or remove machines.
The following sequence renders the error:

juju add-model foobar dwellir
juju deploy tiny-bash

tail -f /var/log/juju/models/admin-foobar-d84e4b.log

2023-01-13 12:55:21 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:21 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Post "https://192.168.111.2:8443/1.0/instances?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (10 more attempts)
2023-01-13 12:55:31 INFO juju.worker.provisioner provisioner_task.go:1348 trying machine 0 StartInstance in availability zone dwellir1
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1363 machine 0 failed to start in availability zone dwellir1: Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2
2023-01-13 12:55:31 WARNING juju.worker.provisioner provisioner_task.go:1405 failed to start machine 0 (Get "https://192.168.111.2:8443/1.0/images/aliases/juju%2Fbionic%2Famd64?project=default": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.2), retrying in 10s (9 more attempts)
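
As a rough aid for reading logs like the one above, the following sketch condenses the repeated x509 warnings into one line per mismatch: the address the controller dialled, and the addresses the presented certificate is valid for. The function name x509_mismatches is made up for illustration, and the pattern assumes the exact error wording shown in this log.

```shell
# Summarise x509 SAN mismatches from Juju log lines read on stdin.
# Prints one line per distinct (dialled address, certificate SAN list) pair.
x509_mismatches() {
    sed -n 's/.*x509: certificate is valid for \(.*\), not \([0-9a-fA-F.:]*\).*/\2 <- cert valid for: \1/p' |
        sort -u
}

# Usage (log path taken from this post):
#   x509_mismatches < /var/log/juju/models/admin-foobar-d84e4b.log
```

If several clouds are affected, each dialled address shows up on its own line, which makes it easier to see whether they are all being handed the same certificate.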

If I bootstrap a new controller - it all works using that new controller, so it seems that LXD is all fine and I can use lxd without any issues.

I have had this issue before and @tlm helped us look at this, and it seemed to be resolved by us adding and removing certificates. But we don’t know exactly what we did in the end to resolve it then. This issue has been an ongoing situation and we really need help resolving this as we are now suffering from not being able to work with the controller.

We are prepared to migrate into a new controller if needed, but we need assistance doing such a move as this is over our heads to resolve.

@tmihoc can you help us ping in someone who could pull us loose?

tmihoc · 13 January 2023 13:45

@tlm, @hmlanigan , @manadart ?

erik-lonroth · 13 January 2023 13:56

When I inspect the server.crt for the LXD host, I can see that the certificate is indeed valid for:

root@dwellir5:~# openssl x509 -noout -text -in /var/snap/lxd/common/lxd/server.crt

X509v3 Subject Alternative Name: DNS:dwellir5, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
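
One way to test a certificate against the address a client actually dials is openssl's -checkip option, sketched below. The helper name cert_covers_ip is hypothetical; the path and IP in the usage comment are the ones from this post.

```shell
# Succeeds (exit 0) if the PEM certificate in $1 lists IP address $2
# in its Subject Alternative Name extension, fails otherwise.
cert_covers_ip() {
    openssl x509 -noout -checkip "$2" -in "$1" |
        grep -q 'does match certificate'
}

# Usage:
#   cert_covers_ip /var/snap/lxd/common/lxd/server.crt 192.168.111.4 \
#       && echo "covered" || echo "NOT covered"
```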

Perhaps this is a lead for what goes on here? @stgraber

Running the following command from my remote client against 192.168.111.4 (dwellir5) works just fine:

lxc list dwellir5:

So it seems the lxc client on my local machine is all good with the server.crt, whereas the juju controller doesn’t like the same certificate?

erik-lonroth · 13 January 2023 14:16

I have read here https://github.com/lxc/lxd/issues/10286 and compared my server.crt with that of

root@dwellir5:~# lxc query /1.0 | jq -r .environment.certificate

… and they are the same.
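
The comparison above can also be done by fingerprint rather than by eye. This is only a sketch: the helper names and the /tmp/api.crt path are assumptions, and it presumes openssl is available on the LXD host.

```shell
# Print the SHA-256 fingerprint of the PEM certificate in $1.
cert_fingerprint() {
    openssl x509 -noout -fingerprint -sha256 -in "$1" | cut -d= -f2
}

# Succeeds (exit 0) if the two PEM files contain the same certificate.
same_cert() {
    [ "$(cert_fingerprint "$1")" = "$(cert_fingerprint "$2")" ]
}

# Usage:
#   lxc query /1.0 | jq -r .environment.certificate > /tmp/api.crt
#   same_cert /tmp/api.crt /var/snap/lxd/common/lxd/server.crt && echo same
```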

Would I need to re-create the server certs to reflect the lxd-remote IPs?

tlm · 15 January 2023 23:26

Hi @erik-lonroth and happy new year.

From my memory the lxd client code does not use the ip addresses baked inside of the certificate as part of the client to server connection validation. However when it receives a certificate that it wasn’t expecting then it performs this check.

I think the best place to start with this issue is to look at how you made the credentials for the cloud that Juju is using. Would you please provide as much detail as possible about how you packaged the lxd credentials and uploaded them to Juju for the model? Information on where you obtained the certificate files from for the user and the server and what key values they were plugged into will be very helpful.

This is the best starting point to figure out the problem and we can arrange a phone call from there if need be.

Ta Tom

erik-lonroth · 16 January 2023 23:37

The credentials were added in 3 different ways.

Extracted the lxd server.crt
Extracted client.crt and client.key
Manually added to credentials.yaml
Ran juju add-credentials

Or

Added the lxd remote with “lxc remote add…”
juju add-cloud
Ran juju autoload-credentials

Or

juju add-credentials

A combination of the methods above has been used during the two years we have had them, since we don’t fully know which is better and we don’t have a good strategy for managing credentials. Also, different users might use different methods.

We don’t even know WHICH certificate is referenced. LXD cert? Controller certs? User certs? There are so many and juju doesn’t give any clues.
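
To get a first clue about which certificate is involved, one option is to list every certificate embedded in the client's credentials.yaml together with the names it is valid for. The sketch below assumes the certificates are stored as multi-line PEM blocks (not as single-line quoted strings) and that openssl 1.1.1 or newer is installed; list_embedded_certs is a hypothetical helper name. Private keys are not matched, so they are never printed.

```shell
# Print subject and SANs for every PEM certificate embedded in file $1,
# e.g. ~/.local/share/juju/credentials.yaml.
list_embedded_certs() {
    tmp=$(mktemp -d) || return 1
    # Strip the YAML indentation and write each certificate to its own file.
    sed 's/^[[:space:]]*//' "$1" | awk -v dir="$tmp" '
        /-----BEGIN CERTIFICATE-----/ { n++; out = sprintf("%s/cert%03d.pem", dir, n) }
        out != ""                     { print > out }
        /-----END CERTIFICATE-----/   { close(out); out = "" }
    '
    for f in "$tmp"/cert*.pem; do
        [ -e "$f" ] || continue
        echo "== ${f##*/}"
        openssl x509 -noout -subject -ext subjectAltName -in "$f"
    done
    rm -rf "$tmp"
}

# Usage:
#   list_embedded_certs ~/.local/share/juju/credentials.yaml
```

The certificates are numbered in the order they appear in the file, so the output can be matched back to the cloud and credential they belong to.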

tlm · 17 January 2023 22:51

Hey @erik-lonroth,

I have filed the following bug to track this better: Bug #2003135 “x509 Certificate Validation For LXD Clouds and Cre...” (Bugs : juju).

As stated in our call last night I will do some investigation into the lxd client code and figure out under what circumstances it does not skip ip address checks in the certificate.

tlm

erik-lonroth · 18 January 2023 18:05

I’m glad there is a bug and just let us know what we can do to support in the debugging session.

erik-lonroth · 18 January 2023 19:53

@tlm, I have updated the bug with information, as the problem remains after the upgrade from 2.9.37 to 2.9.38.

hypeitnow · 19 February 2023 13:39

Hi all

I will add my two cents from juju 3.2. The only approach that worked for me was to add the lxd remote and then run juju autoload-credentials.

I observed some weird behaviour (in my case I am using a 3-node LXD cluster). When I added my lxd cloud using the traditional juju add-cloud, juju added the /var/snap/lxd/common/lxd/server.crt from the VM that the juju snap is running on (for the sake of explanation let us call it juju-client), even though it was given different IP addresses during initialization. That machine also has lxd installed, so juju found a localhost lxd cloud and used its server.crt for every new lxd cloud added by juju add-cloud.

I tried to manually remove the credentials using juju remove-cloud, but that did not resolve the problem: juju kept adding the server cert from the local lxd snap instead of using the address provided when adding the cloud.

When, on the other hand, I added my cluster with lxc remote add (the second approach described above) and then ran juju autoload-credentials, it (correctly) copied the client certs from /var/snap/lxd/common/lxc of the local lxd snap and /var/snap/lxd/common/lxd/cluster.crt from one of my cluster members, probably the one whose address I specified during lxc remote add.

Of course one could argue that you can just copy the certs from one section of ~/.local/share/juju/credentials.yaml but IMHO it should not be like that.

Regards

Mateusz

erik-lonroth · 19 February 2023 16:16

@tlm this might be what we are seeing when our controller complains about 127.0.0.1, because it's a similar situation. @joakimnyman

tlm · 20 February 2023 23:09

Thanks @hypeitnow and @erik-lonroth,

I will have a quick dig today based on the information provided. @hypeitnow I may need some more information from you as not everything above makes a lot of sense but I will see how I go first.

tlm · 21 February 2023 04:02

Hi @hypeitnow,

I have had a look at this today from both the snap and a fresh build of Juju and I can’t replicate what is being talked about from your end.

Adding a remote LXD cloud to Juju never pulls in the local LXD server certificate, for either the interactive or the certificate method.

Would you be able to provide more information on what you are doing and seeing? If you would like to share your credentials.yaml file as well we can take a look. But more importantly the steps to reproduce from your end will help a lot.

You can use the unix script command to record your Juju session. Please wipe any private keys from the files; alternatively, you can send the data to me on Mattermost.

Ta tlm

joakimnyman · 2 March 2023 10:00

We have the same behavior as @hypeitnow, where it would upload the client's LXD server.crt instead of the given server cert. However, this problem occurs only for some clouds, so I looked for any differences.

Working cloud:

defined: public
type: lxd
auth-types: [certificate]
endpoint: https://192.168.111.4:8443
credential-count: 1
regions:
  sodertalje: {}

Not working cloud:

defined: public
type: lxd
auth-types: [certificate]
credential-count: 1
regions:
  sodertalje:
    endpoint: https://192.168.111.6:8443

Note the difference in where the endpoint is set: at the cloud level for the working cloud, and under the region for the non-working one. So for the “Not working cloud” I had to pass --region sodertalje for it to work. Example:

juju update-credential cloud9 --region sodertalje cloud9-credential
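
A quick way to spot which clouds define their endpoint under a region rather than at the top level is to look at the indentation of the endpoint key. The sketch below assumes YAML laid out like the two definitions above; endpoint_scope is a made-up helper name, and whether juju show-cloud prints exactly this layout on a given version is an assumption.

```shell
# Read a cloud definition (YAML) on stdin and report where each
# endpoint key sits: unindented = cloud level, indented = region level.
endpoint_scope() {
    awk '
        /^endpoint:/      { print "cloud-level endpoint: " $2 }
        /^[ \t]+endpoint:/ { print "region-level endpoint: " $2 }
    '
}

# Usage (cloud name from the example above):
#   juju show-cloud cloud9 | endpoint_scope
```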

erik-lonroth · 2 March 2023 10:23

@tlm - this is causing much pain for us as it stops us from adding/removing instances without using lxc explicitly etc. We also don’t know how to get out of the situation. We are really in need of getting this resolved…

joakimnyman · 2 March 2023 10:32

But we still have this in the controller juju debug-log for all clouds:

machine-0: 11:27:13 ERROR juju.provider.lxd failed to get instances from LXD: Get "https://192.168.111.4:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4
machine-0: 11:27:13 ERROR juju.worker.dependency "instance-poller" manifold worker returned unexpected error: Get "https://192.168.111.4:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 192.168.111.4

Also immediately after restarting the juju service in the controller, juju status shows:

Machine  State  Address         Inst id        Series  AZ  Message
0        error  192.168.111.64  juju-0de9a0-0  focal       cannot upgrade machine's lxd profile: 0: Get "https://192.168.111.2:8443/1.0/instances/juju-0de9a0-0": x509: certific...

But after a while it resolves automatically.

Going through all the credentials in the DB with db.cloudCredentials.find() everything looks correct.

So where is this certificate that the controller is trying to use??

hypeitnow · 2 March 2023 21:13

@tlm I am sorry it took so long, but I was buried with work in another project. I will send you the file in mattermost. Can you give me the workspace address?

Thank you for your help

joakimnyman · 3 March 2023 10:28

I have been able to reproduce this locally.

I have a controller running locally in a container.
It has a credential to access my local LXD host.
This is my typical development setup and it works without any problems.

Now, I launch an LXC container and configure it as an LXD host.
I add the new container as a Cloud to the controller.
I add a credential for it.

Now I start to get this ERROR message in the controller

machine-0: 11:18:51 ERROR juju.provider.lxd failed to get instances from LXD: Get "https://10.207.153.1:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 10.207.153.1
machine-0: 11:18:51 ERROR juju.worker.dependency "instance-poller" manifold worker returned unexpected error: Get "https://10.207.153.1:8443/1.0/instances?instance-type=container&project=default&recursion=1": x509: certificate is valid for 127.0.0.1, ::1, not 10.207.153.1

The IP 10.207.153.1 belongs to my local LXD host, which worked without any problem before. The IP of the new LXC container set up as an LXD host is 10.207.153.95.

So it seems that as soon as I add a new Cloud, the others start to fail. I think I have seen this behavior in our production environment as well, but since we have quite a few clouds it has been difficult to verify from just looking at the debug-log.

And with existing clouds, running update-credential on a cloud fixes that cloud but breaks all the other clouds.

@tlm I’m happy to give you a demo of this if you have the time.

tlm · 6 March 2023 06:17

Hey @hypeitnow

Our public Mattermost is https://chat.charmhub.io/

Ta tlm

tlm · 6 March 2023 06:19

Hey @joakimnyman,

This is fantastic news. A demo would be much appreciated to help me get an understanding of the repro case.

Would you like to ping me in the Juju public Mattermost so we can set up a time?

In the meantime I’ll digest the above and see if I can set up an environment on my end.

Ta tlm