Juju 3.6 TLS handshake timeout, preventing migration.

We are trying to upgrade our Juju 3.3 controllers to 3.6 and are running into a number of strange problems. Essentially, the target setup is this:

  • The controllers are hosted on Noble
  • We preinstantiate them and use manual/user@IP as the cloud
  • We also use external users from api.jujucharms.com/identity
  • The controllers run in LXC containers hosted on a three-machine cluster.

We can get the controller to bootstrap sometimes, but no matter what we do we cannot get it to authenticate external users, and we cannot get it to add models. All the failures report TLS errors of various kinds (handshake timeouts, invalid certificates, no certificates at all, etc.). For example, when we try to log in with an external user, the client gives us the following error message:

ERROR cannot log into "192.168.111.67:17070": cannot create macaroon: cannot add caveat checkers.Caveat{Condition:"need-declared username is-authenticated-user", Namespace:"", Location:"https://api.jujucharms.com/identity"}: cannot find public key for location "https://api.jujucharms.com/identity": Get "https://api.jujucharms.com/identity/discharge/info": net/http: TLS handshake timeout (unauthorized access)

Example of the juju bootstrap command we use:

juju_36 bootstrap manual/ubuntu@<ip-address> <controller-name> --config identity-url=https://api.jujucharms.com/identity

Test deploy on a controller that managed to get bootstrapped:

$ juju_36 deploy tiny-bash tiny-bash2 -v --debug

15:44:00 INFO  juju.cmd supercommand.go:56 running juju [3.6.2 87cae7505aee356eda90d98ae345e1c11eb26c72 gc go1.23.4]
15:44:00 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju_36/29493/bin/juju", "deploy", "tiny-bash", "tiny-bash2", "-v", "--debug"}
15:44:00 INFO  juju.juju api.go:86 connecting to API addresses: [192.168.108.38:17070]
15:44:00 DEBUG juju.api apiclient.go:1035 successfully dialed "wss://192.168.108.38:17070/model/e82530e0-894a-469e-8ece-1c2bbb604d7f/api"
15:44:00 INFO  juju.api apiclient.go:570 connection established to "wss://192.168.108.38:17070/model/e82530e0-894a-469e-8ece-1c2bbb604d7f/api"
15:44:00 INFO  juju.juju api.go:86 connecting to API addresses: [192.168.108.38:17070]
15:44:01 DEBUG juju.api apiclient.go:1035 successfully dialed "wss://192.168.108.38:17070/api"
15:44:01 INFO  juju.api apiclient.go:570 connection established to "wss://192.168.108.38:17070/api"
15:44:21 DEBUG juju.api monitor.go:35 RPC connection died
15:44:21 DEBUG juju.api monitor.go:35 RPC connection died
ERROR resolving with preferred channel: Post "https://api.charmhub.io/v2/charms/refresh": net/http: TLS handshake timeout
15:44:21 DEBUG cmd supercommand.go:549 error stack: 
resolving with preferred channel: Post "https://api.charmhub.io/v2/charms/refresh": net/http: TLS handshake timeout
github.com/juju/juju/cmd/juju/application/store.(*CharmAdaptor).ResolveCharm:68: 
github.com/juju/juju/cmd/juju/application/store.(*CharmAdaptor).ResolveBundleURL:85: 
github.com/juju/juju/cmd/juju/application/deployer.(*factory).repoBundleDeployer:201: 
github.com/juju/juju/cmd/juju/application/deployer.(*factory).GetDeployer:159: 
github.com/juju/juju/cmd/juju/application.(*DeployCommand).Run:855:

We have tried different Ubuntu Server versions (22.04 and 24.04), different container OS versions, different servers, and a different network. Everything fails in the same way.

We have been trying to figure this out for some time now and can’t get to the bottom of things. Does anyone have any ideas?

EDIT: At a later stage we did manage to log in with external users, but we still get this issue when deploying charms etc.

@wallyworld @erik-lonroth @awnns @jnsgruk


I can confirm that I also get this.

I can’t reproduce it on a local lxd cloud when the client is on the same host as the controller, but I can reproduce it on a controller on a remote lxd.

I also experience this kind of effect, where a TLS error seems to be involved:

https://asciinema.org/a/zqKpjpg2lpIeyVVFeZJHA6Ner

Hey @marcus and @erik-lonroth,

As discussed on Matrix here, it looks like it might be some issue with loading the state from 3.3 into 3.6. Is there a reason you need to copy the state over from 3.3 using the backup mechanism rather than recreating the necessary config on a new 3.6 controller manually, and then migrating to that?
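Roughly the shape of what I mean (controller and model names below are placeholders):

$ juju_36 bootstrap manual/ubuntu@<ip-address> new-36 --config identity-url=https://api.jujucharms.com/identity
  # recreate users, clouds, credentials and other config on the new controller by hand, then:
$ juju switch old-33
$ juju migrate <model-name> new-36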

We have been pretty dead in the water for about a week now, and we aren’t even close to figuring out what is going on here, let alone how to fix it. I was talking to @hallback, who has also seen “similar” issues this week, although I think that might be a separate problem.

Are we possibly facing multiple issues?

We need help to get unstuck.

This is the way we’ve been told to do it. However, the issue we describe is on a newly bootstrapped 3.6 controller where we haven’t restored anything from backup yet, so it is basically a fresh controller unrelated to our old one.

I have noticed something similar when just deploying stuff. Using Juju 3.6.2 on a microk8s cloud.

ubuntu@juju-4fa043-kraken-cos-1:~/mimir-bundle$ juju deploy ./bundle.yaml --trust
ERROR cannot deploy bundle: cannot resolve charm or bundle "mimir-coordinator-k8s": resolving with preferred channel: Post "https://api.charmhub.io/v2/charms/refresh": net/http: TLS handshake timeout
ubuntu@juju-4fa043-kraken-cos-1:~/mimir-bundle$ juju deploy ./bundle.yaml --trust
ERROR cannot deploy bundle: cannot resolve charm or bundle "mimir-coordinator-k8s": resolving with preferred channel: Post "https://api.charmhub.io/v2/charms/refresh": net/http: TLS handshake timeout
ubuntu@juju-4fa043-kraken-cos-1:~/mimir-bundle$ juju deploy ./bundle.yaml --trust
ERROR cannot deploy bundle: cannot resolve charm or bundle "mimir-coordinator-k8s": resolving with preferred channel: Post "https://api.charmhub.io/v2/charms/refresh": net/http: TLS handshake timeout

Seconds later, after a couple of retries, it suddenly works:

ubuntu@juju-4fa043-kraken-cos-1:~/mimir-bundle$ juju deploy ./bundle.yaml --trust
Located charm "mimir-worker-k8s" in charm-hub, channel latest/edge
Located charm "s3-integrator" in charm-hub, channel latest/edge
Executing changes:
- upload charm mimir-coordinator-k8s from charm-hub from channel edge with architecture=amd64
- upload charm mimir-worker-k8s from charm-hub from channel edge with architecture=amd64
- deploy application mimir-worker from charm-hub with 1 unit with edge using mimir-worker-k8s
  added resource mimir-image
- upload charm s3-integrator from charm-hub from channel edge with architecture=amd64
- deploy application s3-integrator from charm-hub with 1 unit with edge
- add relation mimir:mimir-cluster - mimir-worker:mimir-cluster
- add relation mimir:s3 - s3-integrator:s3-credentials
Deploy of bundle completed.

I haven’t investigated it further than noticing that api.charmhub.io resolves to four IPv4 addresses. Maybe one or more of these backends has some kind of issue.
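One way to narrow that down might be to pin the request to each address in turn and see whether the handshake completes against all of them (just a sketch; the path requested doesn’t matter, only whether the TLS handshake succeeds):

$ dig +short api.charmhub.io A
$ for ip in $(dig +short api.charmhub.io A); do
    curl --resolve api.charmhub.io:443:$ip -sS -o /dev/null \
         -w "$ip handshake completed in %{time_appconnect}s\n" https://api.charmhub.io/ \
      || echo "$ip failed"
  done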

It’s strange that the same issue is appearing in these very distinct parts of the codebase. Just to check, do you have an HTTP proxy set on the 3.6 controllers?

@aflynn we don’t have a proxy on our end. No.

@hallback - we get the same kind of error net/http: TLS handshake timeout

Curious, does running juju info <somecharmname> work? This runs on your local machine and not on the Noble instances where the Juju agents are running, but it still hits the Charmhub API endpoints. It might provide a clue.

We have a similar issue. It is not related to a migration, but occurs on a new controller running Juju 3.6.2. We are getting intermittent TLS handshake timeouts while downloading a new charm or bundle. The TCP connection gets established just fine, but after sending the client Hello we never receive the server’s TLS Hello packet back. There is no proxy, but there is a Palo Alto firewall in the path, although TLS decryption and IDS/IPS have been disabled for this traffic.
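For reference, this is roughly the kind of capture that shows it (interface name is an example):

$ sudo tcpdump -i eth0 -nn 'tcp port 443'
  # outbound: SYN, SYN/ACK, ACK, then one large Client Hello from the Juju host
  # inbound: only ACKs, no TLS records at all, until the Go client gives up
  #          (net/http's default TLS handshake timeout is 10 seconds)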


This is accurate. This occurs before we have even gotten to the migration part. We get the same problem once we get past the identity provider (which we can do if we fetch the URL with a different program). Other programs on the same systems, such as cURL, can fetch these URLs without any problems.

We have a similar situation, but with a different firewall.

@awnns got some indication that our jumbo-frame MTU (which we use internally) might cause issues with TLS when traffic is sent externally over links with MTU 1500. I think it’s far-fetched, but we are dead in the water and grasping for clues on how to fix this.

In our situation we are using an MTU of 9000 as well, although I am not seeing any fragmentation in my tcpdump.

I am looking into this precisely, and while I have not figured out all the details (this is not my specialty and I am far out of my depth on this one), it would appear that something prevents PMTUD from working properly. Specifically, when I send large ICMP packets with ping, they never receive any replies unless they are under 1500 bytes. I also never receive any ping responses from api.jujucharms.com, which is where the identity provider lives. I am not entirely sure why we never receive the ICMP messages asking us to lower the MTU, but I do observe that cURL can fetch from the server regardless of the MTU setting, since its Client Hello message is only around 500 bytes, as opposed to the 1564 bytes that Go’s internal TLS library uses.
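For anyone who wants to reproduce the probing, this is roughly what I ran (the remote host is a placeholder; -M do sets the Don’t Fragment bit and -s is the ICMP payload size, so 1472 + 28 bytes of headers = a 1500-byte packet):

$ ping -M do -s 1472 <remote-host>   # fits in a standard 1500-byte MTU, gets replies
$ ping -M do -s 2972 <remote-host>   # a 3000-byte DF packet; on a healthy path this should
                                     # trigger "Frag needed" ICMP errors, here it just goes silent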

To sum up (I am omitting a lot of details here): there are clear signs that something along these lines is going on (I believe the technical term is that we are experiencing an “MTU black hole”), but I have not been able to explain all of my observations yet.

As a continuation of all this, I heard speculation that the problem had something to do with Go versions. I recompiled the Juju 3.6.2 controller binary with Go 1.22.12 and repeated my experiments. It turns out that the Client Hello sent by the new binary is much smaller, around 300 bytes, so it comfortably fits within the bounds of even the most restrictive network conditions. This also translated into tangible results: there were no more TLS timeouts.
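For completeness, this is the rough shape of the rebuild (simplified; the real build goes through the repo’s Makefile, and the exact tag name may differ):

$ git clone https://github.com/juju/juju.git && cd juju
$ git checkout juju-3.6.2          # or whatever the 3.6.2 release tag is called
  # if go.mod requires a Go newer than 1.22, its go directive has to be lowered first
$ go version                       # go1.22.12 on this machine
$ go build -o jujud ./cmd/jujud    # the controller/agent binary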

@awnns what were the changes to your go.mod/go.sum files?

I think this might now also affect charmcraft.

I’m experiencing difficulties when interacting with the charmhub…

 charmcraft login -v
Starting charmcraft, version 3.4.2
Logging execution to '/home/erik/.local/state/charmcraft/log/charmcraft-20250212-021628.642306.log'
Opening an authorization web page in your browser.
If it does not open, please open this URL:
 https://api.jujucharms.com/identity/login?did=f647585cXXXXXXXXXXXXX

Checking the log:

2025-02-12 02:16:28.643 Log verbosity level set to BRIEF
2025-02-12 02:16:28.643 Preparing application...
2025-02-12 02:16:28.643 Configuring application...
2025-02-12 02:16:28.644 Setting up ConfigService
2025-02-12 02:16:28.655 Build plan: platform=None, build_for=None
2025-02-12 02:16:28.656 Running charmcraft login on host
2025-02-12 02:16:28.656 Setting up StoreService
2025-02-12 02:16:28.831 HTTP 'POST' for 'https://api.charmhub.io/v1/tokens' with params None and headers {'Content-Type': 'application/json', 'Accept': 'application/json', 'User-Agent': 'Charmcraft/3.4.2 (Linux 6.8.0-52-generic; x86_64; CPython 3.10.12; Ubuntu 22.04)'}
2025-02-12 02:16:29.450 HTTP 'GET' for 'https://api.jujucharms.com/identity/wait-token?did=f647585ceb546f9fa734fbcb8daa98356eeab32ff355cXXXXXXXXXXXXXXX' with params None and headers {'User-Agent': 'Charmcraft/3.4.2 (Linux 6.8.0-52-generic; x86_64; CPython 3.10.12; Ubuntu 22.04)'}
2025-02-12 02:20:58.310 HTTP 'POST' for 'https://api.charmhub.io/v1/tokens/exchange' with params None and headers {'Macaroons': '<macaroon>', 'User-Agent': 'Charmcraft/3.4.2 (Linux 6.8.0-52-generic; x86_64; CPython 3.10.12; Ubuntu 22.04)'}

Blocks forever.

This is a new experimental post-quantum key exchange mechanism that was enabled by default in Go 1.23:

The experimental post-quantum key exchange mechanism X25519Kyber768Draft00 is now enabled by default when Config.CurvePreferences is nil. The default can be reverted by adding tlskyber=0 to the GODEBUG environment variable.

It’s possible to turn this off using the GODEBUG environment variable, which might resolve the problems we’re seeing.
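A rough sketch of how that could look (the client-side invocation is just a one-off test, and the controller-side unit name assumes the agent runs under systemd as jujud-machine-<N> inside the controller container):

  # Client side, one-off test:
$ GODEBUG=tlskyber=0 juju deploy tiny-bash

  # Controller side (sketch):
$ sudo systemctl edit jujud-machine-0
  [Service]
  Environment=GODEBUG=tlskyber=0
$ sudo systemctl restart jujud-machine-0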

Where should it be set? Client and/or controller? For the snap?