Juju refresh machine IP

When a machine’s IP address changes because of DHCP and a restart, there should be a way to quickly inform Juju about it, or to make Juju refresh the information it has about the machine.

Hey John, could you provide some more information on the specific setup where you encountered this issue? In your setup, does Juju eventually rediscover the machine, or does it stay in an error state?

Hello @aflynn, it persistently stays in the error state. In fact, it has been in that state since the day I posted the question. That prompted me to go down the rabbit hole of debugging the cause. The codebase is quite complex, and although I’m making some progress grokking it, this feels like an unnecessary detour.

After debugging for a while, I was able to narrow it down to the fact that the new IP address is not in the list of known hosts. Does anyone know how to fix this?

Also, contrary to my previous assumption, the new IP address does show up in the list of IP addresses for the machine.

The --no-host-key-checks option lets me log in to the machine even when it is in an error state, and I see something like “permanently added to the list of known hosts”. I expected the second attempt to go through without the option, but it seems “permanently added” was not permanent after all :smile:, so the machine stays in the error state and I cannot log in without --no-host-key-checks.

Thanks for doing the extra digging. Which cloud are your controller/machines on? This sounds like it could be an issue with the way Juju interacts with it.

Also what is the specific error? Is the error coming from the charm or Juju itself? There may be more context in the model logs (juju debug-log -m <model>) and the controller logs (juju debug-log -m controller).

@aflynn it’s an LXD cloud

johnbendi@generalbendi:~$ juju status
Model    Controller    Cloud/Region         Version  SLA          Timestamp
k8s-001  overlord-lxd  localhost/localhost  3.6.0    unsupported  06:27:39+01:00

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
calico                    v3.27.4  error        5  calico                    latest/stable  123  no       hook failed: "config-changed"
containerd                1.7.12   active       5  containerd                latest/stable   83  no       Container runtime available
easyrsa                   3.0.1    active     1/2  easyrsa                   latest/stable   66  no       Certificate Authority connected.
etcd                      3.4.22   blocked    3/4  etcd                      latest/stable  774  no       UnHealthy with 3 known peers
kubeapi-load-balancer     1.18.0   active       1  kubeapi-load-balancer     latest/stable  169  yes      Ready
kubernetes-control-plane  1.31.4   waiting      2  kubernetes-control-plane  latest/stable  558  no       Waiting for 1 kube-system pod to start
kubernetes-worker         1.31.4   blocked      3  kubernetes-worker         latest/stable  265  yes      Not all snaps are available on channel=1.31/stable

Unit                         Workload  Agent  Machine  Public address                          Ports         Message
easyrsa/0                    unknown   lost   0        fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7                agent lost, see 'juju show-status-log easyrsa/0'
easyrsa/1*                   active    idle   13       fd42:c24d:2827:4f4c:216:3eff:fe25:3c19                Certificate Authority connected.
etcd/0*                      active    idle   1        10.198.60.51                            2379/tcp      Healthy with 4 known peers
etcd/1                       unknown   lost   2        10.198.60.87                            2379/tcp      agent lost, see 'juju show-status-log etcd/1'
etcd/2                       active    idle   3        10.198.60.134                           2379/tcp      Healthy with 4 known peers
etcd/3                       active    idle   12       10.198.60.102                           2379/tcp      Healthy with 4 known peers
kubeapi-load-balancer/0*     active    idle   4        10.198.60.26                            443,6443/tcp  Ready
kubernetes-control-plane/0   waiting   idle   5        10.198.60.118                           6443/tcp      Waiting for 1 kube-system pod to start
  calico/3                   error     idle            10.198.60.118                                         hook failed: "config-changed"
  containerd/3               active    idle            10.198.60.118                                         Container runtime available
kubernetes-control-plane/1*  waiting   idle   6        10.198.60.195                           6443/tcp      Waiting for 1 kube-system pod to start
  calico/4                   error     idle            10.198.60.195                                         hook failed: "config-changed"
  containerd/4               active    idle            10.198.60.195                                         Container runtime available
kubernetes-worker/0          blocked   idle   7        10.198.60.47                            80,443/tcp    Not all snaps are available on channel=1.31/stable
  calico/0                   waiting   idle            10.198.60.47                                          calico: Deployment/kube-system/calico-kube-controllers is not Available, calico: PodDisruptionBudget/kube-system/cali...
  containerd/0               active    idle            10.198.60.47                                          Container runtime available
kubernetes-worker/2          blocked   idle   9        10.198.60.157                           80,443/tcp    Not all snaps are available on channel=1.31/stable
  calico/1                   waiting   idle            10.198.60.157                                         calico: Deployment/kube-system/calico-kube-controllers is not Available, calico: PodDisruptionBudget/kube-system/cali...
  containerd/1               active    idle            10.198.60.157                                         Container runtime available
kubernetes-worker/3*         waiting   idle   11       10.198.60.88                            80,443/tcp    Waiting for certificate authority
  calico/5*                  waiting   idle            10.198.60.88                                          calico: Deployment/kube-system/calico-kube-controllers is not Available, calico: PodDisruptionBudget/kube-system/cali...
  containerd/5*              active    idle            10.198.60.88                                          Container runtime available

Machine  State    Address                                 Inst id         Base          AZ  Message
0        down     fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7  juju-9317c7-0   ubuntu@22.04      Running
1        started  10.198.60.51                            juju-9317c7-1   ubuntu@22.04      Running
2        down     10.198.60.87                            juju-9317c7-2   ubuntu@22.04      Running
3        started  10.198.60.134                           juju-9317c7-3   ubuntu@22.04      Running
4        started  10.198.60.26                            juju-9317c7-4   ubuntu@22.04      Running
5        started  10.198.60.118                           juju-9317c7-5   ubuntu@22.04      Running
6        started  10.198.60.195                           juju-9317c7-6   ubuntu@22.04      Running
7        started  10.198.60.47                            juju-9317c7-7   ubuntu@22.04      Running
9        started  10.198.60.157                           juju-9317c7-9   ubuntu@22.04      Running
10       started  fd42:c24d:2827:4f4c:216:3eff:fecc:7563  juju-9317c7-10  ubuntu@24.04      Running
11       started  10.198.60.88                            juju-9317c7-11  ubuntu@22.04      Running
12       started  10.198.60.102                           juju-9317c7-12  ubuntu@22.04      Running
13       started  fd42:c24d:2827:4f4c:216:3eff:fe25:3c19  juju-9317c7-13  ubuntu@22.04      Running

Notice that machine-0 and machine-2 are down, and I had to add machine-12 and machine-13 to work around them. Machine 12 wouldn’t join etcd without easyRSA on machine-0, so machine-13 came to the rescue.

On machine-0, I have the relevant part of the machine agent log below:

2025-01-11 04:53:56 INFO juju.cmd supercommand.go:56 running jujud [3.6.0 6dec77a01480916689430d38ef3e032cb1e06b78 gc go1.23.3]
2025-01-11 04:53:56 DEBUG juju.cmd supercommand.go:57   args: []string{"/var/lib/juju/tools/machine-0/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "0", "--debug"}
2025-01-11 04:53:56 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 1
2025-01-11 04:53:56 DEBUG juju.agent agent.go:600 read agent config, format "2.0"
2025-01-11 04:53:56 INFO juju.worker.upgradesteps worker.go:60 upgrade steps for 3.6.0 have already been run.
2025-01-11 04:53:56 DEBUG juju.cmd.jujud runner.go:416 start "engine"
2025-01-11 04:53:56 INFO juju.cmd.jujud runner.go:592 start "engine"
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "termination-signal-handler" manifold worker started at 2025-01-11 04:53:56.627307144 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "charmhub-http-client" manifold worker started at 2025-01-11 04:53:56.627630912 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "agent" manifold worker started at 2025-01-11 04:53:56.627658294 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "clock" manifold worker started at 2025-01-11 04:53:56.62793581 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.apicaller connect.go:129 connecting with old password
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "upgrade-steps-gate" manifold worker started at 2025-01-11 04:53:56.6293256 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "syslog" manifold worker started at 2025-01-11 04:53:56.632331038 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "state-config-watcher" manifold worker started at 2025-01-11 04:53:56.632482745 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "upgrade-steps-flag" manifold worker started at 2025-01-11 04:53:56.632676917 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "upgrade-check-gate" manifold worker started at 2025-01-11 04:53:56.633939272 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.introspection worker.go:125 introspection worker listening on "/var/lib/juju/agents/machine-0/introspection.socket"
2025-01-11 04:53:56 DEBUG juju.cmd.jujud runner.go:424 "engine" started
2025-01-11 04:53:56 DEBUG juju.worker.introspection worker.go:150 stats worker now serving
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "api-config-watcher" manifold worker started at 2025-01-11 04:53:56.637954723 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.api apiclient.go:1036 successfully dialed "wss://10.198.60.170:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
2025-01-11 04:53:56 INFO juju.api apiclient.go:571 connection established to "wss://10.198.60.170:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "is-controller-flag" manifold worker started at 2025-01-11 04:53:56.642722767 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "is-not-controller-flag" manifold worker started at 2025-01-11 04:53:56.642954896 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:580 "upgrade-check-flag" manifold worker started at 2025-01-11 04:53:56.645049596 +0000 UTC
2025-01-11 04:53:56 DEBUG juju.worker.apicaller connect.go:160 [f7411e] failed to connect
2025-01-11 04:53:56 ERROR juju.worker.apicaller connect.go:209 Failed to connect to controller: invalid entity name or password (unauthorized access)
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:618 "api-caller" manifold worker stopped: [f7411e] "machine-0" cannot open api: connection permanently impossible
stack trace:
github.com/juju/juju/worker/apicaller.init:42: connection permanently impossible
github.com/juju/juju/cmd/jujud/agent/machine.commonManifolds.Manifold.ManifoldConfig.startFunc.func35:97: [f7411e] "machine-0" cannot open api
2025-01-11 04:53:56 INFO juju.worker.stateconfigwatcher manifold.go:120 tomb dying
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "state-config-watcher" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "termination-signal-handler" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "upgrade-steps-gate" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "is-controller-flag" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "upgrade-steps-flag" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "upgrade-check-flag" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "charmhub-http-client" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "agent" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "upgrade-check-gate" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "api-config-watcher" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "clock" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "is-not-controller-flag" manifold worker completed successfully
2025-01-11 04:53:56 DEBUG juju.worker.dependency engine.go:603 "syslog" manifold worker completed successfully
2025-01-11 04:53:56 INFO juju.cmd.jujud runner.go:623 stopped "engine", err: agent should be terminated
2025-01-11 04:53:56 DEBUG juju.cmd.jujud runner.go:430 "engine" done: agent should be terminated
2025-01-11 04:53:56 DEBUG juju.cmd.jujud runner.go:485 error "engine": agent should be terminated
2025-01-11 04:53:56 ERROR juju.cmd.jujud runner.go:487 fatal error "engine": agent should be terminated
2025-01-11 04:53:56 INFO cmd supercommand.go:556 command finished
2025-01-11 04:53:56 DEBUG juju.cmd.jujud main.go:286 jujud complete, code 0, err <nil>

This must somehow be due to a machine IP change, but I can’t really confirm that because, naively, I didn’t record the addresses at the time. Another surprising thing is that juju ssh 0 works but juju ssh 2 doesn’t.

johnbendi@generalbendi:~$ juju --debug ssh 0
07:17:47 INFO  juju.cmd supercommand.go:56 running juju [3.6.1 cdb5fe45b78a4701a8bc8369c5a50432358afbd3 gc go1.23.3]
07:17:47 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/29241/bin/juju", "--debug", "ssh", "0"}
07:17:47 INFO  juju.juju api.go:86 connecting to API addresses: [10.198.60.170:17070 [fd42:c24d:2827:4f4c:216:3eff:fe98:9af0]:17070]
07:17:47 DEBUG juju.api apiclient.go:1035 successfully dialed "wss://[fd42:c24d:2827:4f4c:216:3eff:fe98:9af0]:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
07:17:47 INFO  juju.api apiclient.go:570 connection established to "wss://[fd42:c24d:2827:4f4c:216:3eff:fe98:9af0]:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
07:17:47 DEBUG juju.cmd.juju.ssh ssh_machine.go:345 proxy-ssh is false
07:17:47 DEBUG juju.network.ssh reachable.go:156 dialing [fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7]:22 to check host keys
07:17:47 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.53:22 to check host keys
07:17:47 DEBUG juju.network.ssh reachable.go:169 connected to 10.198.60.53:22, initiating ssh handshake
07:17:47 DEBUG juju.network.ssh reachable.go:169 connected to [fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7]:22, initiating ssh handshake
07:17:47 DEBUG juju.network.ssh reachable.go:98 accepted host key for: [fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7]:22
07:17:47 INFO  juju.network.ssh reachable.go:223 found [fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7]:22 has an acceptable ssh key
07:17:47 DEBUG juju.cmd.juju.ssh ssh_machine.go:495 using target "0" address "fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7"
07:17:47 DEBUG juju.network.ssh reachable.go:98 accepted host key for: 10.198.60.53:22
07:17:47 DEBUG juju.network.ssh reachable.go:181 ssh: handshake failed: host key was accepted, but search was stopped
07:17:47 DEBUG juju.utils.ssh ssh.go:305 using OpenSSH ssh client
Authenticated to fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7 ([fd42:c24d:2827:4f4c:216:3eff:fe6e:93d7]:22) using "publickey".
Welcome to Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-48-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information disabled due to load higher than 1.0


Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status

New release '24.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


Last login: Sat Jan 11 05:24:45 2025 from 10.198.60.1

For machine 2, juju --debug ssh 2:

johnbendi@generalbendi:~$ juju --debug ssh 2
06:37:43 INFO  juju.cmd supercommand.go:56 running juju [3.6.1 cdb5fe45b78a4701a8bc8369c5a50432358afbd3 gc go1.23.3]
06:37:43 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/29241/bin/juju", "--debug", "ssh", "2"}
06:37:43 INFO  juju.juju api.go:86 connecting to API addresses: [[fd42:c24d:2827:4f4c:216:3eff:fe98:9af0]:17070 10.198.60.170:17070]
06:37:43 DEBUG juju.api apiclient.go:1035 successfully dialed "wss://10.198.60.170:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
06:37:43 INFO  juju.api apiclient.go:570 connection established to "wss://10.198.60.170:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
06:37:43 DEBUG juju.cmd.juju.ssh ssh_machine.go:345 proxy-ssh is false
06:37:43 DEBUG juju.network.ssh reachable.go:156 dialing [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 to check host keys
06:37:43 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.87:22 to check host keys
06:37:43 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.86:22 to check host keys
06:37:43 DEBUG juju.network.ssh reachable.go:169 connected to 10.198.60.86:22, initiating ssh handshake
06:37:43 DEBUG juju.network.ssh reachable.go:169 connected to [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22, initiating ssh handshake
06:37:43 DEBUG juju.network.ssh reachable.go:110 host key for 10.198.60.86:22 not in our accepted set: log at TRACE to see raw keys
06:37:43 DEBUG juju.network.ssh reachable.go:110 host key for [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 not in our accepted set: log at TRACE to see raw keys
06:37:46 DEBUG juju.network.ssh reachable.go:159 dial 10.198.60.87:22 failed with: dial tcp 10.198.60.87:22: connect: no route to host
06:37:46 DEBUG juju.cmd.juju.ssh ssh_machine.go:491 getting target "2" address(es) failed: cannot connect to any address: [10.198.60.86:22 10.198.60.87:22 [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22] (retrying)
06:37:48 DEBUG juju.network.ssh reachable.go:156 dialing [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 to check host keys
06:37:48 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.86:22 to check host keys
06:37:48 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.87:22 to check host keys
06:37:48 DEBUG juju.network.ssh reachable.go:169 connected to [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22, initiating ssh handshake
06:37:48 DEBUG juju.network.ssh reachable.go:169 connected to 10.198.60.86:22, initiating ssh handshake
06:37:48 DEBUG juju.network.ssh reachable.go:110 host key for 10.198.60.86:22 not in our accepted set: log at TRACE to see raw keys
06:37:48 DEBUG juju.network.ssh reachable.go:110 host key for [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 not in our accepted set: log at TRACE to see raw keys
06:37:51 DEBUG juju.network.ssh reachable.go:159 dial 10.198.60.87:22 failed with: dial tcp 10.198.60.87:22: connect: no route to host
06:37:51 DEBUG juju.cmd.juju.ssh ssh_machine.go:491 getting target "2" address(es) failed: cannot connect to any address: [10.198.60.86:22 10.198.60.87:22 [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22] (retrying)
06:37:53 DEBUG juju.network.ssh reachable.go:156 dialing [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 to check host keys
06:37:53 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.87:22 to check host keys
06:37:53 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.86:22 to check host keys
06:37:53 DEBUG juju.network.ssh reachable.go:169 connected to 10.198.60.86:22, initiating ssh handshake
06:37:53 DEBUG juju.network.ssh reachable.go:169 connected to [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22, initiating ssh handshake
06:37:53 DEBUG juju.network.ssh reachable.go:110 host key for [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 not in our accepted set: log at TRACE to see raw keys
06:37:53 DEBUG juju.network.ssh reachable.go:110 host key for 10.198.60.86:22 not in our accepted set: log at TRACE to see raw keys

But juju --debug ssh --no-host-key-checks 2 works:

johnbendi@generalbendi:~$ juju --debug ssh --no-host-key-checks 2
06:41:38 INFO  juju.cmd supercommand.go:56 running juju [3.6.1 cdb5fe45b78a4701a8bc8369c5a50432358afbd3 gc go1.23.3]
06:41:38 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/29241/bin/juju", "--debug", "ssh", "--no-host-key-checks", "2"}
06:41:38 INFO  juju.juju api.go:86 connecting to API addresses: [10.198.60.170:17070 [fd42:c24d:2827:4f4c:216:3eff:fe98:9af0]:17070]
06:41:38 DEBUG juju.api apiclient.go:1035 successfully dialed "wss://10.198.60.170:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
06:41:38 INFO  juju.api apiclient.go:570 connection established to "wss://10.198.60.170:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
06:41:38 DEBUG juju.cmd.juju.ssh ssh_machine.go:345 proxy-ssh is false
06:41:38 DEBUG juju.network.ssh reachable.go:156 dialing [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 to check host keys
06:41:38 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.87:22 to check host keys
06:41:38 DEBUG juju.network.ssh reachable.go:156 dialing 10.198.60.86:22 to check host keys
06:41:38 DEBUG juju.network.ssh reachable.go:169 connected to [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22, initiating ssh handshake
06:41:38 DEBUG juju.network.ssh reachable.go:169 connected to 10.198.60.86:22, initiating ssh handshake
06:41:38 DEBUG juju.network.ssh reachable.go:98 accepted host key for: [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22
06:41:38 INFO  juju.network.ssh reachable.go:223 found [fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22 has an acceptable ssh key
06:41:38 DEBUG juju.cmd.juju.ssh ssh_machine.go:495 using target "2" address "fd42:c24d:2827:4f4c:216:3eff:fe13:6951"
06:41:38 DEBUG juju.utils.ssh ssh.go:305 using OpenSSH ssh client
06:41:38 DEBUG juju.network.ssh reachable.go:98 accepted host key for: 10.198.60.86:22
06:41:38 DEBUG juju.network.ssh reachable.go:181 ssh: handshake failed: host key was accepted, but search was stopped
Warning: Permanently added 'fd42:c24d:2827:4f4c:216:3eff:fe13:6951' (ED25519) to the list of known hosts.
Authenticated to fd42:c24d:2827:4f4c:216:3eff:fe13:6951 ([fd42:c24d:2827:4f4c:216:3eff:fe13:6951]:22) using "publickey".
Last login: Fri Jan 10 09:19:12 2025 from 10.198.60.1

ubuntu@juju-9317c7-2:~$ 
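The juju ssh traces above are easier to compare when the outcome for each address is collected in one place. The following is only a sketch: the three patterns are copied from the juju.network.ssh debug lines quoted in this thread, and the function name summarize_ssh_debug is made up for illustration, not part of Juju.

```python
import re
from collections import defaultdict

# Patterns taken from the debug lines quoted in this thread; other Juju
# versions may word these messages differently (assumption).
PATTERNS = {
    "accepted": re.compile(r"accepted host key for: (\S+)"),
    "rejected": re.compile(r"host key for (\S+) not in our accepted set"),
    "unreachable": re.compile(r"dial (\S+) failed with"),
}


def summarize_ssh_debug(text):
    """Map each 'host:port' in the debug output to the set of outcomes logged for it."""
    outcomes = defaultdict(set)
    for line in text.splitlines():
        for outcome, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                outcomes[match.group(1)].add(outcome)
    return dict(outcomes)


# Two lines from the failing "juju --debug ssh 2" run.
sample = """\
06:37:43 DEBUG juju.network.ssh reachable.go:110 host key for 10.198.60.86:22 not in our accepted set: log at TRACE to see raw keys
06:37:46 DEBUG juju.network.ssh reachable.go:159 dial 10.198.60.87:22 failed with: dial tcp 10.198.60.87:22: connect: no route to host
"""

for address, seen in sorted(summarize_ssh_debug(sample).items()):
    print(address, sorted(seen))
```

For machine 2, this shows 10.198.60.87 (the address in juju status) as unreachable while 10.198.60.86 answers with a host key Juju does not accept, which would be consistent with the machine having moved to a new address.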

The machine-2 agent log looks similar to machine-0:

2025-01-11 05:43:23 INFO juju.cmd supercommand.go:56 running jujud [3.6.0 6dec77a01480916689430d38ef3e032cb1e06b78 gc go1.23.3]
2025-01-11 05:43:23 DEBUG juju.cmd supercommand.go:57   args: []string{"/var/lib/juju/tools/machine-2/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "2", "--debug"}
2025-01-11 05:43:23 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 2
2025-01-11 05:43:23 DEBUG juju.agent agent.go:600 read agent config, format "2.0"
2025-01-11 05:43:23 INFO juju.worker.upgradesteps worker.go:60 upgrade steps for 3.6.0 have already been run.
2025-01-11 05:43:23 DEBUG juju.cmd.jujud runner.go:416 start "engine"
2025-01-11 05:43:23 INFO juju.cmd.jujud runner.go:592 start "engine"
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "syslog" manifold worker started at 2025-01-11 05:43:23.109597003 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "upgrade-check-gate" manifold worker started at 2025-01-11 05:43:23.109872965 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "agent" manifold worker started at 2025-01-11 05:43:23.110321085 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "charmhub-http-client" manifold worker started at 2025-01-11 05:43:23.110764418 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "clock" manifold worker started at 2025-01-11 05:43:23.110947983 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "api-config-watcher" manifold worker started at 2025-01-11 05:43:23.111502164 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "upgrade-steps-gate" manifold worker started at 2025-01-11 05:43:23.111610909 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "termination-signal-handler" manifold worker started at 2025-01-11 05:43:23.112374698 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.introspection worker.go:125 introspection worker listening on "/var/lib/juju/agents/machine-2/introspection.socket"
2025-01-11 05:43:23 DEBUG juju.cmd.jujud runner.go:424 "engine" started
2025-01-11 05:43:23 DEBUG juju.worker.introspection worker.go:150 stats worker now serving
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "upgrade-check-flag" manifold worker started at 2025-01-11 05:43:23.119100163 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "state-config-watcher" manifold worker started at 2025-01-11 05:43:23.120178505 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.worker.apicaller connect.go:129 connecting with old password
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:580 "upgrade-steps-flag" manifold worker started at 2025-01-11 05:43:23.12356202 +0000 UTC
2025-01-11 05:43:23 DEBUG juju.api apiclient.go:1036 successfully dialed "wss://[fd42:c24d:2827:4f4c:216:3eff:fe98:9af0]:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
2025-01-11 05:43:23 INFO juju.api apiclient.go:571 connection established to "wss://[fd42:c24d:2827:4f4c:216:3eff:fe98:9af0]:17070/model/f7411ed0-3b7a-44c6-899d-b8ed029317c7/api"
2025-01-11 05:43:23 DEBUG juju.worker.apicaller connect.go:160 [f7411e] failed to connect
2025-01-11 05:43:23 ERROR juju.worker.apicaller connect.go:209 Failed to connect to controller: invalid entity name or password (unauthorized access)
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:618 "api-caller" manifold worker stopped: [f7411e] "machine-2" cannot open api: connection permanently impossible
stack trace:
github.com/juju/juju/worker/apicaller.init:42: connection permanently impossible
github.com/juju/juju/cmd/jujud/agent/machine.commonManifolds.Manifold.ManifoldConfig.startFunc.func35:97: [f7411e] "machine-2" cannot open api
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "upgrade-steps-flag" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "upgrade-steps-gate" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "syslog" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "termination-signal-handler" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "upgrade-check-flag" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "agent" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "charmhub-http-client" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "clock" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "upgrade-check-gate" manifold worker completed successfully
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "api-config-watcher" manifold worker completed successfully
2025-01-11 05:43:23 INFO juju.worker.stateconfigwatcher manifold.go:120 tomb dying
2025-01-11 05:43:23 DEBUG juju.worker.dependency engine.go:603 "state-config-watcher" manifold worker completed successfully
2025-01-11 05:43:23 INFO juju.cmd.jujud runner.go:623 stopped "engine", err: agent should be terminated
2025-01-11 05:43:23 DEBUG juju.cmd.jujud runner.go:430 "engine" done: agent should be terminated
2025-01-11 05:43:23 DEBUG juju.cmd.jujud runner.go:485 error "engine": agent should be terminated
2025-01-11 05:43:23 DEBUG juju.cmd.jujud.agent.addons addons.go:77 engine stopped, stopping introspection
2025-01-11 05:43:23 ERROR juju.cmd.jujud runner.go:487 fatal error "engine": agent should be terminated
2025-01-11 05:43:23 DEBUG juju.worker.introspection worker.go:159 stats worker closing listener
2025-01-11 05:43:23 INFO cmd supercommand.go:556 command finished
2025-01-11 05:43:23 DEBUG juju.worker.introspection worker.go:163 stats worker finished
2025-01-11 05:43:23 DEBUG juju.cmd.jujud.agent.addons addons.go:80 introspection stopped
2025-01-11 05:43:23 DEBUG juju.cmd.jujud main.go:286 jujud complete, code 0, err <nil>
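Both agent logs are mostly DEBUG output, so the lines that matter are easy to miss. A minimal sketch for pulling out only the ERROR lines, assuming every entry follows the "date time LEVEL module file:line message" layout seen above (the helper name lines_at_level is hypothetical):

```python
import re

# Layout inferred from the agent log lines in this thread (assumption):
# "<date> <time> <LEVEL> <module> <file>:<line> <message>"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) (?P<source>\S+:\d+) (?P<message>.*)$"
)


def lines_at_level(log_text, level="ERROR"):
    """Return (module, message) pairs for every log line at the given level."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        # Lines that do not fit the layout (e.g. "stack trace:") are skipped.
        if match and match.group("level") == level:
            found.append((match.group("module"), match.group("message")))
    return found


# Three lines from the machine-0 log above.
sample = """\
2025-01-11 04:53:56 DEBUG juju.worker.apicaller connect.go:160 [f7411e] failed to connect
2025-01-11 04:53:56 ERROR juju.worker.apicaller connect.go:209 Failed to connect to controller: invalid entity name or password (unauthorized access)
stack trace:
"""

for module, message in lines_at_level(sample):
    print(module, "|", message)
```

Run against either log, this leaves the "invalid entity name or password" failure from juju.worker.apicaller and the fatal "agent should be terminated" error from juju.cmd.jujud.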

Hmm, it looks like the failing machines are connecting correctly to the controller’s API but are being denied access, either because the controller has no record of the machine (identified by its number) or because the machine is trying to access the controller with a bad password.

According to the status it seems the controller knows about the machines, so I can only assume that somehow they have ended up with a bad password.

To confirm this, would you be able to check whether the apipassword and oldpassword in the file /var/lib/juju/agents/unit-<appname>-<unit-num>/agent.conf for the broken units differ from the passwords in the working units? (Don’t post them, though, because they could be sensitive info!)
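One way to make that comparison without having the raw passwords on screen is to compare short hashes of them instead. This is a sketch under assumptions: it expects agent.conf to contain plain "apipassword: value" and "oldpassword: value" lines (the key names come from this thread, the exact line layout is assumed), and password_fingerprints is a made-up helper name.

```python
import hashlib
import re
import sys

# Assumption: each password sits on its own "key: value" line in agent.conf.
KEY_LINE = re.compile(r"^(apipassword|oldpassword):\s*(.*?)\s*$")


def password_fingerprints(conf_text):
    """Return {key: first 12 hex chars of the SHA-256 of the value} for both password keys."""
    fingerprints = {}
    for line in conf_text.splitlines():
        match = KEY_LINE.match(line)
        if match:
            key, value = match.groups()
            value = value.strip("'\"")  # drop optional quoting around the value
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
            fingerprints[key] = digest[:12]
    return fingerprints


if __name__ == "__main__":
    # Usage: pass one or more agent.conf paths; reading them usually needs root.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            print(path, password_fingerprints(handle.read()))
```

Running it on the agent.conf of a broken agent and of a working one prints only the fingerprints, so the values can be compared side by side. Treat the fingerprints as sensitive too; they are meant for a local comparison.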

@aflynn thanks. I have actually replaced the failing machines, but next time it happens I’ll refer back here to test whether the password could be the issue.

Fair, that is certainly the best workaround.