Can't add etcd unit: “waiting to retry etcd registration”

While migrating my small Kubernetes cluster from the Ubuntu 18.04 series to 20.04, it was time to migrate the first etcd unit.

However, I’m stuck: the unit sits in the waiting state with the message “waiting to retry etcd registration”.

The debug-log of that unit says:

unit-etcd-50: 19:15:57 INFO unit.etcd/50.juju-log Reactive main running for hook update-status
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Initializing Leadership Layer (is follower)
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Initializing Snap Layer
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/tls_client.py:18:store_ca
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/tls_client.py:44:store_server
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/tls_client.py:71:store_client
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:143:set_app_version
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:157:prepare_tls_certificates
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:282:set_db_ingress_address
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:289:send_cluster_connection_details
unit-etcd-50: 19:15:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:435:register_node_with_leader
unit-etcd-50: 19:15:58 ERROR unit.etcd/50.juju-log ['/snap/bin/etcd.etcdctl', '--endpoint', 'https://192.168.1.222:2379', 'member', 'list']
unit-etcd-50: 19:15:59 ERROR unit.etcd/50.juju-log {'ETCDCTL_API': '2', 'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt', 'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt', 'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key'}
unit-etcd-50: 19:15:59 ERROR unit.etcd/50.juju-log b'Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.1.222:2379: connect: connection refused\n\nerror #0: dial tcp 192.168.1.222:2379: connect: connection refused\n\n'
unit-etcd-50: 19:15:59 ERROR unit.etcd/50.juju-log None
unit-etcd-50: 19:15:59 INFO unit.etcd/50.juju-log etcdctl.register failed, will retry
unit-etcd-50: 19:15:59 INFO unit.etcd/50.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:79:joined:certificates
unit-etcd-50: 19:15:59 INFO unit.etcd/50.juju-log status-set: waiting: Waiting to retry etcd registration
unit-etcd-50: 19:15:59 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Reactive main running for hook update-status
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Initializing Leadership Layer (is follower)
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Initializing Snap Layer
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/tls_client.py:18:store_ca
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/tls_client.py:44:store_server
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/tls_client.py:71:store_client
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:143:set_app_version
unit-etcd-50: 19:21:57 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:157:prepare_tls_certificates
unit-etcd-50: 19:21:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:282:set_db_ingress_address
unit-etcd-50: 19:21:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:289:send_cluster_connection_details
unit-etcd-50: 19:21:58 INFO unit.etcd/50.juju-log Invoking reactive handler: reactive/etcd.py:435:register_node_with_leader
unit-etcd-50: 19:21:58 ERROR unit.etcd/50.juju-log ['/snap/bin/etcd.etcdctl', '--endpoint', 'https://192.168.1.222:2379', 'member', 'list']
unit-etcd-50: 19:21:58 ERROR unit.etcd/50.juju-log {'ETCDCTL_API': '2', 'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt', 'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt', 'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key'}
unit-etcd-50: 19:21:58 ERROR unit.etcd/50.juju-log b'Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.1.222:2379: connect: connection refused\n\nerror #0: dial tcp 192.168.1.222:2379: connect: connection refused\n\n'
unit-etcd-50: 19:21:58 ERROR unit.etcd/50.juju-log None
unit-etcd-50: 19:21:58 INFO unit.etcd/50.juju-log etcdctl.register failed, will retry
unit-etcd-50: 19:21:58 INFO unit.etcd/50.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:79:joined:certificates
unit-etcd-50: 19:21:58 INFO unit.etcd/50.juju-log status-set: waiting: Waiting to retry etcd registration
unit-etcd-50: 19:21:58 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)
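To poke at this outside the hook context, you can replay the exact member-list call the charm makes; the command and environment below are copied verbatim from the log above (run on the unit, as root):

```python
import subprocess

# Same command and environment as in the charm's log above (etcdctl v2 API).
env = {
    'ETCDCTL_API': '2',
    'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt',
    'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt',
    'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key',
}
cmd = ['/snap/bin/etcd.etcdctl',
       '--endpoint', 'https://192.168.1.222:2379',
       'member', 'list']
result = subprocess.run(cmd, env=env, capture_output=True, text=True)
print(result.returncode)
print(result.stdout or result.stderr)
```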

The etcd logs on the machine itself, via journalctl:

Mar 23 18:26:35 node1 systemd[1]: snap.etcd.etcd.service: Scheduled restart job, restart counter is at 1397.
Mar 23 18:26:35 node1 systemd[1]: Stopped Service for snap application etcd.etcd.
Mar 23 18:26:35 node1 systemd[1]: Started Service for snap application etcd.etcd.
Mar 23 18:26:36 node1 etcd.etcd[2702955]: Running as system with data in /var/snap/etcd/233
Mar 23 18:26:36 node1 etcd.etcd[2702955]: Configuration from /var/snap/etcd/common/etcd.conf.yml
Mar 23 18:26:36 node1 etcd.etcd[2702955]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Mar 23 18:26:36 node1 etcd[2702955]: Loading server configuration from "/var/snap/etcd/common/etcd.conf.yml". Other configuration command line flags and environment variables will be ignored if provided.
Mar 23 18:26:36 node1 etcd[2702955]: etcd Version: 3.4.22
Mar 23 18:26:36 node1 etcd[2702955]: Git SHA: Not provided (use ./build instead of go build)
Mar 23 18:26:36 node1 etcd.etcd[2702955]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Mar 23 18:26:36 node1 etcd[2702955]: Go Version: go1.19.4
Mar 23 18:26:36 node1 etcd[2702955]: Go OS/Arch: linux/amd64
Mar 23 18:26:36 node1 etcd[2702955]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Mar 23 18:26:36 node1 etcd[2702955]: the server is already initialized as member before, starting as etcd member...
Mar 23 18:26:36 node1 etcd[2702955]: peerTLS: cert = /var/snap/etcd/common/server.crt, key = /var/snap/etcd/common/server.key, trusted-ca = /var/snap/etcd/common/ca.crt, client-cert-auth = true, crl-file = 
Mar 23 18:26:36 node1 etcd[2702955]: The scheme of client url http://127.0.0.1:4001 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 23 18:26:36 node1 etcd[2702955]: The scheme of client url http://127.0.0.1:4001 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Mar 23 18:26:36 node1 etcd[2702955]: name = etcd50
Mar 23 18:26:36 node1 etcd[2702955]: data dir = /var/snap/etcd/current
Mar 23 18:26:36 node1 etcd[2702955]: member dir = /var/snap/etcd/current/member
Mar 23 18:26:36 node1 etcd[2702955]: heartbeat = 100ms
Mar 23 18:26:36 node1 etcd[2702955]: election = 1000ms
Mar 23 18:26:36 node1 etcd[2702955]: snapshot count = 10000
Mar 23 18:26:36 node1 etcd[2702955]: advertise client URLs = https://192.168.1.222:2379
Mar 23 18:26:36 node1 etcd[2702955]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
Mar 23 18:26:36 node1 systemd[1]: snap.etcd.etcd.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 18:26:36 node1 systemd[1]: snap.etcd.etcd.service: Failed with result 'exit-code'.
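That last etcd error explains the charm-side “connection refused”: etcd exits before it ever binds the client port, so nothing is listening on 2379. A quick probe (address and port taken from the log above) confirms it:

```python
import socket

# Client URL advertised in the log above; "connection refused" means
# the etcd process died before binding this port.
with socket.socket() as s:
    s.settimeout(2)
    try:
        s.connect(('192.168.1.222', 2379))
        print('port open')
    except OSError as e:
        print('not listening:', e)
```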

I am at my wit’s end. Any pointers? When and how should a newly added etcd unit learn about its peers?
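For reference, this is how plain etcd (outside the charm) learns about peers when joining an existing cluster; a sketch using the same v2 etcdctl as the charm, where 192.168.1.10 is a hypothetical healthy-leader address:

```python
import subprocess

env = {'ETCDCTL_API': '2',
       'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt',
       'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt',
       'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key'}

# Step 1: an existing, healthy member announces the newcomer
# (192.168.1.10 is a hypothetical leader address).
subprocess.run(['/snap/bin/etcd.etcdctl',
                '--endpoint', 'https://192.168.1.10:2379',
                'member', 'add', 'etcd50', 'https://192.168.1.222:2380'],
               env=env, check=True)

# Step 2: the new node then starts with
#   initial-cluster: <the member list printed by step 1>
#   initial-cluster-state: existing
# which the charm is supposed to render into
# /var/snap/etcd/common/etcd.conf.yml before starting the snap.
```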

This is the wildest goose chase I’ve had since I started using Juju in 2017, and that’s saying a lot.

The line

unit-etcd-50: 19:21:58 ERROR unit.etcd/50.juju-log ['/snap/bin/etcd.etcdctl', '--endpoint', 'https://192.168.1.222:2379', 'member', 'list']

is the main interesting thing. The endpoint IP there is the node itself, but it should be referencing the leader! So I looked up in the charm’s source where leader_set is actually called. Doing a config change on port and setting it to 22379 seemed like a convenient way to trigger that code path.

This triggered a new bug, as that config option is not actually used for the leader_address; it uses the management_port in that case.
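Paraphrasing the mismatch with hypothetical helper names (this is not the charm’s literal code): the client URL and the published leader address end up built from different config options, so changing port alone makes them disagree:

```python
# Paraphrase of the mismatch (hypothetical helper names, not the
# charm's exact code).
def advertised_client_url(host, config):
    # The unit serves clients on the `port` config option ...
    return 'https://{}:{}'.format(host, config['port'])

def leader_address(host, config):
    # ... but the published leader address is built from
    # `management_port`, so changing `port` to 22379 leaves
    # the two out of sync.
    return 'https://{}:{}'.format(host, config['management_port'])

config = {'port': 22379, 'management_port': 2380}
print(advertised_client_url('192.168.1.222', config))  # ...:22379
print(leader_address('192.168.1.222', config))         # ...:2380
```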

That was another rabbit hole that produced many more errors and, in my case, ended up deploying a disjoint single-instance etcd. Consequently, when that new etcd node got leadership, my whole Kubernetes cluster went down the drain.

I finally managed to set that right with a monkey patch, then redeployed the etcd unit, and after many hours I got a healthy quorum back.
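For completeness, the shape of that patch, though not my literal diff (etcd_lib and leader_address are hypothetical names standing in for the real charm code):

```python
# Sketch of the kind of monkey patch I mean (hypothetical module and
# function names): override the broken address builder on the unit
# before the reactive handlers run, so the published leader address
# uses the real client port.
import etcd_lib  # hypothetical charm module

def fixed_leader_address(host, config):
    # Use the actual client `port` instead of `management_port`.
    return 'https://{}:{}'.format(host, config['port'])

etcd_lib.leader_address = fixed_leader_address
```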

Is changing things like port numbers even integration tested? The charm’s state machine is ripe for TLA+.