ceph-mon: one unit lost quorum after reboot, other 2 fine

Had some issues with one controller node that needed to be rebooted due to running out of RAM (upgrades are on the way).

Since rebooting that controller node, all services recovered except the ceph-mon unit that was on that host. Repeated restarts of the unit have not helped. The other 2 ceph-mon units appear fine.
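For reference, the restarts were done roughly like this (the mon hostname below matches my unit's container; it will differ on other deployments):

```shell
# Restart the mon daemon inside the unit's container.
# ceph-mon uses a systemd template unit named after the mon id/hostname.
juju ssh ceph-mon/2 'sudo systemctl restart ceph-mon@juju-e6c4b6-2-lxd-2.service'

# Check whether the service came back up
juju ssh ceph-mon/2 'sudo systemctl status ceph-mon@juju-e6c4b6-2-lxd-2.service'
```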

`juju status` shows:

```
App       Version  Status   Scale  Charm     Channel      Rev  Exposed  Message
ceph-mon  18.2.4   blocked  2/3    ceph-mon  reef/stable  229  no       Unit not clustered (no quorum)
```

Log output from the affected mon:

```
2025-02-18T09:13:59.709+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
2025-02-18T09:14:00.021+0000 7fe7ac4ab640 -1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 get_health_metrics reporting 30 slow ops, oldest is log(1 entries from seq 1 at 2025-02-18T09:03:11.318704+0000)
2025-02-18T09:14:01.073+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
2025-02-18T09:14:02.741+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
2025-02-18T09:14:02.753+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
```

It looks like ceph-mon/2 isn't able to rejoin the cluster. A couple of things to look at:

  • Can ceph-mon/2 still reach the other mons over the network? Any port filtering going on, or IP address or MTU changes?
  • What is the output of `sudo ceph -s` (run from another mon)?
  • Do the logs on the other mons mention connection attempts from ceph-mon/2?
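A quick sketch of those checks (3300/6789 are Ceph's default mon ports; the placeholder address and unit names are illustrative and need substituting for your deployment):

```shell
# From the stuck unit: can we reach the other mons' ports? (v2 = 3300, v1 = 6789)
juju ssh ceph-mon/2 'nc -zv <other-mon-ip> 3300'
juju ssh ceph-mon/2 'nc -zv <other-mon-ip> 6789'

# From a healthy mon: overall cluster state and current quorum membership
juju ssh ceph-mon/0 'sudo ceph -s'
juju ssh ceph-mon/0 'sudo ceph quorum_status --format json-pretty'

# Ask the stuck mon directly via its admin socket (works without quorum)
juju ssh ceph-mon/2 'sudo ceph daemon mon.<mon-id> mon_status'
```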

The network looked fine, no IP or MTU changes. In the end I deployed a new unit to the same machine and deleted the broken one. :confused:
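For the record, the replace-the-unit route looks roughly like this (machine number 2 is inferred from my unit's container name, juju-e6c4b6-2-lxd-2; adjust to your model):

```shell
# Add a fresh ceph-mon unit in a new container on the same machine
juju add-unit ceph-mon --to lxd:2

# Once the new unit has joined quorum, drop the broken one
juju remove-unit ceph-mon/2

# Watch the new unit settle
juju status ceph-mon
```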