ceph-mon: one unit lost quorum after reboot, other 2 fine

Had some issues with one controller node that needed to be rebooted due to running out of RAM (upgrades are on the way).

Since rebooting that controller node, all services recovered except the ceph-mon unit that was on that host. Repeated restarts of the unit have not helped. The other 2 ceph-mon units appear fine.
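For reference, the restarts were done roughly like this (the mon hostname below matches my unit's container; it will differ on other deployments):

```shell
# Restart the mon daemon inside the unit's container.
# ceph-mon uses a systemd template unit named after the mon id/hostname.
juju ssh ceph-mon/2 'sudo systemctl restart ceph-mon@juju-e6c4b6-2-lxd-2.service'

# Check whether the service came back up
juju ssh ceph-mon/2 'sudo systemctl status ceph-mon@juju-e6c4b6-2-lxd-2.service'
```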

`juju status` shows:

```
App       Version  Status   Scale  Charm     Channel      Rev  Exposed  Message
ceph-mon  18.2.4   blocked  2/3    ceph-mon  reef/stable  229  no       Unit not clustered (no quorum)
```

Log output from the affected mon:

```
2025-02-18T09:13:59.709+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
2025-02-18T09:14:00.021+0000 7fe7ac4ab640 -1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 get_health_metrics reporting 30 slow ops, oldest is log(1 entries from seq 1 at 2025-02-18T09:03:11.318704+0000)
2025-02-18T09:14:01.073+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
2025-02-18T09:14:02.741+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
2025-02-18T09:14:02.753+0000 7fe7adcae640  1 mon.juju-e6c4b6-2-lxd-2@0(probing) e2 handle_auth_request failed to assign global_id
```

It looks like ceph-mon/2 isn't able to rejoin the cluster. A couple of things to look at:

  • Can ceph-mon/2 still reach the other mons over the network? Any port filtering going on, or IP address or MTU changes?
  • What is the output of `sudo ceph -s` (run from another mon)?
  • Do the logs on the other mons mention connection attempts from ceph-mon/2?
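A quick sketch of those checks (3300/6789 are Ceph's default mon ports; the placeholder address and unit names are illustrative and need substituting for your deployment):

```shell
# From the stuck unit: can we reach the other mons' ports? (v2 = 3300, v1 = 6789)
juju ssh ceph-mon/2 'nc -zv <other-mon-ip> 3300'
juju ssh ceph-mon/2 'nc -zv <other-mon-ip> 6789'

# From a healthy mon: overall cluster state and current quorum membership
juju ssh ceph-mon/0 'sudo ceph -s'
juju ssh ceph-mon/0 'sudo ceph quorum_status --format json-pretty'

# Ask the stuck mon directly via its admin socket (works without quorum)
juju ssh ceph-mon/2 'sudo ceph daemon mon.<mon-id> mon_status'
```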

The network looked fine, no IP or MTU changes. In the end I deployed a new unit to the same machine and deleted the broken one. :confused:
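For the record, the replace-the-unit route looks roughly like this (machine number 2 is inferred from my unit's container name, juju-e6c4b6-2-lxd-2; adjust to your model):

```shell
# Add a fresh ceph-mon unit in a new container on the same machine
juju add-unit ceph-mon --to lxd:2

# Once the new unit has joined quorum, drop the broken one
juju remove-unit ceph-mon/2

# Watch the new unit settle
juju status ceph-mon
```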