OpenStack ceph-mon hook failed

Hi,

I'm trying to do a clean, fresh OpenStack deployment based on the stable bundle from the Git repo (openstack-bundles/stable/openstack-base/bundle.yaml). The deployment appears to be stuck on ceph-mon/0 and ceph-mon/1.

   Unit                         Workload     Agent      Machine  Public address  Ports               Message
   ceph-mon/0                   error        idle       0/lxd/0  10.30.0.81                          hook failed: "mon-relation-changed"
   ceph-mon/1                   error        idle       1/lxd/0  10.30.0.82                          hook failed: "mon-relation-changed"
   ceph-mon/2*                  maintenance  executing  2/lxd/0  10.30.0.67                          Bootstrapping MON cluster
   ceph-osd/0                   waiting      idle       0        10.30.0.63                          Incomplete relation: monitor
   ceph-osd/1                   waiting      idle       1        10.30.0.62                          Incomplete relation: monitor
   ceph-osd/2*                  waiting      idle       2        10.30.0.64                          Incomplete relation: monitor
   ceph-radosgw/0*              waiting      idle       0/lxd/1  10.30.0.74      80/tcp              Incomplete relations: mon

When I look at juju debug-log --include ceph-mon/0, I get this:

unit-ceph-mon-0: 17:29:07 INFO unit.ceph-mon/0.juju-log mon:0: Making dir /var/lib/ceph/mon/ceph-juju-a1b0c7-0-lxd-0 ceph:ceph 755
unit-ceph-mon-0: 17:29:07 DEBUG unit.ceph-mon/0.mon-relation-changed creating /var/lib/ceph/tmp/juju-a1b0c7-0-lxd-0.mon.keyring
unit-ceph-mon-0: 17:29:07 DEBUG unit.ceph-mon/0.mon-relation-changed added entity mon. auth(key=AQBv1TtjtNmVChAAbD2smDrDflB6OyBwf07tWQ==)
unit-ceph-mon-0: 17:29:07 DEBUG unit.ceph-mon/0.mon-relation-changed added 1 caps to entity mon.
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed Traceback (most recent call last):
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed   File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/mon-relation-changed", line 1362, in <module>
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed     hooks.execute(sys.argv)
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed   File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/charmhelpers/core/hookenv.py", line 963, in execute
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed     self._hooks[hook_name]()
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed   File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/mon-relation-changed", line 492, in mon_relation
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed     if attempt_mon_cluster_bootstrap():
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed   File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/mon-relation-changed", line 504, in attempt_mon_cluster_bootstrap
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed     ceph.bootstrap_monitor_cluster(leader_get('monitor-secret'))
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed   File "/var/lib/juju/agents/unit-ceph-mon-0/charm/lib/charms_ceph/utils.py", line 1331, in bootstrap_monitor_cluster
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed     _create_monitor(keyring,
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed   File "/var/lib/juju/agents/unit-ceph-mon-0/charm/lib/charms_ceph/utils.py", line 1363, in _create_monitor
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed     subprocess.check_call(['ceph-mon', '--mkfs',
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed   File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed     raise CalledProcessError(retcode, cmd)
unit-ceph-mon-0: 17:29:08 WARNING unit.ceph-mon/0.mon-relation-changed subprocess.CalledProcessError: Command '['ceph-mon', '--mkfs', '-i', 'juju-a1b0c7-0-lxd-0', '--keyring', '/var/lib/ceph/tmp/juju-a1b0c7-0-lxd-0.mon.keyring']' died with <Signals.SIGILL: 4>.
unit-ceph-mon-0: 17:29:08 TRACE juju.worker.uniter.context checking for reboot request
unit-ceph-mon-0: 17:29:08 ERROR juju.worker.uniter.operation hook "mon-relation-changed" (via explicit, bespoke hook script) failed: exit status 1
unit-ceph-mon-0: 17:29:08 DEBUG juju.machinelock created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5

Something seems to go wrong when ceph-mon --mkfs … is executed.

I tried to execute that command manually on the ceph-mon/0 node, but it just replied with “Illegal instruction”.

The same applies to ceph-mon/1.
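The “Illegal instruction” message and the `Signals.SIGILL: 4` in the traceback are the same event: the process was killed by a signal instead of exiting normally. A minimal Python sketch of how that can be told apart from an ordinary non-zero exit follows. The helper names `describe_exit` and `run_and_describe` are made up for illustration and are not part of the charm.

```python
import signal
import subprocess
from typing import List


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means "killed by signal -returncode".
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by unknown signal {-returncode}"
    return f"exited with status {returncode}"


def run_and_describe(cmd: List[str]) -> str:
    """Run cmd and report how it ended, without raising on failure."""
    result = subprocess.run(cmd, check=False)
    return describe_exit(result.returncode)
```

Passing the ceph-mon command from the traceback to `run_and_describe` on an affected node should report a signal, not an exit status, which points at the binary and the CPU, not at the charm's configuration.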

ceph-mon/2 seems to be fine; it is in “Waiting for quorum to be reached”.

Any idea what's wrong? Have I missed some requirement on the nodes?

BR/Patrik

Can you share more details about the hardware you’re running on? Ceph’s bug tracker suggests that particular CPUs have historically triggered illegal instruction errors.

All nodes are Dell PowerEdge R415.

ceph-mon/0 - AMD Opteron 4133, 32 GB RAM.
ceph-mon/1 - AMD Opteron 4133, 32 GB RAM.
ceph-mon/2 - AMD Opteron 4386, 32 GB RAM.

If my bit of reading is correct, those 4133s are ~12 years old; there are both older and current bugs in Ceph that seem to relate to running on older hardware, including this fairly new one for radosgw: https://tracker.ceph.com/issues/55859
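One way to narrow this down is to compare the CPU feature flags of a failing node against the working one. The sketch below assumes x86 Linux, where /proc/cpuinfo has a `flags` line, and assumes the file contents have been saved from each node. The function names are illustrative only, and the thread does not establish which instruction actually triggers the SIGILL.

```python
from typing import Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found (e.g. non-x86 or empty input)


def missing_flags(failing: str, working: str) -> Set[str]:
    """Flags present on the working node but absent on the failing one."""
    return cpu_flags(working) - cpu_flags(failing)
```

Any flags reported by `missing_flags` for an Opteron 4133 node versus the Opteron 4386 node would be candidates for instructions the Ceph build uses but the older CPU lacks.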

Hi,

Thanks for the digging and feedback. The platforms are ~9 years old, but perhaps we bought the last round before they were discontinued.

Alright, then I will have to add a noCEPH tag to those nodes so that they are not selected for Ceph applications/models. Perhaps this is something Juju could check at deployment time (if machine[CPU] in list of BADcpus, then next).
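The suggested check is not an existing Juju feature, but the idea can be sketched in Python. The blocklist `BAD_CPUS`, the function `eligible_machines`, and the node names are hypothetical; the only blocklist entry is the CPU model of the failing nodes in this thread.

```python
from typing import Dict, Iterable, List

# Hypothetical blocklist; the single entry is the model of the failing nodes.
BAD_CPUS = {"AMD Opteron 4133"}


def eligible_machines(
    machines: Dict[str, str], bad_cpus: Iterable[str] = BAD_CPUS
) -> List[str]:
    """Return the names of machines whose CPU model is not on the blocklist."""
    blocked = {model.lower() for model in bad_cpus}
    return [name for name, cpu in machines.items() if cpu.lower() not in blocked]


# Example with made-up node names and the CPU models listed above.
machines = {
    "node0": "AMD Opteron 4133",
    "node1": "AMD Opteron 4133",
    "node2": "AMD Opteron 4386",
}
print(eligible_machines(machines))  # ['node2']
```

A model-name blocklist is the simplest form of the idea; matching on missing CPU feature flags would be more robust, since it would also catch other old models with the same limitation.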

BR/Patrik