Problem with MaaS/Masakari Charm

Hi,

I’m not sure this is the right place to ask, so I’m posting it here and will also post the same request on the MAAS Discourse.

I have a strange issue with the latest MAAS release and/or the OpenStack charms (probably Masakari, since that is the only one that interacts with MAAS).
Here is the case: I deployed a working OpenStack cluster with Masakari enabled.
Previously, when I was testing Masakari, I just initiated a “reboot” on a compute node and the node simply restarted.
Now, when I do the same thing, the node stays shut off (these are physical nodes with the MAAS provider).
When I try to start them through MAAS (so through IPMI), the node starts, then as soon as it tries to start PXE it simply shuts itself down; it never even initiates a DHCP/PXE request, it just powers off before the BIOS finishes its initialization phase.

I tried to start those nodes many times through MAAS and it always does the same thing (the nodes are Dell PowerEdge R630s).
The only way to start those nodes is to start them from the iDRAC console.
It might not be a charms-related issue, since there were also new MAAS releases recently, but since there is interaction between the Masakari charm and MAAS, I don't really know where to route this issue.
So any help is gladly appreciated.

I might suggest checking your notifications list to ensure that masakari has completed the failover for the host. It may be that masakari-monitors is still STONITHing the node because instance recovery is still happening, but that seems odd.

Can you see in the machine activity logs that your masakari user (potentially the same user) is powering off the machine? If so, walk through items like setting the node into maintenance mode in masakari, or checking the health of the masakari-monitors pacemaker cluster.

What I’ve definitely found is that you must power on through MAAS: if MAAS thinks the node is supposed to be offline (as a Masakari STONITH will set the status as such), you cannot get it to power on past the PXE boot.

What is the status of the machine in question in MAAS? Is it Deployed/Powered-off, or is it “Ready”, and as such has no reason to be on, because MAAS thinks it’s not supposed to be running its installed OS?
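
If it helps, here is a rough sketch of checking that state and powering the node back on through the MAAS API itself, using python-libmaas (the same client library the stonith plugin uses). The MAAS URL, API key and hostname below are placeholders for your environment:

    # Rough sketch (URL, API key and hostname are placeholders): check the
    # node's status in MAAS and power it on through the MAAS API rather than
    # going through the iDRAC console.
    from maas.client import connect

    client = connect("http://<maas-server>:5240/MAAS/",
                     apikey="<consumer:token:secret>")

    for machine in client.machines.list():
        if machine.hostname == "lab06":
            # Expect DEPLOYED here; a READY node has no OS that MAAS wants running.
            print(machine.hostname, machine.status)
            machine.power_on()  # power on via MAAS/IPMI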

Well, here is what I did :

  • restarted the machine through “juju ssh sudo reboot”
  • the machine reboots as normal, except that it seems to have been STONITHed as you say, because when restarting it doesn’t get past the BIOS init and shuts itself down before even reaching the DHCP/PXE step
  • in MAAS, the machine is marked as Off
  • I tried to start the machine from MAAS and it fails in the same way (shutdown right after the BIOS init, never reaching the DHCP/PXE step)
  • then I started the machine through the iDRAC console (this is a Dell server), and there it works.

You can find attached my pacemaker.log; the machine I rebooted is called “lab06.maas” and you can see the corresponding activity in the log at “Feb 25 12:07:07” (line number: 6901).

I see many errors regarding “logd is not running”; I don’t know whether that is worrying or not.
Just to be clear, this is a test lab with a fresh OpenStack cluster deployed from scratch yesterday, so this is brand new with the latest stable charms (release 21.01 on Focal/Ussuri).

Hope this helps.

Regards.

Looking at the pastebin (thanks!), I can see that pacemaker is attempting to perform a STONITH via the MAAS API, but its invocations are failing. The “logd is not running” bit is not the concerning part of the logs; rather, the concerning part is the tracebacks that follow, which indicate that the STONITH commands are failing:

Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ logd is not runninglogd is not runninglogd is not runninglogd is not runninglogd is not runningTraceback (most recent call last): ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/stonith/plugins/external/maas”, line 380, in <module> ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ sys.exit(map_commands(sys.argv)) ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/stonith/plugins/external/maas”, line 373, in map_commands ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ rc = commandscmd ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/stonith/plugins/external/maas”, line 244, in power_reset ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ machine.power_on() ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3/dist-packages/maas/client/utils/maas_async.py”, line 49, in wrapper ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ result = eventloop.run_until_complete(result) ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3.8/asyncio/base_events.py”, line 616, in run_until_complete ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ return future.result() ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3/dist-packages/maas/client/viscera/machines.py”, line 713, in power_on ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ self._reset(await self._handler.power_on(**params)) ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3/dist-packages/maas/client/bones/__init__.py”, line 316, in call ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ response = await self.bind(**params).call(**data) ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3/dist-packages/maas/client/bones/__init__.py”, line 461, in dispatch ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ response = await session.request( ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3/dist-packages/aiohttp/client.py”, line 504, in _request ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ await resp.start(conn) ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3/dist-packages/aiohttp/client_reqrep.py”, line 847, in start ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ message, payload = await self._protocol.read() # type: ignore # noqa ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ File “/usr/lib/python3/dist-packages/aiohttp/streams.py”, line 591, in read ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ await self._waiter ]
Feb 25 12:09:16 juju-de1a68-3-lxd-7 pacemaker-fenced [31093] (log_op_output) notice: fence_legacy_reboot_1[572338] error output [ aiohttp.client_exceptions.ServerDisconnectedError ]

I suspect what is happening is that the commands are going through on the MAAS server side, but the client is unable to interpret the results due to the error. Pacemaker is continually trying to stonith the node in this scenario, which is likely why you see the node going down.
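
One way to narrow it down would be to issue the same call the plugin makes, directly from a python3 shell on the hacluster unit, and see whether the ServerDisconnectedError reproduces outside of pacemaker. A rough sketch, using the same python-libmaas client the plugin imports (the MAAS URL, API key and hostname are placeholders and should match what the stonith resource is configured with):

    # Rough sketch: reproduce the stonith plugin's power_on call outside pacemaker.
    # The URL and API key are placeholders; use the same credentials the maas
    # stonith plugin is configured with.
    from maas.client import connect

    client = connect("http://<maas-server>:5240/MAAS/",
                     apikey="<consumer:token:secret>")

    machine = next(m for m in client.machines.list() if m.hostname == "lab06")
    try:
        # power_reset() in /usr/lib/stonith/plugins/external/maas ends up calling this
        machine.power_on()
        print("power_on accepted by MAAS")
    except Exception as exc:
        # e.g. aiohttp.client_exceptions.ServerDisconnectedError, as in your log
        print("power_on failed client-side:", repr(exc))

If this fails the same way from a plain shell, that would point at the MAAS API / python-libmaas side rather than at pacemaker itself.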

May I ask what version of MAAS you are using in this deployment?

I see a similar issue reported in https://github.com/maas/python-libmaas/issues/251

Hi, thanks for answering.

I’m using the latest stable MAAS release to date (2.9.2-9164-g.ac176b5c4) with the latest stable Masakari charm (cs:masakari-8).

Regards