Hot standby; going from 2 -> 1 controller machines

My previous post from roughly a year ago died silently: https://discourse.charmhub.io/t/standby-ha-does-not-work/9224

I ran into the above issue again. The documentation says I have a hot-standby controller:

The output to juju controllers --refresh now becomes as below, where the HA column says 1/2 – that is, there is now a single active controller, with the remaining controller on standby.

Controller  Model    User   Access     Cloud/Region         Models  Machines    HA  Version
aws-ha*     default  admin  superuser  aws/us-east-1             2         2   1/2  2.4-beta2

I have that exact case, but if the active controller goes down, the cluster is unresponsive. That doesn’t seem like a hot standby.

The question is: is the documentation misleading in stating that a hot standby works even when going from 2 machines to 1? Or should going from 2 to 1 actually work, and is my case different?

2.4-beta2 is a very old and unsupported version.

Juju does have back-stop logic for regressing from HA 2 → 1, but it is difficult to speak to behaviour that far back in time.

The reference to 2.4-beta2 is from the current documentation. If it does not reflect the current expected behaviour, should it still be there?

FTR, I am on version 2.9.46.

My mistake. I didn’t see that it was a quote from our docs. I’ll get that looked at.

When going from 2 machines to 1, was this done by removing the machine with Juju, or was the instance terminated out-of-band?

If we lose a single machine from HA-3, Juju will remove the voting rights from one of the machines (thus it becomes just a replication target).

If Juju sees the remaining machine is gone, it should do cluster maintenance again to make the single DB work.
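You can usually see whether that maintenance has happened by asking Juju about its controller machines. A minimal check, assuming a 2.9-era client (the exact field names in the output may differ between versions):

    # Show the controller machines and what Juju thinks their HA status is.
    juju show-controller            # look at the controller-machines section (ha-status per machine)
    juju status -m controller       # machine-level view of the controller model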

If you’ve got into a situation where you need to start a remaining Juju controller, but that maintenance was not done when Juju was up, you are in a pickle, and need surgery to start Mongo.

I can see from the logs that this node still thinks it is in a cluster with other members.
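If you want to confirm that from Mongo itself rather than from the Juju logs, here is a rough sketch for a 2.9-era controller. The connection details are assumptions from memory (port 37017, self-signed TLS, credentials taken from the machine agent's agent.conf, mongo user equal to the agent's tag), and on installs that use the juju-db snap the shell is juju-db.mongo rather than mongo, so verify all of this on your own system first:

    # On the surviving controller machine: the state-server credentials live in the machine
    # agent's config (adjust machine-0 if your controller machine has a different number).
    sudo grep -E '^(tag|statepassword):' /var/lib/juju/agents/machine-0/agent.conf

    # Connect to Juju's MongoDB and dump the replica set status. The dead controller will
    # still be listed as a member, typically marked as not reachable/healthy.
    mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin \
          -u machine-0 -p '<statepassword from agent.conf>' \
          localhost:37017/juju --eval 'printjson(rs.status())'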

“Hot Standby” in this condition does not mean “if one machine completely fails it will automatically switch to the backup”; it means that if you choose to remove one machine, Juju can do the right thing and keep things working. The issue is that with HA=2 there is no way to detect “split brain”, so when a machine stops talking, all we can do is pause and wait for a human to come in and tell us that we are the only one left. That has to be done by connecting directly to Mongo and updating its replica set to indicate that the other machine is gone.
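For the record, that “surgery” amounts to forcing Mongo's replica set configuration down to just the surviving member. Continuing the hedged sketch from above (same assumed connection details; back up /var/lib/juju and the database first, and check via rs.status() which member index is actually the survivor):

    # Shrink the replica set config to the surviving member only and force it in.
    # force: true is needed because the set has no primary, so a normal reconfig is refused.
    mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin \
          -u machine-0 -p '<statepassword from agent.conf>' localhost:37017/juju --eval '
            cfg = rs.conf();
            cfg.members = [cfg.members[0]];   // keep only the surviving member (verify the index first!)
            printjson(rs.reconfig(cfg, {force: true}));
          '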

In HA=3, Juju and Mongo can detect quorum and know that no other machine is going to be responding to requests thinking that it holds quorum, so Juju can continue operating when that machine goes down. However, you still need to bring it back, because losing another machine means we again cannot tell whether that machine is just unreachable from us or completely gone.
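That is also why the practical advice is to run an odd number of controllers, and to get back to three as soon as you lose one. Assuming the controller is otherwise healthy, that is simply:

    # Ask Juju to bring the controller back up to three machines so quorum can be re-established.
    juju enable-ha -n 3
    # Then watch until the new machines come up (juju show-controller reports per-machine ha-status).
    juju status -m controller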

Note that if you go in and run juju remove-machine 0, that is very different from aws terminate id-XXXX. In the former, you have told Juju to do what it needs to take the machine out of service and to no longer consider it part of the quorum set. In the latter, Juju cannot tell the difference between a net split (your firewall got configured to block traffic between machine 1 and machine 0) and the machine being fully dead and gone.
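For comparison, a hedged sketch of the two paths (machine numbers are illustrative):

    # The clean path: tell Juju to take controller machine 1 out of service. Juju does the
    # demotion and drops the machine from the quorum set before the instance goes away.
    juju remove-machine -m controller 1

    # The out-of-band path: terminating the instance directly in the cloud bypasses Juju,
    # so from Juju's point of view it is indistinguishable from a network partition:
    #   aws ec2 terminate-instances --instance-ids <instance-id-of-machine-1>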

So there is a “hot spare”, but the general theory of availability (CAP) says that we cannot automatically fail over to it without intervention from another source; otherwise you risk creating 2 machines that both consider themselves primaries because of packet loss between them.

@jameinel that makes a lot of sense, as the CAP theorem is not unknown to me.

That said, I do think I always used juju remove-machine, yet I still ran into situations where the system locked up. But I guess that would be a separate issue.

Should your elaboration be added to the documentation? It would have cleared up a lot of confusion for me, and I definitely would not have tried to dabble with fewer than 3 instances.