Standby HA does not work

I’m in the midst of an upgrade. At one point I needed to upgrade one of my 2 controllers.

The documentation says that an HA value of 1/2 means one controller is active and the other is on standby.

After removing one of the two controllers, the remaining one can no longer start at the MongoDB level:

Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Failed to connect to 192.168.1.203:37017 after 5000ms milliseconds, giving up.
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] New replica set config in use: { _id: "juju", version: 11, protocolVersion: 1, members: [ { _id: 3, host: "192.168.1.203:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "56" }, slaveDelay: 0, votes: 1 }, { _id: 4, host: "192.168.1.185:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 0.0, tags: { juju-machine-id: "60" }, slaveDelay: 0, votes: 0 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5a4033aa6826d45dd1be6395') } }
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] This node is 192.168.1.185:37017 in the config
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] transition to STARTUP2 from STARTUP
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication storage threads
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] transition to RECOVERING from STARTUP2
Mar 26 10:43:13 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.1.203:37017
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication fetcher thread
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication applier thread
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication reporter thread
Mar 26 10:43:13 node3 mongod.37017[20176]: [rsSync] transition to SECONDARY from RECOVERING
Mar 26 10:43:23 node3 mongod.37017[20176]: [replexec-0] Error in heartbeat (requestId: 1) to 192.168.1.203:37017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
Mar 26 10:43:23 node3 mongod.37017[20176]: [replexec-0] Member 192.168.1.203:37017 is now in state RS_DOWN
Mar 26 10:43:28 node3 mongod.37017[20176]: [rsBackgroundSync] waiting for 2 pings from other members before syncing
Mar 26 10:43:33 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.1.203:37017 - NetworkInterfaceExceededTimeLimit: Operation timed out
Mar 26 10:43:33 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.1.203:37017
Mar 26 10:43:38 node3 mongod.37017[20176]: [conn18] end connection 192.168.1.222:33012 (8 connections now open)
Mar 26 10:43:38 node3 mongod.37017[20176]: [replexec-0] Error in heartbeat (requestId: 3) to 192.168.1.203:37017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
Mar 26 10:43:43 node3 mongod.37017[20176]: [rsBackgroundSync] waiting for 2 pings from other members before syncing
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Mar 26 10:43:53 node3 systemd[1]: Stopping juju state database...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: going to close listening sockets...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] removing socket file: /tmp/mongodb-37017.sock
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: removing all drop-pending collections...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: removing checkpointTimestamp collection...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutting down replication subsystems
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication reporter thread
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication fetcher thread
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication applier thread
Mar 26 10:43:53 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.1.203:37017 - NetworkInterfaceExceededTimeLimit: Operation timed out
Mar 26 10:43:53 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.1.203:37017
Mar 26 10:43:53 node3 mongod.37017[20176]: [replexec-0] Error in heartbeat (requestId: 5) to 192.168.1.203:37017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication storage threads
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Shutting down full-time diagnostic data capture
Mar 26 10:43:53 node3 mongod.37017[20176]: [WTOplogJournalThread] oplog journal thread loop shutting down
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTigerKVEngine shutting down
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTiger message [1679827433:687931][20176:0x7f06360b1700], txn-recover: Main recovery loop: starting at 12404/6784
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTiger message [1679827433:780919][20176:0x7f06360b1700], txn-recover: Recovering log 12404 through 12405
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTiger message [1679827433:838389][20176:0x7f06360b1700], txn-recover: Recovering log 12405 through 12405
Mar 26 10:43:54 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: removing fs lock...
Mar 26 10:43:54 node3 mongod.37017[20176]: [signalProcessingThread] now exiting
Mar 26 10:43:54 node3 mongod.37017[20176]: [signalProcessingThread] shutting down with code:0
Mar 26 10:43:54 node3 systemd[1]: Stopped juju state database.
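
For reference, the replica set state can be inspected directly on the surviving controller with something like the following (a sketch only; the mongo binary path and the agent.conf field names are assumptions based on a standard controller machine, run as root):

# read the machine agent's MongoDB credentials from its agent.conf
conf=$(ls /var/lib/juju/agents/machine-*/agent.conf | head -n1)
user=$(grep '^tag:' "$conf" | awk '{print $2}')
pass=$(grep '^statepassword:' "$conf" | awk '{print $2}')
# connect to the juju state database and print the replica set status
/usr/lib/juju/mongo*/bin/mongo 127.0.0.1:37017/juju \
    --authenticationDatabase admin --ssl --sslAllowInvalidCertificates \
    --username "$user" --password "$pass" --eval 'rs.status()'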

Is the documentation wrong? For me this resulted in a non-working cluster, as the controllers would no longer function.

Could you please link to the documentation you are referring to?

You should have 3 controllers when running HA; with 3, only 1 can be down or removed at a time, since the controllers' MongoDB replica set needs a majority of voting members to keep a primary.
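
If it helps: once the remaining controller is healthy again, the usual way back to a supported HA shape is along these lines (command names as in the Juju docs; whether enable-ha can recover a controller that has already lost its MongoDB primary is a separate question):

juju enable-ha -n 3
juju controllers --refresh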

Feel free to join our Mattermost if you need more help.

In https://juju.is/docs/olm/manage-controllers it says:

The output to juju controllers --refresh now becomes as below, where the HA column says 1/2 – that is, there is now a single active controller, with the remaining controller on standby.

Controller  Model    User   Access     Cloud/Region         Models  Machines    HA  Version
aws-ha*     default  admin  superuser  aws/us-east-1             2         2   1/2  2.4-beta2

I’m still wondering if I totally misinterpreted the hot-standby part in the n=2 case.
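
Looking at the replica set config in the log again, the surviving node 192.168.1.185 is the member with priority: 0.0 and votes: 0, so on its own it can never be elected primary. More generally, a MongoDB replica set needs a majority of its voting members to keep a primary:

majority(n) = floor(n/2) + 1
majority(2) = 2  -> losing 1 of 2 voters means no primary
majority(3) = 2  -> 1 of 3 voters can be lost and a primary remains

So with only 2 controllers there is no real hot standby and 3 is the minimum. That is just my reading, happy to be corrected.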

Gently nudging @hpidcock