I’m in the middle of an upgrade, and at one point I needed to upgrade one of my two controllers.
The documentation says that an HA status of 1/2
means that one controller is on standby.
However, after removing one of the two controllers, the second one can no longer start at the MongoDB level:
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Failed to connect to 192.168.1.203:37017 after 5000ms milliseconds, giving up.
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] New replica set config in use: { _id: "juju", version: 11, protocolVersion: 1, members: [ { _id: 3, host: "192.168.1.203:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "56" }, slaveDelay: 0, votes: 1 }, { _id: 4, host: "192.168.1.185:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 0.0, tags: { juju-machine-id: "60" }, slaveDelay: 0, votes: 0 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5a4033aa6826d45dd1be6395') } }
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] This node is 192.168.1.185:37017 in the config
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] transition to STARTUP2 from STARTUP
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication storage threads
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] transition to RECOVERING from STARTUP2
Mar 26 10:43:13 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.1.203:37017
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication fetcher thread
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication applier thread
Mar 26 10:43:13 node3 mongod.37017[20176]: [replexec-0] Starting replication reporter thread
Mar 26 10:43:13 node3 mongod.37017[20176]: [rsSync] transition to SECONDARY from RECOVERING
Mar 26 10:43:23 node3 mongod.37017[20176]: [replexec-0] Error in heartbeat (requestId: 1) to 192.168.1.203:37017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
Mar 26 10:43:23 node3 mongod.37017[20176]: [replexec-0] Member 192.168.1.203:37017 is now in state RS_DOWN
Mar 26 10:43:28 node3 mongod.37017[20176]: [rsBackgroundSync] waiting for 2 pings from other members before syncing
Mar 26 10:43:33 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.1.203:37017 - NetworkInterfaceExceededTimeLimit: Operation timed out
Mar 26 10:43:33 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.1.203:37017
Mar 26 10:43:38 node3 mongod.37017[20176]: [conn18] end connection 192.168.1.222:33012 (8 connections now open)
Mar 26 10:43:38 node3 mongod.37017[20176]: [replexec-0] Error in heartbeat (requestId: 3) to 192.168.1.203:37017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
Mar 26 10:43:43 node3 mongod.37017[20176]: [rsBackgroundSync] waiting for 2 pings from other members before syncing
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Mar 26 10:43:53 node3 systemd[1]: Stopping juju state database...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: going to close listening sockets...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] removing socket file: /tmp/mongodb-37017.sock
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: removing all drop-pending collections...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: removing checkpointTimestamp collection...
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] shutting down replication subsystems
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication reporter thread
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication fetcher thread
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication applier thread
Mar 26 10:43:53 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.1.203:37017 - NetworkInterfaceExceededTimeLimit: Operation timed out
Mar 26 10:43:53 node3 mongod.37017[20176]: [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.1.203:37017
Mar 26 10:43:53 node3 mongod.37017[20176]: [replexec-0] Error in heartbeat (requestId: 5) to 192.168.1.203:37017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Stopping replication storage threads
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] Shutting down full-time diagnostic data capture
Mar 26 10:43:53 node3 mongod.37017[20176]: [WTOplogJournalThread] oplog journal thread loop shutting down
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTigerKVEngine shutting down
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTiger message [1679827433:687931][20176:0x7f06360b1700], txn-recover: Main recovery loop: starting at 12404/6784
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTiger message [1679827433:780919][20176:0x7f06360b1700], txn-recover: Recovering log 12404 through 12405
Mar 26 10:43:53 node3 mongod.37017[20176]: [signalProcessingThread] WiredTiger message [1679827433:838389][20176:0x7f06360b1700], txn-recover: Recovering log 12405 through 12405
Mar 26 10:43:54 node3 mongod.37017[20176]: [signalProcessingThread] shutdown: removing fs lock...
Mar 26 10:43:54 node3 mongod.37017[20176]: [signalProcessingThread] now exiting
Mar 26 10:43:54 node3 mongod.37017[20176]: [signalProcessingThread] shutting down with code:0
Mar 26 10:43:54 node3 systemd[1]: Stopped juju state database.
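For reference, the replica set config in the log above shows that the surviving member (192.168.1.185) has priority: 0.0 and votes: 0, so I assume it cannot elect itself primary once 192.168.1.203 is unreachable. This is roughly how I looked at the replica set state on the surviving node; it is only a minimal sketch, and the credentials and TLS options are placeholders (in my case taken from the machine agent's agent.conf), not values shown anywhere above:

from pprint import pprint
from pymongo import MongoClient

# ASSUMPTIONS (not from the log): the Juju mongod on 37017 uses TLS with a
# self-signed certificate and requires credentials from agent.conf; the
# username/password below are placeholders to substitute yourself.
client = MongoClient(
    "mongodb://192.168.1.185:37017/",
    username="PLACEHOLDER_USER",        # hypothetical placeholder
    password="PLACEHOLDER_PASSWORD",    # hypothetical placeholder
    authSource="admin",
    tls=True,
    tlsAllowInvalidCertificates=True,   # self-signed certificate
    directConnection=True,              # talk to this node even with no primary
)

status = client.admin.command("replSetGetStatus")   # current member states
config = client.admin.command("replSetGetConfig")   # votes/priority per member
pprint([(m["name"], m["stateStr"]) for m in status["members"]])
pprint([(m["host"], m["votes"], m["priority"]) for m in config["config"]["members"]])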
Is the documentation wrong? For me this resulted in a non-working cluster, because the controllers would no longer function.