Controller does not respond to any commands. Model Cache, mongo or something else to blame?

We have an HA controller running Juju 3.3.3 which has developed a very strange problem. A few days ago everything suddenly stopped working: the controller does not respond to any communication at all, and no Juju commands work.

We have seen the following series of symptoms:

  1. There have been no entries in logsink.log for any of the controller units since the 15th. They all stopped at the same time, following a flurry of lines like: machine-0 2024-06-15 23:37:55 ERROR juju.api.watcher watcher.go:95 error trying to stop watcher: hub txn watcher sync error: starting change stream: EOF
  2. In the unit-controller-X.log files we see this:
2024-06-15 23:37:56 ERROR juju.worker.dependency engine.go:695 "migration-minion" manifold worker returned unexpected error: hub txn watcher sync error: starting change stream: EOF
2024-06-15 23:37:56 INFO juju.worker.uniter uniter.go:347 unit "controller/0" shutting down: catacomb 0xc00c02c000 is dying
2024-06-15 23:37:56 ERROR juju.worker.dependency engine.go:695 "uniter" manifold worker returned unexpected error: hub txn watcher sync error: starting change stream: EOF
2024-06-15 23:37:56 ERROR juju.worker.dependency engine.go:695 "hook-retry-strategy" manifold worker returned unexpected error: hub txn watcher sync error: starting change stream: EOF
2024-06-15 23:37:56 ERROR juju.worker.dependency engine.go:695 "meter-status" manifold worker returned unexpected error: hub txn watcher sync error: starting change stream: EOF
2024-06-15 23:37:56 ERROR juju.worker.dependency engine.go:695 "api-address-updater" manifold worker returned unexpected error: hub txn watcher sync error: starting change stream: EOF
2024-06-15 23:37:56 ERROR juju.worker.dependency engine.go:695 "migration-inactive-flag" manifold worker returned unexpected error: hub txn watcher sync error: starting change stream: EOF
2024-06-15 23:37:56 INFO juju.worker.logger logger.go:136 logger worker stopped
  3. In the machine-X.log files we see the same series of entries repeating over and over again:
2024-06-17 11:39:55 ERROR juju.worker.modelcache worker.go:373 watcher error: error loading entities for model b6765da3-cb8b-4fbe-801e-2b36058a6054: failed to initialise backing for models:b6765da3-cb8b-4fbe-801e-2b36058a6054: settings doc "b6765da3-cb8b-4fbe-801e-2b36058a6054:e" not found, getting new watcher
2024-06-17 11:39:55 INFO juju.state.allwatcher allwatcher.go:1819 allwatcher loaded for model "7d20b6e0-c474-4db8-8340-4e39a229a7de" in 17.332259ms
2024-06-17 11:39:55 INFO juju.state.allwatcher allwatcher.go:1819 allwatcher loaded for model "95083272-d26a-420d-8c51-91527216d028" in 18.554343ms
2024-06-17 11:39:55 INFO juju.state.allwatcher allwatcher.go:1819 allwatcher loaded for model "52957a21-50fe-4835-8337-f77deb3ee179" in 21.127774ms
2024-06-17 11:39:55 INFO juju.state.allwatcher allwatcher.go:1819 allwatcher loaded for model 
  4. We have run the juju_engine_report command on the controller instances (how we gathered these reports is sketched just after this list). It reports that most of the workers are not running because their dependencies have not started yet. One worker sticks out:
  multiwatcher:
    inputs:
    - state
    - upgrade-database-flag
    report:
      errors:
      - 'error loading entities for model b6765da3-cb8b-4fbe-801e-2b36058a6054: failed
        to initialise backing for models:b6765da3-cb8b-4fbe-801e-2b36058a6054: settings
        doc "b6765da3-cb8b-4fbe-801e-2b36058a6054:e" not found'
      - 'error loading entities for model b6765da3-cb8b-4fbe-801e-2b36058a6054: failed
        to initialise backing for models:b6765da3-cb8b-4fbe-801e-2b36058a6054: settings
        doc "b6765da3-cb8b-4fbe-801e-2b36058a6054:e" not found'
      - 'error loading entities for model b6765da3-cb8b-4fbe-801e-2b36058a6054: failed
        to initialise backing for models:b6765da3-cb8b-4fbe-801e-2b36058a6054: settings
        doc "b6765da3-cb8b-4fbe-801e-2b36058a6054:e" not found'
      - 'error loading entities for model b6765da3-cb8b-4fbe-801e-2b36058a6054: failed
        to initialise backing for models:b6765da3-cb8b-4fbe-801e-2b36058a6054: settings
        doc "b6765da3-cb8b-4fbe-801e-2b36058a6054:e" not found'
      - 'error loading entities for model b6765da3-cb8b-4fbe-801e-2b36058a6054: failed
        to initialise backing for models:b6765da3-cb8b-4fbe-801e-2b36058a6054: settings
        doc "b6765da3-cb8b-4fbe-801e-2b36058a6054:e" not found'
      num-watchers: 1
      queue-age: 0
      queue-size: 0
      restart-count: 2047
      store-size: 2460
    start-count: 1
    started: "2024-06-17 09:09:18"
    state: started
  5. We have inspected the Mongo database a little (roughly how is sketched after this list). There does appear to be a model with the UUID mentioned in the errors, and it seems to have been force-removed at some point in the past. When we run rs.status() in the mongo shell, it says that the old primary instance has the status “RECOVERING”.
  6. When we ran commands such as juju status --verbose --debug, the client was initially able to connect to the controller instances, but then reported the error model cache: model [UUID YOU WERE QUERYING] did not appear in cache timeout. We have since restarted the controller, after which the API server no longer starts up, so now we get no response at all.
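For completeness, this is roughly how we gathered the engine reports mentioned in point 4. Since the API server is unresponsive we SSH straight to the controller machines rather than going through juju ssh; juju_engine_report is one of the introspection helpers Juju installs on its machines. The address and user below are placeholders for our environment:

```bash
# SSH directly to each controller machine (the Juju API is down, so
# `juju ssh -m controller N` is not an option). Adjust user/address.
ssh ubuntu@<controller-ip>

# On the controller machine: dump the dependency-engine report for the
# machine agent and page through it.
juju_engine_report | less
```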
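And this is roughly how we poked at the database in point 5. It is a sketch rather than a recipe: the juju-db.mongo client is the mongo shell shipped in the juju-db snap on 3.x controllers, the credentials come from the machine agent's agent.conf, and the jmongo helper is just our own convenience wrapper, so adjust paths and machine numbers to your own hosts:

```bash
# Credentials for Juju's MongoDB live in the machine agent's agent.conf
# (adjust the machine number to the controller you are on).
user=$(sudo awk '/^tag:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)
pass=$(sudo awk '/^statepassword:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)

# Convenience wrapper around the snap-shipped mongo shell
# (port 37017 is where Juju runs MongoDB).
jmongo() {
  /snap/bin/juju-db.mongo 127.0.0.1:37017/juju \
    --authenticationDatabase admin --ssl --sslAllowInvalidCertificates \
    --username "$user" --password "$pass" --quiet "$@"
}

# Does the model from the error messages still exist?
jmongo --eval 'printjson(db.models.find({_id: "b6765da3-cb8b-4fbe-801e-2b36058a6054"}).toArray())'

# Is the settings doc the modelcache worker complains about really missing?
jmongo --eval 'printjson(db.settings.find({_id: "b6765da3-cb8b-4fbe-801e-2b36058a6054:e"}).toArray())'

# Replica-set health as seen from this node.
jmongo --eval 'printjson(rs.status())'
```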

Our mitigation efforts have been plagued by uncertainty as to what exactly is broken. We have so far only tried to restart the controller units.

Where do we go from here?

Stop the controller jujud processes, and look at rs.status() again.
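Something along these lines, on each controller machine (the machine number is a placeholder, and the jmongo helper is the same snap-shipped mongo shell wrapper sketched earlier in the thread):

```bash
# Stop the machine agent so jujud stops fighting with the database
# (repeat on every controller, substituting its machine number).
sudo systemctl stop jujud-machine-0.service

# With the agents stopped, ask Mongo for the replica-set status again,
# reusing the credentials/helper from the earlier sketch.
jmongo --eval 'printjson(rs.status())'
```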

First ensure that the IP addresses of the replica-set members correspond with the controller IPs.
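For instance, something like this (again reusing the same helper), comparing Mongo's view of the members against the addresses actually configured on the controller machines:

```bash
# Replica-set members as Mongo sees them: id, host:port, and state.
jmongo --eval 'rs.status().members.forEach(function (m) { print(m._id, m.name, m.stateStr); })'

# Addresses actually present on each controller machine, for comparison.
ip -br addr
```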

Post the output here sans any sensitive data.