With the landing of PR 9116 (Use StatePool for initial database connections), the behaviour of the state transaction watcher worker changed for the controller model.
Origin
Back in the dawn of time, every State object had its own transaction watcher that polled the database every five seconds to look for changes. This was fine when there was only ever one model, but with Juju 2.0 this changed. Initially developers didn’t notice the load because it wasn’t really a problem in most situations. However, the key problem that pushed this change was a ProdStack problem where the API servers would go into a death spiral when someone removed an application in a model. This was due to the 400 or so State objects waking up and scanning mongo for the changes. There are a lot of document changes when an application is removed, particularly if there are many agents. This led to i/o timeout errors from mongo, amongst other things.
Initial watcher work
Back in early Juju 2.3 a change was made to introduce a different type of transaction watcher. This one was owned by the StatePool and read the changes from mongo in a much more adaptive manner. It started by polling very frequently (every 10ms) and backed off, if there were no changes, to a maximum interval of five seconds. The worst-case poll interval therefore matched the existing behaviour, but in general watchers would be notified much closer to the time of the document changing. This transaction watcher is found in state/watcher/txnwatcher.go. It polls mongo regularly to read the changes to the transaction log collection. It then collates the changes and publishes the event on a SimpleHub that is owned by the StatePool. New State objects that are created by the StatePool have a HubWatcher as their watcher worker.
Now back to PR 9116.
This change makes the StatePool the primary object that is opened to connect to the database rather than a single State object. This means that the State instance for the controller model is now getting changes more often than before.
The original transaction watcher and the new txnwatcher both coalesce multiple changes of a single document into a single notification. However, the original would wait for five seconds and then coalesce changes, whereas the new one coalesces events over a much shorter timeframe.
This means that multiple changes to a document that previously may have come through as a single change may now come through as multiple changes. To be clear, if the code cares, it has always had a bug; we are just surfacing it.
Test Suite changes
If you have a test that is interacting with Mongo, you’ll most likely be using one of two base suites:
- JujuConnSuite - this is the one we tell people to stop using
- StateSuite - from the state/testing package
JujuConnSuite
The JujuConnSuite does many things. The key point as far as watchers are concerned is that it is a wall-clock-based test suite. That means that the StatePool is created with clock.WallClock. This clock is then also passed in to each of the State objects and the workers for them.
Watchers will be triggered automatically and the State.StartSync method isn’t needed any more to prod the underlying mongo transaction watcher.
StateSuite
The StateSuite uses a TestClock. This means that time doesn’t advance unless you tell it to. This includes the transaction polling worker. The StartSync method on the State object has been updated to advance the clock by a second if the clock can be advanced (wall clocks can’t).
This does mean that we can control with a lot of certainty when changes to the documents will be coalesced and when they won’t - by advancing the clock between changes.
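The idea of steering coalescing from a test can be sketched without any Juju dependencies. Everything here is hypothetical: manualWatcher is a stand-in for a poll-driven watcher on a test clock, and its advance method plays the role that advancing the clock (for example via StartSync) plays in a real StateSuite test.

```go
package main

import "fmt"

// manualWatcher is a hypothetical stand-in for a poll-driven watcher running
// on a test clock. It is not Juju's implementation.
type manualWatcher struct {
	pending       map[string]bool // documents changed since the last poll
	notifications map[string]int  // notifications delivered per document
}

func newManualWatcher() *manualWatcher {
	return &manualWatcher{
		pending:       make(map[string]bool),
		notifications: make(map[string]int),
	}
}

// change records a document change. Repeated changes to the same document
// before the next poll collapse into one pending entry.
func (w *manualWatcher) change(docID string) {
	w.pending[docID] = true
}

// advance stands in for the test advancing the clock far enough to trigger
// one poll: each pending document yields exactly one notification.
func (w *manualWatcher) advance() {
	for docID := range w.pending {
		w.notifications[docID]++
	}
	w.pending = make(map[string]bool)
}

func main() {
	// Two changes with no clock advance between them: coalesced.
	a := newManualWatcher()
	a.change("unit/0")
	a.change("unit/0")
	a.advance()
	fmt.Println(a.notifications["unit/0"]) // prints 1

	// Advancing the clock between the changes: delivered separately.
	b := newManualWatcher()
	b.change("unit/0")
	b.advance()
	b.change("unit/0")
	b.advance()
	fmt.Println(b.notifications["unit/0"]) // prints 2
}
```

In a real test the same pattern applies: make the changes back to back when the coalesced case is wanted, and advance the clock between them when each change should be observed on its own.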