Use Watchers But Know What You're Doing

Part of https://discourse.jujucharms.com/t/read-before-contributing/47, an opinionated guide by William Reade

Juju uses a lot of things called watchers, and they aren’t always
consistent. Most of the time, the word refers to a type with a Changes
channel, from which a single client can receive a stream of events; the
semantics may vary but the main point of a watcher is to represent
changing state in a form convenient for a select loop.

There are two common forms of watcher, distinguished by whether they
implement the interface defined in the ./watcher package, or the one
defined in the ./state package. All your workers should be using the
former, and they should be used roughly like this:

func (w *Worker) loop() error {
    watch, err := w.config.Facade.Watch()
    if err != nil {
        return errors.Trace(err)
    }
    if err := w.catacomb.Add(watch); err != nil {
        return errors.Trace(err)
    }
    for {
        select {
        case <-w.catacomb.Dying():
            return w.catacomb.ErrDying()
        case value, ok := <-watch.Changes():
            if !ok {
                return errors.New("watcher closed channel")
            }
            if err := w.handle(value); err != nil {
                return errors.Trace(err)
            }
        }
    }
}

(note the nice clean responsibility transfer to the catacomb: once Add
succeeds, the watcher is stopped when the worker stops, and any watcher
failure kills the worker with that error)

…but even the state watchers, which close their channels when they
stop and cause a surprising amount of semantic mess by doing so, share a
fundamental and critical feature:

Watchers Send Initial Events

Every watcher has a Changes channel; no watcher should be considered
functional until it’s delivered one event on that channel. That event is
the baseline against which subsequent changes should be considered to be
diffs; it’s the point at which you can read the value under observation
and know for sure that the watcher will inform you at least once if it
changes.

One useful feature of the initial events, exemplified in the watching
worker example above, is that the loop doesn’t need to distinguish
between the first event and any others: every event always indicates
that you should read some state and respond to it. If you’re handling
“initial” data differently to subsequent events you’re almost certainly
doing at least one of them wrong.
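
To make the contract concrete, here's a minimal sketch of a notify
watcher that honours it. The tomb usage is real gopkg.in/tomb.v2; the
source channel stands in for whatever raw "something changed" signal an
actual watcher would subscribe to, and none of the names are taken from
Juju:

import "gopkg.in/tomb.v2"

// notifyWatcher sends one event unconditionally at startup, then one
// (coalesced) event per burst of underlying changes.
type notifyWatcher struct {
    tomb   tomb.Tomb
    source <-chan struct{} // assumed raw "something changed" signal
    out    chan struct{}
}

func newNotifyWatcher(source <-chan struct{}) *notifyWatcher {
    w := &notifyWatcher{source: source, out: make(chan struct{})}
    w.tomb.Go(w.loop)
    return w
}

// Changes is never closed by the watcher; clients stop it via Kill/Wait.
func (w *notifyWatcher) Changes() <-chan struct{} { return w.out }
func (w *notifyWatcher) Kill()                    { w.tomb.Kill(nil) }
func (w *notifyWatcher) Wait() error              { return w.tomb.Wait() }

func (w *notifyWatcher) loop() error {
    // out starts live, so the initial event is delivered before we pay
    // any attention to the source; after each send we go quiet until
    // the source fires again.
    out := w.out
    for {
        select {
        case <-w.tomb.Dying():
            return tomb.ErrDying
        case out <- struct{}{}:
            out = nil
        case <-w.source:
            out = w.out
        }
    }
}

Note that a burst of source events while the client is busy collapses
into a single pending notification, which is exactly what an
at-least-once contract allows.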

A lot of watchers send only struct{}s, indicating merely that the
domain under observation is no longer guaranteed to be the same as
before; several more deliver []strings identifying entities/domains
that are no longer guaranteed to have the same state as before; others
deliver different information, and some even include (parts of) the
state under observation packaged up for client consumption.
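
For orientation, the two most common shapes correspond roughly to the
notify and strings watcher interfaces in the ./watcher package. The
real definitions spell the channel types via named aliases and embed
the worker interface, so treat this as a paraphrase rather than the
actual source:

// Paraphrased, not copied, from the ./watcher package:
type Worker interface {
    Kill()
    Wait() error
}

// A NotifyWatcher says "something in the watched domain may have
// changed; go and look".
type NotifyWatcher interface {
    Worker
    Changes() <-chan struct{}
}

// A StringsWatcher says "these entities may have changed; go and look
// at each of them".
type StringsWatcher interface {
    Worker
    Changes() <-chan []string
}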

This technique is tempting but usually ends up slightly janky in
practice, for a few relatively minor reasons that seem to add up to the
level of actual annoyance:

  • you almost always need a representation for “nothing there right
    now”, which yucks up the type you need to send (whereas with pure
    notifications, an existence change is notified like any other
    change, and nonexistence shows up on read with the same errors
    you’d always have to handle).
  • the more complex the data you send, the harder it is to aggregate
    events correctly at any given layer; trivial notifies, though, can
    be safely compressed at any point and still work as expected (see
    the sketch after this list).
  • any data you send will be potentially degraded by latency, and you
    might need to worry about that in the client; pure notifications
    are easier to get right, because the handler always determines
    what to do by requesting fresh domain state.
  • the more opinionated the data you send, the less useful it is to
    future clients (and the more likely you are to implement almost
    the same watcher 5 times for 5 clients, and to only fix the common
    bug in 3 or 4 of them), and, well, it’s just asking for trouble
    unless you already understand exactly what data you’re going to
    need.
  • you really don’t understand exactly what data you’re going to need,
    and a watcher format change is almost certainly an API change, and
    you don’t need that hassle as well. If you get a notify watcher
    wrong, so it doesn’t watch quite enough stuff, you can easily fix
    the bug by associating a new notify watcher to handle the data you
    missed. (those events might be a tad enthusiastic at times, but your
    clients all signed up for at-least-once – you’re not breaking any
    contracts – and you’re also free to sub in a tighter implementation
    that still-more-closely matches the desired domain when it becomes
    useful to do so.) In short, notifications are a lot easier to tune
    as your understanding grows.
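
As promised above, here's a small sketch of why trivial notifications
compress so well: any intermediate layer can collapse a backlog of
input events into a single pending output event without losing
anything, because every event means the same thing. The function and
channel names are invented for the example:

// coalesce forwards notifications from in to out, collapsing any
// backlog into a single pending event; done stops the loop.
func coalesce(in <-chan struct{}, out chan<- struct{}, done <-chan struct{}) {
    var pending chan<- struct{} // nil while we owe the client nothing
    for {
        select {
        case <-done:
            return
        case _, ok := <-in:
            if !ok {
                return
            }
            pending = out // however many inputs arrive, one output is owed
        case pending <- struct{}{}:
            pending = nil
        }
    }
}

Try writing the equivalent for data-carrying events and you immediately
have to decide how to merge payloads; that decision then has to be made
consistently at every layer that might buffer them.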

Regardless, even if you do end up in a situation where you want to send
data-heavy events, make sure you still send initial events. You’re
pretty free to decide what changes you want to report; but you’re not
free to skip the initial sync that your clients depend on to make use of
you.

State Watchers Are Tricky

For one, they implement the evil watcher interface that closes their
channel, and it’s hard to rearchitect matters to fix this; for another,
they use the other layer of watching I haven’t mentioned yet, and that
drags in a few other unpleasant concerns.

The most important thing to know when writing a state watcher is that
you have to play nice with the underlying substrate (implemented in
state/watcher, and with which you communicate by registering and
unregistering channels); otherwise you can block all its other clients.
Yes, that is both bizarre and terrifying, but there’s not much we can do
without serious rework; for the moment, just make sure you (1) aggregate
incoming watcher events before devoting any processing power to handling
them and (2) keep your database accesses (well, anything that keeps you
out of your core select loop) to an absolute minimum.
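
A sketch of what (1) can look like in practice, with the substrate's
change records and the business-level handling reduced to stand-ins
(the real raw events are state/watcher.Change values, but nothing below
is lifted from the actual code):

import (
    "github.com/juju/errors"
    "gopkg.in/tomb.v2"
)

// change mirrors the shape of the substrate's change events; process
// stands in for whatever business-level handling the watcher does.
type change struct {
    id interface{}
}

// drainAndHandle folds raw events into a pending set, soaks up
// everything already queued, and only then does the comparatively slow
// handling, so the substrate is never left blocked on us.
func drainAndHandle(t *tomb.Tomb, in <-chan change, process func(map[interface{}]bool) error) error {
    for {
        pending := make(map[interface{}]bool)
        // Block until something happens at all.
        select {
        case <-t.Dying():
            return tomb.ErrDying
        case ch := <-in:
            pending[ch.id] = true
        }
        // Soak up anything else already queued before doing real work.
        for drained := false; !drained; {
            select {
            case ch := <-in:
                pending[ch.id] = true
            default:
                drained = true
            }
        }
        // Keep this part -- the only place that might touch the DB --
        // as brief as possible.
        if err := process(pending); err != nil {
            return errors.Trace(err)
        }
    }
}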

This is another reason to implement notification watchers by default –
everything you do in the process of converting the document-level change
notification stream into Something Else increases the risk you run of
disrupting the operation of other watchers in the same system. Merely
turning the raw stream into business-logic-level change notifications is
quite enough responsibility for one type, and there is depressingly
little to be gained from making this process any more complex or
error-prone than it already is.

(Also, mgo has the entertaining property of panicking when used after
its session has been closed; and state watcher lifetimes are not cleanly
tied to the lifetimes of the sessions they would copy if they read from
the DB at just the wrong moment – e.g. while handling an event delivered
just before the underlying watcher was killed, after which we have no
guarantee of DB safety. And the longer you spend reading, the worse it
potentially is. Be careful, and for goodness’ sake don’t write anything
in a watcher.)

The Core Watcher Is A Lie

It’s just polling every 5s and spamming a relevant subset of what it
sees to all the channels the watchers on the next layer up have
registered with it. This is, ehh, not such a big deal – it’s sorta
dirty and inelegant and embarrassing, but it’s worked well enough for
long enough that I think it’s demonstrated its adequacy.
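
Roughly – and this is the shape of the thing, not the real
implementation – it behaves like this:

import "time"

// change stands in for the document-level change records the real
// watcher reads out of the transaction log.
type change struct {
    collection string
    id         interface{}
}

// pollLoop wakes up every pollInterval, works out what changed since
// last time, and fans the relevant subset out to every channel that
// has registered an interest.
func pollLoop(pollInterval time.Duration, readChanges func() []change,
    subscribers func(change) []chan<- change, dying <-chan struct{}) {
    ticker := time.NewTicker(pollInterval)
    defer ticker.Stop()
    for {
        select {
        case <-dying:
            return
        case <-ticker.C:
            for _, ch := range readChanges() {
                for _, out := range subscribers(ch) {
                    // A slow receiver holds up delivery to everyone
                    // else -- hence the earlier warnings.
                    out <- ch
                }
            }
        }
    }
}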

However, it plays merry hell with your tests if they’re using a real
watcher under the hood, because the average test will take much less
than 5s to write a change it expects to see a reaction to, and the
infrastructure won’t consider it worth mentioning for almost a full 5s,
which is too long by far.

So, state.State has a StartSync method that gooses the watcher into
action. If you’re testing a state watcher directly, just StartSync the
state you used to create it; and when triggering syncs in JujuConnSuite
tests, use the suite’s BackingState field to trigger watchers for the
controller model, and go via the BackingStatePool to trigger syncs for
watchers of hosted models. Sorry :frowning:
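
For instance, a directly-tested state watcher might be exercised along
these lines. StartSync is the real state.State method and
workertest.CleanKill is the real test helper; the suite plumbing and
the other helpers are stand-ins for whatever your test actually does:

func (s *exampleSuite) TestWatchesChanges(c *gc.C) {
    // newWatcherUnderTest, makeChange and assertOneChange are
    // hypothetical helpers; substitute whatever your test really uses.
    w := s.newWatcherUnderTest(c)
    defer workertest.CleanKill(c, w)

    // Every watcher owes us an initial event.
    s.State.StartSync()
    s.assertOneChange(c, w)

    // Make a change, then goose the polling substrate rather than
    // waiting out the ~5s interval.
    s.makeChange(c)
    s.State.StartSync()
    s.assertOneChange(c, w)
}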