Deferring Events: Details and Dilemmas

What happens when a Charm is not yet ready to process an event? For example, a charm might wish to handle a relation_joined event by populating the relation with information about how to connect to its workload. But what if the workload services are still in the middle of initializing when the charm receives the event?

We recommend that charm authors call event.defer() in these cases (usually followed by a return).
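The shape of the pattern, sketched with a stand-in event object (FakeEvent and the workload_ready flag are illustrative stand-ins, not part of the ops API):

```python
class FakeEvent:
    """Stand-in for an ops event object; real events expose defer()
    the same way. Purely illustrative."""
    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def on_relation_joined(event, workload_ready):
    """Defer-then-return: bail out early if the workload is not ready,
    asking Juju to re-emit the event on a later dispatch."""
    if not workload_ready:
        event.defer()
        return "deferred"
    return "populated relation data"
```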

There are some consequences to this pattern, however. Authors sometimes need to be mindful of how things work under the hood in order to understand how a charm will behave in production.

There are three kinds of events that the Operator Framework handles:

  1. Juju hooks. For example: install, relation-joined
  2. Events generated by the cloud service. For now, pebble-ready is the only cloud event that Juju supports, but there will be more in the future.
  3. Custom events.

The Operator Framework does not have a continuously running main loop that will process deferred or custom events on a specific schedule. Instead, Ops is invoked by the Juju Agent whenever it receives an event from the Juju Controller. This means that deferred events won’t be re-run until the next hook event or cloud event arrives. In some cases, a deferred event will need to await the arrival of the next update-status hook.

By default, the update-status interval is five minutes, so this may lead to a five minute delay in processing the deferred event. Despite this potential delay, it is not good practice to block event execution by tailing a log or executing a sleep in the middle of an event handler.

Often, Juju’s event model provides inherent workarounds for a delay.

For example, let’s take a look at a common pattern: configuring a service when the config-changed hook fires.

def _on_config_changed(self, event):
    if not self._container.can_connect():
        event.defer()
        return
    self._set_api_key(self.config["api-key"])

Assume that _set_api_key does something sensible, like writing to a file, which will trigger the workload service to dynamically load its configuration. What happens the very first time config-changed fires after the install hook, possibly before the workload container becomes ready?
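Such a helper might be sketched like this (the config path and file format are invented for the sketch):

```python
from pathlib import Path

def set_api_key(key, config_path="/etc/myapp/api_key.conf"):
    """Hypothetical _set_api_key: write the key to a file that the
    workload watches, so it reloads its configuration dynamically."""
    path = Path(config_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"api_key = {key}\n")
```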

In that particular case, we know that Juju will dispatch a pebble-ready event when the charm’s container is up and running. pebble-ready will trigger our deferred event at exactly the moment that it can be run, with no delay.

That said …

The Deferrer’s Dilemmas: Consequences of deferring events

Dilemma #1: Lag in Execution

What if the service takes a little while to start, in a way that is not visible to an init daemon like pebble? Rabbitmq, for example, has some work to do when it first starts, and it is wise to do a liveness check before interacting with it. Here’s code that handles cases like this:

def _on_config_changed(self, event):
    if not self._container.can_connect() or not self._liveness_check():
        event.defer()
        return
    self._set_api_key(self.config["api-key"])
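A minimal port-ping version of that liveness check could look like this (the host, port, and timeout defaults are illustrative, not values the framework supplies):

```python
import socket

def liveness_check(host="localhost", port=5672, timeout=1.0):
    """Return True if something is accepting TCP connections on
    host:port. A real charm would prefer a protocol-level health
    endpoint when the workload offers one."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```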

Assume that _liveness_check pings a port that returns a reliable status indicator (ideal), or greps logs for a “ready” message (less than ideal, but a perfectly acceptable approach). Since we are waiting on two conditions, and pebble-ready may fire before the second condition is true, we have a decision to make. Defer anyway, or wait explicitly?

The answer is almost always to defer. _liveness_check is most likely to return False during an initial deployment, when the Juju agent will receive and process a flurry of relation-joined hooks, upon which our deferred event can piggyback.

An inline sleep might appear to make things more efficient in a test environment, but sleeps are especially dangerous in production environments. Noisy neighbors might slow down even very simple calls, and it is difficult to expose slow inline calls to human operators in a way that is transparent to the Juju model.

In general, we recommend deferring hooks. This requires the least code, and the Charm will automatically become more efficient as Juju expands its awareness of cloud events, with no further development effort on the charm author’s part.

Note: in the very simple case where a Charm author is trying to test a service – e.g. loki – and there exists a test charm which needs to wait for the loki service to be ready, there usually aren’t enough events in the model to quickly unstick a deferred action. There are three fixes (though we only recommend fix #3):

  1. Set a high frequency for update-status in a testing environment. This is usually an anti-pattern, because it runs the risk of masking other performance problems.
  2. Start a ping pong of updated relation data. This is also an anti-pattern; relations are not fit to serve as a bus for what is essentially inter-process communication.
  3. Use dispatch to wake the framework after a delay (see Manually Dispatching Events below). This requires the participation of a client computer, running outside of the model. In test environments, this computer can be the same one that is driving the test.

In the future, “workload” events should become available, as a better resolution to the dilemma.

Dilemma #2: Out of order events

When a charm defers an event, the event is added to a queue and re-emitted at the start of the next Ops run, before the event that triggered that run. This is done to preserve ordering whenever possible, but it can cause two issues:

A deferred event may fire before the event that will “fix” it.

In these cases, there may be an issue with the charm’s logic, and it may be better to refactor the charm to better reflect the order of operations in the Juju cloud.

For example, the Juju controller emits an install event for every charm, followed by a config_changed event. If logic in the install event cannot complete until after config_changed has been triggered, it is probably necessary to move that logic into the config_changed handler.
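One common way to make that refactor safe is to route the shared work through a single idempotent configure step, sketched here with a plain class rather than the real ops base class (method and attribute names are illustrative):

```python
class Charm:
    """Sketch of the refactor: install does only install-specific work,
    while configuration lives in an idempotent step driven by
    config-changed, so event ordering no longer matters."""
    def __init__(self):
        self.configured = False
        self.log = []

    def _on_install(self, event):
        # Only work that genuinely belongs to install goes here.
        self.log.append("install")

    def _on_config_changed(self, event):
        self.log.append("config-changed")
        self._configure()

    def _configure(self):
        # Idempotent: safe to run on every config-changed event.
        self.configured = True
```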

Deferring one event does not prevent other events in the queue from being processed. One consequence is that a twice-deferred event may be executed outside of the ordering contracts that Juju makes.

For example, install is always followed by config_changed. If a charm defers the install handler once, it will execute before any config_changed handlers. However, if the install event is deferred again, the config_changed event may be processed, and the install hook would then fire in something other than the expected order.
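The re-emission behaviour can be modelled with a toy queue (this illustrates the ordering contract, not the framework’s actual implementation):

```python
def run_dispatch(deferred_queue, incoming_event, handler):
    """Toy model of one Ops invocation: previously deferred events are
    re-emitted first, in order, then the incoming event runs. Any event
    whose handler returns False is deferred again."""
    executed, next_queue = [], []
    for ev in deferred_queue + [incoming_event]:
        if handler(ev):
            executed.append(ev)
        else:
            next_queue.append(ev)
    return executed, next_queue
```

Running this model with install deferring twice shows config-changed executing first, outside the usual install-then-config-changed order.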

Manually Dispatching Events

Any machine or container hosting a charm has a dispatch script, which can be used to manually invoke an event:

juju run -u {{some-charm}}/0 "JUJU_DISPATCH_PATH=hooks/{{some-hook}} ./dispatch"

Replace {{some-charm}} with the name of a charm, and {{some-hook}} with update-status, with the name of the hook to retry, or with the name of a custom event.

The dispatch command can also be run while ssh’ed into the unit. Set the working directory to /var/lib/juju/agents/{{charm-name}}-0/charm in that case, and invoke juju-run instead of juju run. dispatch cannot be run from within Ops, however – a Charm cannot trigger its own dispatch.


My comment from the draft doc to @manadart: I spent Friday playing around with workarounds [for the “lag” issue]. The juju agent won’t allow dispatch to be called from within a hook’s context. I tried to be clever and use subprocess to launch a child process that sleeps, then calls dispatch, but the Juju agent appears to be cleverer than I am. The child process fails with a complaint about being in a hook’s context.

Can you think of a better way to do this? Keeping in mind that this needs to run in a sidecar, where we don’t necessarily have an init daemon like systemd.

TODO: drop in some charts to help folks visualize.

(See @ppasotti’s existing charts for details.)

I’m wrinkling my nose a bit about circumventing Juju like this, but you can do env.pop("JUJU_CONTEXT_ID") to pretend, as far as Juju is concerned, that you’re not calling from another hook context.
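Something along these lines (the charm directory, hook name, and delay are illustrative, and whether the agent accepts the detached child is exactly what’s in question):

```python
import os
import subprocess

def dispatch_env():
    """Copy the current environment, dropping the hook-context marker so
    the dispatch call doesn't look like nested hook execution to Juju."""
    env = dict(os.environ)
    env.pop("JUJU_CONTEXT_ID", None)
    return env

def schedule_dispatch(hook, delay=30,
                      charm_dir="/var/lib/juju/agents/unit-mycharm-0/charm"):
    """Fire-and-forget child that sleeps, then invokes ./dispatch.
    charm_dir and the unit name are hypothetical."""
    cmd = f"sleep {delay} && JUJU_DISPATCH_PATH=hooks/{hook} ./dispatch"
    return subprocess.Popen(["/bin/sh", "-c", cmd], cwd=charm_dir,
                            env=dispatch_env(), start_new_session=True)
```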


I tried something like that, and it didn’t work. I suspect that the failure had to do with the dispatch script being fired while the parent process was still processing other hooks. That’s a finicky thing to solve, and is definitely a potential source of the “bad smell” you were detecting …

I’ll play around with it a little bit more, to see if I can put together a minimally stinky version. :slight_smile: