Deferring Events: Details and Dilemmas

What happens when a Charm is not yet ready to process an event? For example, a charm might wish to handle a relation_joined event by populating the relation with information about how to connect to its workload. But what if the workload services are still in the middle of initializing when the charm receives the event?

We recommend that charm authors call event.defer() in these cases (usually followed by a return).
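The shape of the pattern, sketched with a stand-in event object (FakeEvent and the workload_ready flag are illustrative stand-ins, not part of the ops API):

```python
class FakeEvent:
    """Stand-in for an ops event object; real events expose defer()
    the same way. Purely illustrative."""
    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def on_relation_joined(event, workload_ready):
    """Defer-then-return: bail out early if the workload is not ready,
    asking Juju to re-emit the event on a later dispatch."""
    if not workload_ready:
        event.defer()
        return "deferred"
    return "populated relation data"
```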

There are some consequences to this pattern, however. Authors sometimes need to be mindful of how things work under the hood in order to understand how a charm will behave in production.

There are three kinds of events that the Operator Framework handles:

  1. Juju hooks. For example: install, relation-joined
  2. Events generated by the cloud service. For now, pebble-ready is the only cloud event that Juju supports, but there will be more in the future.
  3. Custom events.

The Operator Framework does not have a continuously running main loop that will process deferred or custom events on a specific schedule. Instead, Ops is invoked by the Juju Agent whenever it receives an event from the Juju Controller. This means that deferred events won’t be re-run until the next hook event or cloud event arrives. In some cases, a deferred event will need to await the arrival of the next update-status hook.

By default, the update-status interval is five minutes, so this may lead to a five minute delay in processing the deferred event. Despite this potential delay, it is not good practice to block event execution by tailing a log or executing a sleep in the middle of an event handler.

Often, Juju’s event model provides inherent workarounds for a delay.

For example, let’s take a look at a common pattern: configuring a service when the config-changed hook fires.

def _on_config_changed(self, event):
    if not self._container.can_connect():
        event.defer()
        return
    self._set_api_key(self.config["api-key"])

Assume that _set_api_key does something sensible, like writing to a file, which will trigger the workload service to dynamically load its configuration. What happens the very first time config-changed fires after the install hook, possibly before the workload container becomes ready?
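Such a helper might be sketched like this (the config path and file format are invented for the sketch):

```python
from pathlib import Path

def set_api_key(key, config_path="/etc/myapp/api_key.conf"):
    """Hypothetical _set_api_key: write the key to a file that the
    workload watches, so it reloads its configuration dynamically."""
    path = Path(config_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"api_key = {key}\n")
```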

In that particular case, we know that Juju will dispatch a pebble-ready event when the charm’s container is up and running. pebble-ready will trigger our deferred event at exactly the moment that it can be run, with no delay.

That said …

The Deferrer’s Dilemmas: Consequences of deferring events

Dilemma #1: Lag in Execution

What if the service takes a little while to start, in a way that is not visible to an init daemon like pebble? Rabbitmq, for example, has some work to do when it first starts, and it is wise to do a liveness check before interacting with it. Here’s code that handles cases like this:

def _on_config_changed(self, event):
    if not self._container.can_connect() or not self._liveness_check():
        event.defer()
        return
    self._set_api_key(self.config["api-key"])
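A minimal port-ping version of that liveness check could look like this (the host, port, and timeout defaults are illustrative, not values the framework supplies):

```python
import socket

def liveness_check(host="localhost", port=5672, timeout=1.0):
    """Return True if something is accepting TCP connections on
    host:port. A real charm would prefer a protocol-level health
    endpoint when the workload offers one."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```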

Assume that _liveness_check pings a port that returns a reliable status indicator (ideal), or greps logs for a “ready” message (less than ideal, but a perfectly acceptable approach). Since we are waiting on two conditions, and pebble-ready may fire before the second condition is true, we have a decision to make. Defer anyway, or wait explicitly?

The answer is almost always to defer. _liveness_check is most likely to return False during an initial deployment, when the Juju agent will receive and process a flurry of relation-joined hooks, upon which our deferred event can piggyback.

An inline sleep might appear to make things more efficient in a test environment, but sleeps are especially dangerous in production environments. Noisy neighbors might slow down even very simple calls, and it is difficult to expose slow inline calls to human operators in a way that is transparent to the Juju model.

In general, we recommend deferring hooks. This requires the least code, and the Charm will automatically become more efficient as Juju expands its awareness of cloud events, with no further development effort on the charm author’s part.

Note: in the very simple case where a Charm author is trying to test a service – e.g. loki – and there exists a test charm which needs to wait for the loki service to be ready, there usually aren’t enough events in the model to quickly unstick a deferred action. There are three fixes (though we only recommend fix #3):

  1. Set a high frequency for update-status in a testing environment. This is usually an anti-pattern, because it runs the risk of masking other performance problems.
  2. Start a ping pong of updated relation data. This is also an anti-pattern; relations are not fit to serve as a bus for what is essentially inter-process communication.
  3. Use dispatch to wake the framework after a delay (see Manually Dispatching Events below). This requires the participation of a client computer, running outside of the model. In test environments, this computer can be the same one that is driving the test.

In the future, “workload” events should become available, as a better resolution to the dilemma.

Dilemma #2: Out of order events

When a charm defers an event, the event is added to a queue and re-emitted at the start of the next Ops run, before the event that triggered that run. This is done to preserve ordering whenever possible, but it can cause two issues:

A deferred event may fire before the event that will “fix” it.

In these cases, there may be an issue with the charm’s logic, and it may be better to refactor the charm to better reflect the order of operations in the Juju cloud.

For example, the Juju controller emits an install event for every charm, followed by a config_changed event. If logic in the install event cannot complete until after config_changed has been triggered, it is probably necessary to move that logic into the config_changed handler.
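One common way to make that refactor safe is to route the shared work through a single idempotent configure step, sketched here with a plain class rather than the real ops base class (method and attribute names are illustrative):

```python
class Charm:
    """Sketch of the refactor: install does only install-specific work,
    while configuration lives in an idempotent step driven by
    config-changed, so event ordering no longer matters."""
    def __init__(self):
        self.configured = False
        self.log = []

    def _on_install(self, event):
        # Only work that genuinely belongs to install goes here.
        self.log.append("install")

    def _on_config_changed(self, event):
        self.log.append("config-changed")
        self._configure()

    def _configure(self):
        # Idempotent: safe to run on every config-changed event.
        self.configured = True
```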

Deferring one event does not prevent other events in the queue from being processed. One consequence is that a twice-deferred event may be executed outside of the ordering contracts that Juju makes.

For example, install is always followed by config_changed. If a charm defers the install handler once, it will execute before any config_changed handlers. However, if the install event is deferred again, the config_changed event may be processed, and the install hook would then fire in something other than the expected order.
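The re-emission behaviour can be modelled with a toy queue (this illustrates the ordering contract, not the framework’s actual implementation):

```python
def run_dispatch(deferred_queue, incoming_event, handler):
    """Toy model of one Ops invocation: previously deferred events are
    re-emitted first, in order, then the incoming event runs. Any event
    whose handler returns False is deferred again."""
    executed, next_queue = [], []
    for ev in deferred_queue + [incoming_event]:
        if handler(ev):
            executed.append(ev)
        else:
            next_queue.append(ev)
    return executed, next_queue
```

Running this model with install deferring twice shows config-changed executing first, outside the usual install-then-config-changed order.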

Manually Dispatching Events

Any machine or container hosting a charm has a dispatch script, which can be used to manually invoke an event:

juju run -u {{some-charm}}/0 "JUJU_DISPATCH_PATH=hooks/{{some-hook}} ./dispatch"

Replace {{some-charm}} with the name of a charm, and {{some-hook}} with update-status, with the name of the hook to retry, or with the name of a custom event.

The dispatch command can also be run while ssh’ed into the unit. Set the working directory to /var/lib/juju/agents/{{charm-name}}-0/charm in that case, and invoke juju-run instead of juju run. dispatch cannot be run from within Ops, however – a Charm cannot trigger its own dispatch.


My comment from the draft doc to @manadart: I spent Friday playing around with workarounds [for the “lag” issue]. The juju agent won’t allow dispatch to be called from within a hook’s context. I tried to be clever and use subprocess to launch a child process that sleeps, then calls dispatch, but the Juju agent appears to be cleverer than I am. The child process fails with a complaint about being in a hook’s context.

Can you think of a better way to do this? Keeping in mind that this needs to run in a sidecar, where we don’t necessarily have an init daemon like systemd.

TODO: drop in some charts to help folks visualize.

(See @ppasotti’s existing charts for details.)

I’m wrinkling my nose a bit about circumventing Juju like this, but you can do env.pop("JUJU_CONTEXT_ID") to pretend, as far as Juju is concerned, that you’re not calling from another hook context.
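Something along these lines (the charm directory, hook name, and delay are illustrative, and whether the agent accepts the detached child is exactly what’s in question):

```python
import os
import subprocess

def dispatch_env():
    """Copy the current environment, dropping the hook-context marker so
    the dispatch call doesn't look like nested hook execution to Juju."""
    env = dict(os.environ)
    env.pop("JUJU_CONTEXT_ID", None)
    return env

def schedule_dispatch(hook, delay=30,
                      charm_dir="/var/lib/juju/agents/unit-mycharm-0/charm"):
    """Fire-and-forget child that sleeps, then invokes ./dispatch.
    charm_dir and the unit name are hypothetical."""
    cmd = f"sleep {delay} && JUJU_DISPATCH_PATH=hooks/{hook} ./dispatch"
    return subprocess.Popen(["/bin/sh", "-c", cmd], cwd=charm_dir,
                            env=dispatch_env(), start_new_session=True)
```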


I tried something like that, and it didn’t work. I suspect that the failure had to do with the dispatch script being fired while the parent process was still processing other hooks. That’s a finicky thing to solve, and is definitely a potential source of the “bad smell” you were detecting …

I’ll play around with it a little bit more, to see if I can put together a minimally stinky version. :slight_smile: