What happens when a charm is not yet ready to process an event? For example, a charm might wish to handle a `relation-joined` event by populating the relation with information about how to connect to its workload. But what if the workload services are still in the middle of initialising when the charm receives the event?
There are three kinds of events that the Operator Framework handles:

- Juju hooks. For example: `install`, `relation-joined`.
- Events generated by the workload. For example: `pebble-ready`, `pebble-custom-notice`.
- Events generated by the framework. For example: `collect-unit-status` and custom events.
Some of the events of the first two kinds can be “deferred”, which asks the Operator Framework to re-run the handler later. Not all events can be deferred - for example, action events, secret expired events, and stop events cannot.
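Attempting to defer one of these fails at run time. For example, in recent versions of Ops (the action name here is illustrative):

    def _on_do_backup_action(self, event: ops.ActionEvent):
        # Action events cannot be deferred; in recent versions of Ops this
        # raises RuntimeError("cannot defer action events").
        event.defer()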
The Operator Framework does not have a continuously running main loop that processes deferred or framework-generated events on a specific schedule. Instead, Ops is invoked by the Juju agent whenever it receives an event from the Juju controller. This means that deferred events won’t be re-run until the next event arrives from Juju. In some cases, a deferred event will need to await the arrival of the next `update-status` hook.
By default, the `update-status` interval is five minutes, so this may lead to a five-minute delay in processing the deferred event. However, admins can configure this interval to be hours or even days, so it’s not safe to assume that the maximum delay before the deferred event handler re-runs is five minutes.
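For example, an admin can change the interval for a model with `juju model-config update-status-hook-interval=60m`.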
Despite this potential delay, it is not good practice to block event execution by tailing a log or executing a long sleep in the middle of an event handler.
Often, Juju’s event model provides an inherent workaround for the delay. For example, let’s take a look at a common pattern: configuring a service when a `relation-joined` hook fires.
    def _on_db_relation_joined(self, event):
        try:
            self._set_api_key(event.relation)
        except ops.pebble.ConnectionError:
            event.defer()
            return
Assume that `_set_api_key` does something sensible, like writing to a file which will trigger the workload service to dynamically load its configuration. What happens the very first time `relation-joined` fires after the `install` hook, possibly before the workload container becomes ready?
In that particular case, we know that Juju will dispatch a `pebble-ready` event when the charm’s container is up and running. `pebble-ready` will trigger our deferred event at exactly the moment that it can be run, with no delay.
However, a better pattern here is to call `_set_api_key` when all preconditions are met, regardless of which event has fired.
    def _on_db_relation_joined(self, event):
        try:
            self._set_api_key(event.relation)
        except ops.pebble.ConnectionError:
            logger.debug("Not setting API key: container not yet ready.")
            # The pebble-ready event will set the key when the container is ready.
            return
    def _on_container_pebble_ready(self, event):
        rel = self.model.get_relation("db")
        if rel:  # else the relation-joined event will set the key
            try:
                self._set_api_key(rel)
            except ops.pebble.ConnectionError:
                logger.warning("Connection to Pebble lost in pebble-ready")
                event.defer()
                return
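For reference, here is a minimal sketch of what `_set_api_key` might look like in this pattern, matching the single-argument variant above (the container name, file path, and relation data key are all assumptions):

    def _set_api_key(self, relation):
        # Read the key from the remote application's relation data and push it
        # to where the workload watches for configuration. Container.push
        # raises ops.pebble.ConnectionError while the container is still
        # starting, which the handlers above catch.
        key = relation.data[relation.app].get("api-key", "")
        container = self.unit.get_container("workload")
        container.push("/etc/app/api_key", key, make_dirs=True)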
However, this isn’t always possible. Consider another common case, where it’s `config-changed` that triggers the service configuration. Although an error communicating with Pebble is most likely because the container isn’t ready yet, there might be other causes (perhaps the container is very busy), and a `config-changed` event can happen at any time, not just during the setup phase. In this case, we have no choice but to defer:
    def _on_config_changed(self, event):
        try:
            self._set_api_key(self.config['api_key'])
        except ops.pebble.ConnectionError:
            event.defer()
            return
That said …
The Deferrer’s Dilemmas: Consequences of deferring events
Dilemma #1: Lag in Execution
What if the service takes a little while to start, in a way that is not visible to an init daemon like Pebble? RabbitMQ, for example, has some work to do when it first starts, and it is wise to do a liveness check before interacting with it. Here’s code that handles cases like this:
    def _on_config_changed(self, event):
        if not self._liveness_check():
            event.defer()
            return
        self._set_api_key(self.config['api_key'])
Assume that `_liveness_check` pings a port that returns a reliable status indicator.
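As a sketch of what that might look like - here a plain TCP connect, though a real charm might query an HTTP health endpoint instead:

    def _liveness_check(self) -> bool:
        # A minimal TCP liveness probe; requires "import socket" at module
        # level. Port 5672 is RabbitMQ's AMQP port - adjust for the workload.
        try:
            with socket.create_connection(("localhost", 5672), timeout=1):
                return True
        except OSError:
            return False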
Since we are waiting on two conditions, and `pebble-ready` may fire before the second condition is true, we have a decision to make. Defer anyway, or wait explicitly?
The answer is almost always to defer. The `_liveness_check` is most likely to return `False` during an initial deployment, when the Juju agent will receive and process a flurry of `relation-joined` hooks, upon which our deferred event can piggyback.
Note that it’s often the case that the service needs more configuration than is found in the charm config - for example, from the relations. In that case, it’s often cleanest to attempt to holistically configure the workload from all of the relevant events (`config-changed`, `relation-changed`, `pebble-ready`, etc.), with appropriate guards, and rely on the work being done when everything is ready. For example:
    def _on_config_changed(self, event):
        self._push_config(event)

    def _on_relation_changed(self, event):
        self._push_config(event)

    def _on_container_pebble_ready(self, event):
        self._push_config(event)

    def _push_config(self, event):
        if not self.config.get("api_key"):
            return
        if not self.model.get_relation("db"):
            return
        try:
            self._set_api_key(self.config["api_key"], self.model.get_relation("db"))
        except ops.pebble.ConnectionError:
            event.defer()
            return
A better approach for a slow starter than a plain `defer()` is to have the service tell Ops (via Juju) that it has started. This can be done with a Pebble custom notice (in Juju 3.4 and above). Ideally, the service has some form of “ready” hook that can be customised to run `pebble notify`, but if not, a custom script can be deployed to the container that runs the liveness check until it succeeds, then calls `pebble notify`. In the charm, the `pebble-custom-notice` event can be observed, and that handler can call `_set_api_key`.
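On the charm side, that might look like the following sketch, assuming the container is named `workload` and the script in the container runs `pebble notify example.com/app/ready` once the liveness check passes (both names are assumptions):

    def __init__(self, framework):
        super().__init__(framework)
        framework.observe(
            self.on["workload"].pebble_custom_notice, self._on_pebble_custom_notice
        )

    def _on_pebble_custom_notice(self, event: ops.PebbleCustomNoticeEvent):
        # Only react to the readiness notice our script sends.
        if event.notice.key == "example.com/app/ready":
            self._set_api_key(self.model.get_relation("db"))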
When Pebble notices are not available, such as in a machine charm, we generally recommend a small number of retries when the charm has reason to expect that the issue will resolve within a very short period of time. When it’s likely that several seconds, or longer, will be required, we recommend considering what other events are likely to arrive after the service is ready, and having each of them take care of the post-service-start work via a common method. If the charm is unlikely to receive other events (remember: you shouldn’t rely on `update-status` arriving in time), then we recommend judicious use of `defer()`.
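A sketch of the bounded-retry variant, with an arbitrary retry count and delay (and `import time` at module level):

    def _on_config_changed(self, event):
        for _ in range(3):  # a small, bounded number of retries
            try:
                self._set_api_key(self.config["api_key"])
                return
            except ops.pebble.ConnectionError:
                time.sleep(1)
        event.defer()  # still not ready after a few seconds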
Note: in the very simple case where a charm author is trying to test a service (e.g. Loki) and there exists a test charm which needs to wait for the Loki service to be ready, there usually aren’t enough events in the model to quickly unstick a deferred event.
There are several solutions:

- Set a high frequency for `update-status` in a testing environment. This is usually an anti-pattern, because it runs the risk of masking other performance problems.
- Start a ping-pong of updated relation data. This is also an anti-pattern; relations are not fit to serve as a bus for what is essentially inter-process communication.
- Use `jhack ffwd` or `jhack fire` to trigger the event. This requires the participation of a client computer, running outside of the model. In test environments, this computer can be the same one that is driving the test.
- In a Kubernetes sidecar charm, use a Pebble custom notice to have the workload notify the charm (via Juju) that the service is ready.
Dilemma #2: Out-of-order events
When a charm defers an event, it gets added to a queue of events that are executed before the event that triggered the Ops run. This is done to preserve ordering whenever possible, but it can cause two issues:
Firstly, a deferred event may fire before the event that will ‘fix’ it.
In these cases, there may be an issue with the charm’s logic, and it may be better to refactor the charm to better reflect the order of operations in the Juju cloud.
For example, the Juju controller emits an `install` event for every charm, followed by a `config-changed` event. If logic in the `install` handler cannot complete until after `config-changed` has been triggered, it is probably necessary to move that logic into the `config-changed` handler.
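One way to do that while keeping the setup one-shot is a guard in stored state - a sketch, with hypothetical setup logic:

    class MyCharm(ops.CharmBase):
        _stored = ops.StoredState()

        def __init__(self, framework):
            super().__init__(framework)
            self._stored.set_default(initial_setup_done=False)
            framework.observe(self.on.config_changed, self._on_config_changed)

        def _on_config_changed(self, event):
            # One-time setup that depends on config runs here, not in install.
            if not self._stored.initial_setup_done:
                self._do_initial_setup()  # hypothetical one-time setup
                self._stored.initial_setup_done = True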
Secondly, deferring one event does not prevent other events in the queue from being processed. One consequence is that a twice deferred event may be executed outside of the ordering contracts that Juju makes.
For example, `install` is always followed by `config-changed`. If a charm defers the `install` handler once, it will execute before any `config-changed` handlers. However, if the `install` event is deferred again, the `config-changed` event may be processed first, and the `install` hook would then fire in something other than the expected order.
In these cases, ‘later’ events may need to ‘catch up’ or ‘reconcile’ to ensure that the charm is in the expected state, or check for missing work and `defer()` the second event as well. Care needs to be taken when designing the charm logic so that it’s not possible to continuously build up a queue of deferred events - at some point the charm needs to give up and ask the user to solve the problem.
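For example, a charm might count its deferrals in stored state and block once a limit is reached - a sketch, with an arbitrary limit (assumes `self._stored.set_default(defer_count=0)` in `__init__`):

    def _on_config_changed(self, event):
        try:
            self._set_api_key(self.config["api_key"])
            self._stored.defer_count = 0  # success: reset the counter
        except ops.pebble.ConnectionError:
            self._stored.defer_count += 1
            if self._stored.defer_count > 5:  # arbitrary limit
                # Give up and ask the user to investigate.
                self.unit.status = ops.BlockedStatus("cannot connect to Pebble")
                return
            event.defer()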