When I first started charming, my impression was that charm code should never let a unit go into error state. Instead, I thought, we should catch all exceptions and go into blocked status with an appropriate message (and apparently I wasn’t the only one thinking that).
However, pretty soon it became quite clear that this would overcomplicate charm code:
- Some charms use centralized status setting, and some use decentralized status setting:
  - with centralized status setting (collect-status or “common exit hook”), it would be difficult to propagate a status change from the line where an exception was caught;
  - with decentralized status setting, we could be overwriting an important status message from before.
- Even if we successfully prevented an error status, there is not much the juju admin can do about it other than wait for update-status to retry. This means charm code would need to maintain some state so that it has context to continue from on the next update-status.
- “Blocked” status isn’t blocking anything - it’s just a cosmetic feature in juju status that the admin needs to know to check, otherwise it will go unnoticed.
Instead, we could just let the charm go into error status. Some advantages:
- Juju will automatically keep retrying the failing hook (see
- Juju preserves the hook’s original context (e.g. relation data at the time the hook originally fired).
If a pebble operation such as push fails, the charm should probably go into error state.
Pebble exceptions that are likely to go away the next time juju retries the hook:
When charm code interacts directly with k8s (e.g. using lightkube) and some instruction critically fails, the charm should probably go into error state.
Lightkube exceptions that are likely to go away the next time juju retries the hook:
When charm code (or library code) sets status all over the place, we could be overwriting important messages. A typical workaround is some state-keeping mechanism such as StoredState. But ideally, charm code shouldn’t do that. Charms are perhaps not stateless, but it would be great if we could treat them as amnesiacs.
The collect-status feature helps with that. In the collect-status approach, all statuses should be queryable at any point in time. For this reason, it seems like:
Queryable statuses should be queried via collect-status; everything else should send a charm into error state.
An error state is a situation in which the charm is unable to take any action to progress the model of the world.
Juju doesn’t let you simply juju remove an application in an error state, because without the charm we wouldn’t know how to clean up correctly. Upgrades and model migrations are not possible either while a charm is in error state.
Going into error state stops the charm from reacting to what is happening. By going into error state a charm is signalling that it cannot manage the application anymore.
When the juju model (e.g. config options, relation data) tells our charm one thing, but the charm does not (or cannot) have it reflected (e.g. update workload filesystem or statefulset), then we are in “model departure”.
In some error state situations, Juju’s preservation of hook context is desirable. In other error state situations, it is detrimental: correcting a config option or relation data would not help the charm recover, because juju will continue to retry the hook with the old context.
At this point, manual intervention would be needed to force juju to ignore the failing hook:
juju resolve --no-retry <unit>
but the admin would have to remember to run it only after a corrective action was taken, i.e. only when there is another relation-changed in the queue, otherwise we would remain in model-departure land.
So there are three kinds of statuses:
- “Pull” statuses. Those are statuses that can be queried for (“pulled”) at any point in time. Those are the kinds of statuses that collect-status was originally designed for.
- “Push” statuses. Those are statuses that are “pushed” in specific circumstances, and need to be preserved across charm re-init, because they cannot be queried for without attempting a mutating instruction.
- Error status. Those are situations where we can’t do anything about the error and are expecting Juju to resolve in the next hook execution, without necessarily requiring manual admin intervention.
  - Errors we expect to be transient (pebble push, k8s comms) can be wrapped in retry logic.
In code it may look like this:
    def _on_collect_status(self, event: ops.CollectStatusEvent):
        # Collect (non-racy) "pull" statuses

        # Collect "push" statuses (and racy "pull" statuses that
        # were converted to push statuses)
        for status in self._stored.push_statuses:
            # Assumption: each stored entry is a [name, message] pair.
            event.add_status(ops.StatusBase.from_name(*status))

        # Error status is not collected - it's just a traceback
        # from an uncaught exception that puts the charm in error.
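To make the push/pull split a bit more concrete, here is a minimal, framework-free sketch. It deliberately does not use ops: the `StatusTracker` class, its method names, the `check_retention` check and the `retention` config key are all hypothetical, and the plain dict stands in for persistent storage such as StoredState. In a real charm the result of `collect()` would instead be fed to `event.add_status(...)` inside the collect-status handler.

```python
from typing import Callable, Dict, List, Optional, Tuple

Status = Tuple[str, str]  # (status name, message)

# Higher number wins; this mirrors the order in which ops prefers
# blocked over maintenance over waiting over active.
PRIORITY = {"active": 1, "waiting": 2, "maintenance": 3, "blocked": 4}


class StatusTracker:
    """Hypothetical helper combining "push" and "pull" statuses."""

    def __init__(self, stored: Dict[str, List[str]]) -> None:
        # `stored` stands in for persistent storage such as StoredState,
        # so it must survive charm re-init.
        self._stored = stored
        self._pull_checks: List[Callable[[], Optional[Status]]] = []

    def add_pull_check(self, check: Callable[[], Optional[Status]]) -> None:
        """Register a side-effect-free check that can run at any time."""
        self._pull_checks.append(check)

    def push(self, key: str, name: str, message: str) -> None:
        """Record the outcome of a mutating operation."""
        self._stored[key] = [name, message]

    def clear(self, key: str) -> None:
        """Forget a push status once the operation has succeeded."""
        self._stored.pop(key, None)

    def collect(self) -> Status:
        """Return the highest-priority status among push and pull statuses."""
        candidates: List[Status] = [(n, m) for n, m in self._stored.values()]
        for check in self._pull_checks:
            status = check()
            if status is not None:
                candidates.append(status)
        if not candidates:
            return ("active", "")
        return max(candidates, key=lambda s: PRIORITY[s[0]])


config = {"retention": "15x"}


def check_retention() -> Optional[Status]:
    """Pull check: derived purely from current config, no side effects."""
    if config["retention"][-1:] in ("d", "h"):
        return None
    return ("blocked", f"invalid retention: {config['retention']}")


stored: Dict[str, List[str]] = {}
tracker = StatusTracker(stored)
tracker.add_pull_check(check_retention)
tracker.push("reload", "maintenance", "config reload failed")
first = tracker.collect()

# Simulate the next hook: a fresh object, but the same persisted storage.
config["retention"] = "15d"
tracker = StatusTracker(stored)
tracker.add_pull_check(check_retention)
second = tracker.collect()

tracker.clear("reload")
third = tracker.collect()
```

The pull status disappears by itself as soon as the config is fixed, because it is recomputed on every collection. The push status has to be cleared explicitly, which is why it needs to be persisted between hooks.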
- If possible, convert “push” statuses into “pull” statuses.
- Make sure racy statuses such as is_ready are handled as “push” statuses.
- Wrap transient errors such as pebble.ConnectionError with retry logic.
- Errors that make the application unmanageable should put the charm in error status.
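The advice to wrap transient errors in retry logic could be sketched roughly as follows. This is a standard-library-only illustration, not taken from any charm: the `retry` decorator, `TransientError` and `push_config` are hypothetical names. `TransientError` is a stand-in for an exception such as pebble.ConnectionError, which is what a real charm would pass to the decorator. Note that once the attempts are exhausted the exception is re-raised on purpose, so the hook fails and the unit goes into error state.

```python
import functools
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    exceptions: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay: float = 1.0,
) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Retry the wrapped callable when one of `exceptions` is raised.

    After the final failed attempt the exception propagates to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: let the hook fail
                    time.sleep(delay)

        return wrapper

    return decorator


class TransientError(Exception):
    """Stand-in for a transient failure such as pebble.ConnectionError."""


calls = {"count": 0}


@retry((TransientError,), attempts=3, delay=0.0)
def push_config() -> str:
    """Hypothetical workload update that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("pebble not reachable yet")
    return "pushed"


result = push_config()
```

Keeping the number of attempts and the delay small seems sensible here, since Juju itself retries a failed hook and a long in-hook retry loop only delays that.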
Implementation example: https://github.com/canonical/prometheus-k8s-operator/pull/561