When I first started charming, my impression was that charm code should never let a unit go into error state. Instead, I thought, we should catch all exceptions and go into blocked status with an appropriate message (and apparently it wasn’t just me thinking that).
However, pretty soon it became quite clear that this would overcomplicate charm code:
- Some charms use centralized status setting, and some use decentralized.
  - With centralized status setting (collect-status or “common exit hook”), it would be difficult to propagate a status change from the line where an exception was caught.
  - With decentralized status setting, we could be overwriting an important status message from before.
- Even if we successfully prevented an error status, there is not much the juju admin can do about it other than wait for update-status to retry, which means charm code would need to maintain some state to have context to continue from on the next update-status.
- “Blocked” status isn’t blocking anything: it’s just a cosmetic feature in juju status that the admin needs to know to check, otherwise it will go unnoticed.
Instead, we could just let the charm go into error status. Some advantages:
- Juju will automatically keep retrying the failing hook (see juju resolve).
- Juju preserves the hook’s original context (e.g. relation data at the time the hook originally fired).
Examples of exceptions that could potentially put the charm in error status
Pebble operations
If a pebble operation such as push fails, the charm should probably go into error state.
Pebble exceptions that are likely to go away the next time juju retries the hook:
pebble.ConnectionError
K8s operations
When charm code interacts directly with k8s (e.g. using lightkube) and some instruction critically fails, the charm should probably go into error state.
Lightkube exceptions that are likely to go away the next time juju retries the hook:
httpx.RemoteProtocolError
httpx.ReadError
httpx.ConnectError
Towards centralized status setting
When charm code (or library code) sets status all over the place, we could be overwriting important messages. A typical workaround is some state-keeping mechanism such as StoredState. But ideally, charm code shouldn’t do that. Charms are perhaps not stateless, but it would be great if we could treat them as amnesiacs.
The collect-status feature helps with that. In the collect-status approach, all statuses should be queryable at any point in time. For this reason, it seems like:
Queryable statuses should be queried via collect-status; everything else should send a charm into error state.
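To make “queryable at any point in time” a bit more concrete, here is a minimal sketch of a “pull” status: it is computed only from facts the charm can observe right now, so nothing needs to be remembered between hooks. The Status tuple, the Component class and its fields are hypothetical stand-ins (a real charm would return ops status objects), and the priority order is an assumption based on my understanding of how collect-status picks a winner.

```python
from dataclasses import dataclass
from typing import List, NamedTuple, Optional


class Status(NamedTuple):
    """Hypothetical stand-in for an ops status object (name + message)."""
    name: str
    message: str = ""


# Assumed priority order, highest first, for statuses a charm can set itself.
PRIORITY = ["blocked", "maintenance", "waiting", "active"]


@dataclass
class Component:
    """Hypothetical component whose status is derived from observable facts."""
    name: str
    config_value: Optional[str]  # e.g. a required config option
    relation_ready: bool         # e.g. whether a required relation has data

    def get_status(self) -> Status:
        # No memory of past events is needed: the answer depends only on
        # what can be observed at the moment of the call.
        if not self.config_value:
            return Status("blocked", f"{self.name}: missing config")
        if not self.relation_ready:
            return Status("waiting", f"{self.name}: waiting for relation data")
        return Status("active")


def pick_status(statuses: List[Status]) -> Status:
    """Return the highest-priority status among those collected."""
    return min(statuses, key=lambda s: PRIORITY.index(s.name))
```

Because get_status() has no side effects, it can be called at the end of every hook without worrying about which event fired.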
Problems with error status
An error state is a situation in which the charm is unable to take any action to progress the model of the world.
Error state prevents further operations
Juju doesn’t let you simply juju remove an application in an error state, because without the charm we wouldn’t know how to clean up correctly. Upgrades and model migration also won’t be possible while a charm is in error state.
Error state “detaches” the charm from managing the application
Going into error state stops the charm from reacting to what is happening. By going into error state a charm is signalling that it cannot manage the application anymore.
Error state may introduce model departures
When the juju model (e.g. config options, relation data) tells our charm one thing, but the charm does not (or cannot) have it reflected (e.g. update workload filesystem or statefulset), then we are in “model departure”.
In some error state situations, Juju’s preservation of hook context is desirable. In other error state situations, it is detrimental: correcting a config option or relation data would not help resolve the error, because juju will continue to retry the hook with the old context.
At this point, manual intervention would be needed to force juju to ignore the failing hook:
juju resolve --no-retry <unit>
but the admin would have to remember to run it only after a corrective action was taken, i.e. only when there is another config-changed or relation-changed in the queue, otherwise we would remain in model-departure land.
A compromise
So there are three kinds of statuses:
- “Pull” statuses. Those are statuses that can be queried for (“pulled”) at any point in time. Those are the kinds of statuses that collect-status was originally designed for.
- “Push” statuses. Those are statuses that are “pushed” in specific circumstances, and need to be preserved across charm re-init, because they cannot be queried for without attempting a mutating instruction.
- Error status. Those are situations where we can’t do anything about the error and expect Juju to resolve it in the next hook execution, without necessarily requiring manual admin intervention.
  - Errors we expect to be transient (pebble push, k8s comms) can be wrapped in retry logic.
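The retry wrapping for transient errors could be sketched roughly as below. This is only an illustration using the standard library: TransientError is a hypothetical stand-in for exceptions such as pebble.ConnectionError or httpx.ConnectError, and the retry decorator, its parameters and push_config are names I made up for the example. The important property is that once the attempts are exhausted the exception still propagates, so the charm ends up in error status and Juju retries the hook.

```python
import functools
import time
from typing import Callable, List, Tuple, Type


class TransientError(Exception):
    """Hypothetical stand-in for e.g. pebble.ConnectionError."""


def retry(
    exceptions: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry the wrapped callable when one of `exceptions` is raised.

    After the last attempt the exception is re-raised, so the hook fails.
    Any exception not listed in `exceptions` propagates immediately.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: let the hook fail
                    sleep(delay)
        return wrapper
    return decorator


calls: List[int] = []


@retry((TransientError,), attempts=3, delay=0.0)
def push_config() -> str:
    """Hypothetical operation that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("pebble not reachable yet")
    return "pushed"


result = push_config()
```

Keeping the number of attempts and the delay small matters here, since the whole retry loop runs inside a single hook execution.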
In code, the collect-status handler may look like this:
    def _on_collect_status(self, event: ops.CollectStatusEvent):
        # Collect (non-racy) "pull" statuses
        event.add_status(self.component_a.get_status())
        event.add_status(self.component_b.get_status())

        # Collect "push" statuses (and racy "pull" statuses that
        # were converted to push statuses)
        for status in self._stored.push_statuses:
            event.add_status(status)

        # Error status is not collected - it's just a traceback
        # from an uncaught exception that puts the charm in error.
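Since “push” statuses have to survive charm re-init, they need to be kept as simple, serializable values. One possible shape, sketched below, keys each pushed status by the component that pushed it, so that one component updating or clearing its status cannot overwrite another’s. The PushStatuses class and its method names are hypothetical, and the plain dict stands in for whatever persistent store the charm uses (e.g. StoredState). In a real charm, the (name, message) pairs could presumably be turned back into status objects with ops.StatusBase.from_name before being passed to event.add_status.

```python
from typing import Dict, List, MutableMapping, Tuple

StatusPair = Tuple[str, str]  # (status name, message), e.g. ("blocked", "...")


class PushStatuses:
    """Hypothetical helper that records "push" statuses per source."""

    def __init__(self, store: MutableMapping[str, List[str]]):
        # `store` stands in for persistent state; values are plain lists
        # of strings so they stay serializable.
        self._store = store

    def push(self, source: str, name: str, message: str = "") -> None:
        """Record (or replace) the status pushed by `source`."""
        self._store[source] = [name, message]

    def clear(self, source: str) -> None:
        """Forget the status of `source`, e.g. once its operation succeeded."""
        self._store.pop(source, None)

    def all(self) -> List[StatusPair]:
        """Return every recorded status as a (name, message) pair."""
        return [(name, message) for name, message in self._store.values()]


persisted: Dict[str, List[str]] = {}  # would outlive a single hook in a real charm
statuses = PushStatuses(persisted)
statuses.push("pebble", "blocked", "failed to push config")
statuses.push("k8s", "waiting", "patching statefulset")
statuses.clear("k8s")  # the patch eventually went through
```

A fresh PushStatuses built over the same store on the next hook sees the same entries, which is the “preserved across charm re-init” property described above.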
To summarize,
- Use collect-status.
- If possible, convert “push” statuses into “pull” statuses.
- Make sure racy statuses such as can_connect and is_ready are handled as “push” statuses.
- Wrap transient errors such as pebble.ConnectionError with retry logic.
- Errors that make the application unmanageable should put the charm in error status.
Implementation example: https://github.com/canonical/prometheus-k8s-operator/pull/561