It's probably ok for a unit to go into error state

sed-i · 24 January 2024 22:19

When I just started charming, the impression I got was that charm code should try and never let a unit go into error state. Instead, I thought, we should catch all exceptions and go into blocked with an appropriate message (and apparently it wasn’t just me thinking that).

However, pretty soon it became quite clear that this would overcomplicate charm code:

Some charms use centralized status-setting, and some - decentralized.
- with centralized status setting (collect-status or “common exit hook”) it would be difficult to propagate a status change from the line where an exception was caught;
- with decentralized status setting, we could be overwriting an important status message from before.
Even if we successfully prevented an error status, there is not much the juju admin can do about it other than waiting for update-status to retry, which means charm code would need to maintain some state to have context to continue from on the next update-status.
“Blocked” status isn’t blocking anything - it’s just a cosmetic feature in juju status that the admin needs to know to check, otherwise it will go unnoticed.

Instead, we could just let the charm go into error status. Some advantages:

Juju will automatically keep retrying the failing hook (see juju resolve).
Juju preserves the hook’s original context (e.g. relation data at the time the hook originally fired).

Examples of exceptions that could potentially put the charm in error status

Pebble operations

If a pebble operation such as push fails, the charm should probably go into error state.

Pebble exceptions that are likely to go away the next time juju retries the hook:

pebble.ConnectionError

K8s operations

When charm code interacts directly with k8s (e.g. using lightkube) and some instruction critically fails, the charm should probably go into error state.

Lightkube exceptions that are likely to go away the next time juju retries the hook:

httpx.RemoteProtocolError
httpx.ReadError
httpx.ConnectError

Towards centralized status setting

When charm code (or library code) sets status all over the place, we could be overwriting important messages. A typical workaround is some state-keeping mechanism such as StoredState. But ideally, charm code shouldn’t do that. Charms are perhaps not stateless, but it would be great if we could treat them as amnesiacs.

The collect-status feature helps with that. In the collect-status approach, all statuses should be queryable at any point in time. For this reason, it seems like:

Queryable statuses should be queried via collect-status; everything else should send a charm into error state.

Problems with error status

An error state is a situation in which the charm is unable to take any action to progress the model of the world.

Error state prevents further operations

Juju doesn’t let you simply juju remove an application in an error state because without the charm we wouldn’t know how to clean up correctly. Upgrades and model migrations won’t be possible either while a charm is in error state.

Error state “detaches” the charm from managing the application

Going into error state stops the charm from reacting to what is happening. By going into error state a charm is signalling that it cannot manage the application anymore.

Error state may introduce model departures

When the juju model (e.g. config options, relation data) tells our charm one thing, but the charm does not (or cannot) have it reflected (e.g. update workload filesystem or statefulset), then we are in “model departure”.

In some error state situations, Juju’s preservation of hook context is desirable. In other error state situations, it is detrimental: correcting a config option or relation data would not help resolve the unit, because juju will continue to retry the hook with the old context.

At this point, manual intervention would be needed to force juju to ignore the failing hook:

juju resolve --no-retry <unit>

but the admin would have to remember to run it only after a corrective action was taken, i.e. only when there is another config-changed or relation-changed in the queue, otherwise we would remain in model-departure land.

A compromise

So there are three kinds of statuses:

“Pull” statuses. Those are statuses that can be queried for (“pulled”) at any point in time. Those are the kinds of statuses that collect-status was originally designed for.
“Push” statuses. Those are statuses that are “pushed” in specific circumstances, and need to be preserved across charm re-init, because they cannot be queried for without attempting a mutating instruction.
Error status. Those are situations where we can’t do anything about the error and are expecting Juju to resolve in the next hook execution, without necessarily requiring manual admin intervention.
- Errors we expect to be transient (pebble push, k8s comms) can be wrapped in retry logic.

In code it may look like this:

def _on_collect_status(self, event: ops.CollectStatusEvent):
    # Collect (non-racy) "pull" statuses
    event.add_status(self.component_a.get_status())
    event.add_status(self.component_b.get_status())

    # Collect "push" statuses (and racy "pull" statuses that 
    # were converted to push statuses)
    for status in self._stored.push_statuses:
        event.add_status(status)
    
    # Error status is not collected - it's just a traceback
    # from an uncaught exception that puts the charm in error.
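As far as I understand, once all collect-status handlers have run, ops sets the highest-priority status among those added (blocked beats maintenance, which beats waiting, which beats active). Here is a minimal stdlib-only sketch of that resolution step, so it's clearer how "pull" and "push" statuses end up competing. `Status`, `PRIORITY` and `resolve` are hypothetical stand-ins for illustration, not the real ops classes.

```python
from dataclasses import dataclass

# Assumed priority order (higher wins), based on my understanding of how
# ops picks the winning status after collect-status.
PRIORITY = {"blocked": 4, "maintenance": 3, "waiting": 2, "active": 1}


@dataclass(frozen=True)
class Status:
    """Hypothetical stand-in for an ops status (name + message)."""
    name: str
    message: str = ""


def resolve(statuses):
    """Return the highest-priority status; on ties, the first one added wins."""
    if not statuses:
        return Status("unknown")
    # max() returns the first maximal element, which gives the tie-breaking.
    return max(statuses, key=lambda s: PRIORITY.get(s.name, 0))


pull = [Status("active"), Status("waiting", "component_b not ready")]
push = [Status("blocked", "last pebble push failed")]
winner = resolve(pull + push)
```

In this sketch `winner` is the blocked "push" status, which is why a stale push status that never gets cleared would mask everything else.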

To summarize,

Use collect-status.
If possible, convert “push” statuses into “pull” statuses.
Make sure racy statuses such as can_connect and is_ready are handled as “push” statuses.
Wrap transient errors such as pebble.ConnectionError with retry logic.
Errors that make the application unmanageable should put the charm in error status.

Implementation example: https://github.com/canonical/prometheus-k8s-operator/pull/561

carlcsaposs · 25 January 2024 09:38

For propagating status, I’ve found the concept of a StatusException quite helpful (example usage 1, example usage 2)
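To illustrate the idea (this is a rough sketch, not the code from the linked examples): an exception carries the status up the call stack, and a single central function decides what to set. `Status`, `configure_workload` and `reconcile` are hypothetical names, and `Status` stands in for something like `ops.BlockedStatus` so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Hypothetical stand-in for an ops status such as ops.BlockedStatus."""
    name: str
    message: str = ""


class StatusException(Exception):
    """Raised anywhere in charm code to carry a status to one central place."""

    def __init__(self, status: Status):
        super().__init__(status.message)
        self.status = status


def configure_workload(config: dict) -> None:
    # Deep in the call stack: no status is set here, it is only raised.
    if "database-url" not in config:
        raise StatusException(Status("blocked", "missing database-url config"))


def reconcile(config: dict) -> Status:
    """Single exit point: the only function that decides the unit status."""
    try:
        configure_workload(config)
    except StatusException as e:
        return e.status
    # Any other exception is deliberately not caught, so the hook would
    # fail and the unit would go into error state.
    return Status("active")
```

With this shape, `reconcile({})` yields the blocked status and `reconcile({"database-url": "..."})` yields active, while genuine bugs still surface as error state.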

carlcsaposs · 25 January 2024 09:44

One other potential drawback is that error status prevents the user from removing the charm (without --force, aka without cleanup)

In the case of mysql-router, this was important enough for us to use WaitingStatus("[...] Will retry next Juju event") instead of error status. Context: PR description

mthaddon · 25 January 2024 10:09

This depends on model configuration, specifically automatically-retry-hooks. It defaults to true but it’s not set that way everywhere.

ppasotti · 25 January 2024 10:31

Speaking of, what are the use cases/rationales for people to turn it off?

mthaddon · 25 January 2024 10:42

Operational teams at Canonical used to consider it a bug if a charm they maintained ever entered an error state, so would turn this on to track down when errors are happening during charm development to avoid them occurring in production wherever possible. I’m not sure if this is still the case, that’s going back quite a few years now.

dylanstathis · 25 January 2024 10:42

I think I tell someone not to set BlockedStatus about twice per week. Blocked status is supposed to be for when the charm is waiting on a user action (adding a relation, setting a config value, etc.) but it seems to be used for every kind of issue. I wonder if there is a better way to get that message across to charmers?

ca-scribner · 25 January 2024 14:28

re propagating status: similar to what @carlcsaposs cites, kubeflow team uses ErrorWithStatus which feels pretty similar. Openstack’s Sunbeam charms have something like this too but they use it via a context manager which imo feels really clean in a charm (sorry, searched for a link but didn’t see one quickly)
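Since I couldn't find a link, here is roughly what I mean by the context manager flavour. This is my own stdlib-only sketch, not the Kubeflow or Sunbeam implementation; `guard`, `StatusHolder` and this `ErrorWithStatus` are hypothetical re-creations for illustration.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Hypothetical stand-in for an ops status (name + message)."""
    name: str
    message: str = ""


class ErrorWithStatus(Exception):
    """Hypothetical exception that carries the status the unit should show."""

    def __init__(self, status: Status):
        super().__init__(status.message)
        self.status = status


class StatusHolder:
    """Hypothetical stand-in for the unit object whose status gets set."""

    def __init__(self):
        self.status = Status("active")


@contextmanager
def guard(holder: StatusHolder):
    """Turn ErrorWithStatus into a status; let everything else propagate."""
    try:
        yield
    except ErrorWithStatus as e:
        holder.status = e.status


unit = StatusHolder()
with guard(unit):
    raise ErrorWithStatus(Status("waiting", "database relation not ready"))
```

After the `with` block, `unit.status` is the waiting status. Exceptions of any other type are not swallowed, so unexpected failures still put the unit in error.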

re @carlcsaposs about error and removal - this really hurts the UX imo. New users won’t know how to really get rid of an errored charm either, which hurts adoption.

vultaire · 25 January 2024 16:01

From my previous experience, when units are in an error state, that interferes with model migrations between Juju controllers. It may also interfere with Juju upgrades, although I don’t recall; maybe that’s fine? But model migrations are definitely problematic, and some Juju upgrades also require model migrations.

I.e. if there are “expected” cases where something is going to go into an error state, Juju migrations and at least some Juju upgrades are blocked, full stop, until those issues are resolved. That seems very problematic to me, which is why I prefer not to see error states unless a charm is actually breaking due to a bug.

ca-scribner · 25 January 2024 16:03

I think @sed-i summarizes the different types of situation well, and I like the concepts of push and pull statuses. I haven’t used collect-status really yet, but many Kubeflow charms implement something similar internally. Where it gets tricky though is those push statuses like @sed-i mentioned - they might change state, so you can’t poll them but instead just try to do them.

I recall in the design for collect-status debate on stateless (to check status, you actually check the current state of things) vs stateful status (whenever something happens you store the status from that, and later something could check that state). I wonder if the debate missed that we actually sometimes need both.

I wonder @sed-i have you ever tried to write a pull status function for things you see as push statuses? I’ve done that a bit and don’t recall if I ever hit a case where it wouldn’t work.

ca-scribner · 25 January 2024 16:12

Yeah rightly or wrongly I see error state as a “never get here on purpose” sort of thing. If there’s a problem that you foresee and try to handle, that handling shouldn’t be putting the unit into error. The reason being what @vultaire and @sed-i mention - the “this errored event must resolve before anything else” behaviour of error is not always desirable, and the side effects of going to Error are broad.

I can see the appeal of putting a charm to error intentionally though. Error:

halts the charm (prevents other things from happening). There’s no built-in way to do this otherwise
has a retry mechanism, so you know Juju will try this event again

When you want both of those, it is actually useful (although sometimes what you really want is just one of those, and then Error is again a problem).

To me, the root cause is the lack of built-in ways to:

halt parts of a charm until X succeeds
retry X in Y seconds

If we had those two tools, many of these problems fall away.

benhoyt · 25 January 2024 16:48

Regarding BlockedStatus, I’m not sure why people are so confused about that one. In the Ops API reference, the BlockedStatus docs say:

The unit requires manual intervention.

An admin has to manually intervene to unblock the unit and let it proceed.

Also, the example we give under Application.status is clear too:

self.model.app.status = ops.BlockedStatus('I need a human to come help me')

benhoyt · 25 January 2024 17:01

I like @sed-i’s thoughts here, and his explanation of “pull” vs “push” statuses. Using a little stored state variable for any push statuses is a good idea to augment collect-status (though you’d have to think carefully about how and when you’d clear the push statuses).

I agree we should probably error more than we do. For example, in a pebble-ready handler, it’s expected that you can connect to the container (it’s Pebble ready, after all) so you should probably just let the error bubble up on failed Pebble operations. Or if you’re hitting a K8s or other API that should be active, maybe just letting it fail/error is the right thing to do. If the API is flakey, do say 3 retries of that operation in a simple loop (or use a retry decorator).
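For the "simple loop" option, something like the following is what I have in mind. It's a stdlib-only sketch: `TransientError` is a hypothetical stand-in for an exception such as `ops.pebble.ConnectionError`, and `with_retries` and `flaky_push` are made-up names.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for e.g. ops.pebble.ConnectionError."""


def with_retries(operation, attempts=3, delay=0.0, sleep=time.sleep):
    """Call operation up to `attempts` times, re-raising the last failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                # Out of retries: let the exception bubble up so the hook
                # fails and the unit goes into error state.
                raise
            sleep(delay)


calls = []


def flaky_push():
    """Fails twice, then succeeds, to simulate a flaky API."""
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("pebble not reachable yet")
    return "pushed"


result = with_retries(flaky_push)
```

The point of re-raising on the last attempt is that the charm only errors once the glitch has proven not to be momentary.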

But as others have pointed out, error-ing does interfere with juju remove-*. The error status is also going to show up in juju status, which – if it’s just a temporary glitch – may not be good UX. I agree that a lot of problems could be solved with a proper “defer for X seconds” or “Juju, please retry after X seconds” tool.

dylanstathis · 25 January 2024 17:49

Edit: This is in response to @benhoyt two comments back. Struggling to format it correctly on my phone.

I think the confusion comes from what human intervention means. I think people see that and think, the charm can’t recover automatically, so therefore a human needs to debug the issue, thus blocked state.

My understanding is that it is not actually for the situation above, but for a situation where the charm can tell the human what needs to be done. Do you agree?

sed-i · 25 January 2024 17:56

Paraphrasing everyone,

Use centralized status setting (the reconcile method), and propagate status changes using StatusException. @carlcsaposs, @ca-scribner

Seems like this approach does not address potential status overwrite on charm re-init.

The automatically-retry-hooks model config is not always set to true. @mthaddon

I was not aware of that model config option. It’s great to set it to false in a dev env, but in production it’s always true?

Error status is bad ux / sticky situation difficult to get out of (e.g. blocks model migration, upgrades). @carlcsaposs, @ca-scribner, @vultaire

Agreed. Would need to think well before intentionally choosing error state.

It may be bad ux when an error status bleeps in juju status. @benhoyt

Yes, it would also require python-libjuju itests to use wait_for_idle(raise_on_error=False), followed by another wait_for_idle or a status assertion. But I’m thinking that if the dev decided to raise an error rather than commit a blocked status, then juju status is probably not a decisive factor?

jameinel · 25 January 2024 20:12

Openstack charms still are tested with automatic retry disabled to ensure that it comes up without errors as part of regular behavior. I believe they aren’t run that way in the wild, but they do consider hitting Error a CI rejection failure.

tony-meyer · 25 January 2024 23:01

I think most cases are clear, but there are some cases where it’s not possible to tell whether manual intervention is needed. Say that you can’t proceed because you need a relation - is that blocked (human needs to run “integrate”) or waiting (Juju needs to set up the integration that has already been requested)?

This topic came up when @benhoyt and I were discussing early analysis into existing usage of defer(). I had suggested that going into error state (perhaps in a more graceful way than ops currently provides for, e.g. no traceback in debug-log) would alleviate some of the need for “juju retry” (he convinced me otherwise, mostly with the same arguments that are in this thread).

Although either I misunderstood him then or his stance has softened over the last couple of weeks, since:

I do feel this is the most compelling case for me (one that @sed-i also mentions). With “purely” Juju events, as far as I understand, you should be able to rely on the snapshot of the model being available. That’s not really the case with Pebble - there are plenty of reasons why your Pebble call might temporarily fail, so you need code like this everywhere (not just pebble-ready, I think):

try:
    container.replan()  # or exec, or add_layer, etc
except ops.pebble.ConnectionError:  # probably also ChangeError, etc
    logger.warning(...)
    event.defer()

Or if you’re willing to have the hook take longer:

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(ops.pebble.ConnectionError),
    wait=tenacity.wait_fixed(2) + tenacity.wait_random(0, 5),
    stop=tenacity.stop_after_delay(10),
    reraise=True,
)
def pebble_thing():
    ...

try:
    pebble_thing()
except ops.pebble.ConnectionError:
    self.unit.status = ops.BlockedStatus("Can't talk to Pebble")  # WaitingStatus? Hard to know if this fixes itself.
    return

(Even then that might need to defer(), since there’s no (jhack-less) way to say “run that event again, Pebble’s ok now”. The status is also awkward to set - it’s a “pull” status in that you can do can_connect() but that introduces a race where your main handler failed and your collect status handler succeeded, so really you have to get that failure to your collect status handler some other way).

It does still seem cleaner to me in this case to let the hook exit non-zero and have Juju retry the event and show the state as in error (it is in error - something is broken with Pebble!).

sed-i · 26 January 2024 02:09

Good point about the race @tony-meyer. Since it would be difficult to have “singleton checks”, this seems like an encouragement to use collect-status as a common-exit-hook reconciler. The trouble is that it may introduce unnecessary outage (e.g. config update, restart) every hook.

Another option is for the main handler to convert “pull” statuses into “push” statuses (update _stored), so that collect-status won’t need to do a fresh check.
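A rough sketch of that second option, assuming the main handler records what it actually observed and collect-status only reads it back. `PushStatuses` and `on_config_changed` are hypothetical names, and a plain dict stands in for whatever survives charm re-init (e.g. StoredState) so the snippet runs on its own.

```python
class PushStatuses:
    """Hypothetical store for "push" statuses, keyed by the check that set them."""

    def __init__(self, backing: dict):
        # `backing` must outlive this object; it simulates persisted state.
        self._backing = backing

    def push(self, key: str, name: str, message: str) -> None:
        self._backing[key] = (name, message)

    def clear(self, key: str) -> None:
        self._backing.pop(key, None)

    def collect(self) -> list:
        return list(self._backing.values())


def on_config_changed(store: PushStatuses, pebble_reachable: bool) -> None:
    """Main handler: records the racy check's result instead of re-checking later."""
    if not pebble_reachable:
        store.push("pebble", "waiting", "could not reach Pebble")
        return
    # The same code path that sets a push status also clears it on success.
    store.clear("pebble")


persisted = {}
on_config_changed(PushStatuses(persisted), pebble_reachable=False)
# A fresh PushStatuses instance simulates charm re-init on the next hook.
after_failure = PushStatuses(persisted).collect()
on_config_changed(PushStatuses(persisted), pebble_reachable=True)
after_success = PushStatuses(persisted).collect()
```

Keying each push status by the check that produced it gives an obvious place to clear it, which seems to be the hard part of this approach.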

james-page · 29 January 2024 12:09

Openstack charms still are tested with automatic retry disabled to ensure that it comes up without errors as part of regular behavior. I believe they aren’t run that way in the wild, but they do consider hitting Error a CI rejection failure.

That’s certainly true for the older machine based OpenStack Charms - we make some pretty extensive use of tools such as tenacity (mentioned by @tony-meyer) to retry calls to services the charm is managing when the service may take “some time” to startup.

For the Sunbeam charms I think we’ve relaxed this stance a bit - we see similar issues with pebble connection errors from time-to-time plus some other misc issues which we think are related to the K8S substrate rather than the charm or workload specifically.

To be clear - I’m not super comfortable with this and feel this is tech-debt that we need to pay off at some point in time!

jameinel · 30 January 2024 14:10

The way I consider Error state is that something is fundamentally off, such that I cannot progress the model of the world. IMO it should imply that both a human needs to get involved, and that it is a situation that the charm couldn’t reasonably predict what the right behavior is. For example, if you need to talk to the K8s api and change something, and the connection to K8s fails, that seems ok for an Error state. You can’t progress the current state of the world. As @vultaire noted, though, it takes that charm out of happy progression, and certainly shouldn’t be used for a simple retry. Juju doesn’t let you juju remove an application in an error state without prompting, because the charm has abdicated responsibility by going into Error. Which means we have no idea whether things will clean up correctly or not. I could see a case for juju remove-X --ignore-errors but just like --force it is highly likely to be heavily abused and then bugs filed when “I ran juju remove-app --ignore-errors and it left behind all this crap.” (Because you asked us to ignore the fact that it was failing to clean up.)

Real world ops means that going into error state stops you from reacting to what is happening. You might have to do so, but in doing so, you have asked someone else to take over your job. Your job as a charm is to maintain the application, and you have signaled that you are unable to do so. There are times when that is ok (you lost internet connectivity, or broke your leg, it’s ok to let someone know that you can’t work for a while and they need to shepherd what you were working on).

Things that need normal human intervention (bad config, missing relation) should certainly not be errors, and that is what Blocked is for. But missing Pebble or K8s is an interesting one, as those are likely actually transient errors rather than fatal wake-a-human errors. (Or at the very least, having a small number of retries before waking a human would be good. If you really couldn’t reach pebble after 10 retries you really should be failing hard and alerting someone else because you aren’t in an understandable state.)