It's probably ok for a unit to go into error state

One other potential drawback is that error status prevents the user from removing the charm (without --force, aka without cleanup)

In the case of mysql-router, this was important enough for us to use WaitingStatus("[...] Will retry next Juju event") instead of error status. Context: PR description

This depends on model configuration, specifically automatically-retry-hooks. It defaults to true but it’s not set that way everywhere.


Speaking of, what are the use cases/rationales for people to turn it off?


Operational teams at Canonical used to consider it a bug if a charm they maintained ever entered an error state, so they would turn automatic retries off to track down when errors happen during charm development and avoid them occurring in production wherever possible. I’m not sure if this is still the case; that’s going back quite a few years now.


I think I tell someone not to set BlockedStatus about twice per week. Blocked status is supposed to be for when the charm is waiting on a user action (adding a relation, setting a config value, etc.) but it seems to be used for every kind of issue. I wonder if there is a better way to get that message across to charmers?
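Concretely, the split I try to get across looks something like this in a collect-status handler (a minimal sketch; the “workload” container and the “license-key” config option are made up): Blocked only when a human has to act, Waiting when Juju or the workload will sort itself out.

import ops


class MyCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.collect_unit_status, self._on_collect_status)

    def _on_collect_status(self, event: ops.CollectStatusEvent):
        if not self.config.get("license-key"):
            # A human has to run `juju config`, so Blocked fits here.
            event.add_status(ops.BlockedStatus("config option 'license-key' is required"))
        if not self.unit.get_container("workload").can_connect():
            # Nothing for a human to do; Pebble will come up on its own, so Waiting.
            event.add_status(ops.WaitingStatus("waiting for Pebble in the 'workload' container"))
        event.add_status(ops.ActiveStatus())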

re propagating status: similar to what @carlcsaposs cites, the kubeflow team uses ErrorWithStatus, which feels pretty similar. Openstack’s Sunbeam charms have something like this too, but they use it via a context manager, which imo feels really clean in a charm (sorry, searched for a link but didn’t see one quickly).
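For illustration, the general shape of that pattern looks roughly like this (not the actual ErrorWithStatus or Sunbeam code, just a sketch; the “database” endpoint and the handler names are made up): an exception type carries the status, the reconcile wrapper catches it in one place, and collect-status surfaces it.

from typing import Optional

import ops


class StatusException(Exception):
    """Carries the status the charm should surface."""

    def __init__(self, status: ops.StatusBase):
        super().__init__(status.message)
        self.status = status


class MyCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        self._pending_status: Optional[ops.StatusBase] = None
        framework.observe(self.on.config_changed, self._reconcile)
        framework.observe(self.on.collect_unit_status, self._on_collect_status)

    def _reconcile(self, event: ops.EventBase):
        try:
            self._configure_workload()
        except StatusException as e:
            # One central place records the status for collect-status to report.
            self._pending_status = e.status

    def _configure_workload(self):
        if not self.model.get_relation("database"):
            raise StatusException(ops.BlockedStatus("needs a 'database' integration"))
        # ... rest of the reconcile logic ...

    def _on_collect_status(self, event: ops.CollectStatusEvent):
        if self._pending_status is not None:
            event.add_status(self._pending_status)
        event.add_status(ops.ActiveStatus())

The context-manager flavour would just wrap the _configure_workload() call in a with block instead of the try/except; the idea is the same.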

re @carlcsaposs about error and removal - this really hurts the UX imo. New users won’t know how to get rid of an errored charm either, which hurts adoption.


From my previous experience, when units are in an error state, that interferes with model migrations between Juju controllers. It may also interfere with Juju upgrades, although I don’t recall; maybe that’s fine? But model migrations are definitely problematic, and some Juju upgrades also require model migrations as well.

I.e. if there are “expected” cases where something is going to go into an error state, Juju migrations and at least some Juju upgrades are blocked, full stop, until those issues are resolved. That seems very problematic to me, which is why I prefer not to see error states unless a charm is actually breaking due to a bug.

I think @sed-i summarizes the different types of situation well, and I like the concepts of push and pull statuses. I haven’t really used collect-status yet, but many Kubeflow charms implement something similar internally. Where it gets tricky, though, is with the push statuses @sed-i mentioned - they might change state, so you can’t poll them; you can only try the operation and see what happens.

I recall, in the design of collect-status, a debate on stateless status (to check status, you actually check the current state of things) vs stateful status (whenever something happens you store the status from that, and later something can check that state). I wonder if the debate missed that we actually sometimes need both.

I wonder @sed-i have you ever tried to write a pull status function for things you see as push statuses? I’ve done that a bit and don’t recall if I ever hit a case where it wouldn’t work.


Yeah rightly or wrongly I see error state as a “never get here on purpose” sort of thing. If there’s a problem that you foresee and try to handle, that handling shouldn’t be putting the unit into error. The reason being what @vultaire and @sed-i mention - the “this errored event must resolve before anything else” behaviour of error is not always desirable, and the side effects of going to Error are broad.

I can see the appeal of putting a charm into error intentionally though. Error:

  • halts the charm (prevents other things from happening). There’s no built-in way to do this otherwise
  • has a retry mechanism, so you know Juju will try this event again

When you want both of those, it is actually useful (although sometimes what you really want is just one of those, and then Error is again a problem).

To me, the root cause is the lack of built-in ways to:

  • halt parts of a charm until X succeeds
  • retry X in Y seconds

If we had those two tools, many of these problems fall away.
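In the meantime, the closest approximations I know of are an early-return guard for the “halt” half and defer() (or waiting for update-status) for the “retry” half. A fragment sketch, where _prereqs_ready() and _configure_workload() are made-up helpers:

def _reconcile(self, event: ops.EventBase):
    if not self._prereqs_ready():  # made-up helper: "has X succeeded yet?"
        # "Halt": skip the rest of this charm's work for now.
        # "Retry": ask Juju to replay the event later - but the timing isn't ours
        # to choose (it reruns before the next event, not "in Y seconds").
        event.defer()
        return
    self._configure_workload()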


Regarding BlockedStatus, I’m not sure why people are so confused about that one. In the Ops API reference, the BlockedStatus docs say:

The unit requires manual intervention.

An admin has to manually intervene to unblock the unit and let it proceed.

Also, the example we give under Application.status is clear too:

self.model.app.status = ops.BlockedStatus('I need a human to come help me')

I like @sed-i’s thoughts here, and his explanation of “pull” vs “push” statuses. Using a little stored state variable for any push statuses is a good idea to augment collect-status (though you’d have to think carefully about how and when you’d clear the push statuses).

I agree we should probably error more than we do. For example, in a pebble-ready handler, it’s expected that you can connect to the container (it’s Pebble ready, after all), so you should probably just let the error bubble up on failed Pebble operations. Or if you’re hitting a K8s or other API that should be active, maybe just letting it fail/error is the right thing to do. If the API is flaky, do, say, 3 retries of that operation in a simple loop (or use a retry decorator).
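Something like this, say (a sketch, not a recommendation of exact numbers):

import time

import ops


def replan_with_retries(container: ops.Container, attempts: int = 3) -> None:
    """Retry a flaky Pebble call a few times, then let the exception bubble up."""
    for attempt in range(attempts):
        try:
            container.replan()
            return
        except ops.pebble.ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of retries: let the hook fail and the unit go to error.
            time.sleep(2)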

But as others have pointed out, error-ing does interfere with juju remove-*. The error status is also going to show up in juju status, which – if it’s just a temporary glitch – may not be good UX. I agree that a lot of problems could be solved with a proper “defer for X seconds” or “Juju, please retry after X seconds” tool.


Edit: This is in response to @benhoyt two comments back. Struggling to format it correctly on my phone.

I think the confusion comes from what human intervention means. I think people see that and think: the charm can’t recover automatically, so a human needs to debug the issue, thus blocked state.

My understanding is that it is not actually for the situation above, but for a situation where the charm can tell the human what needs to be done. Do you agree?


Paraphrasing everyone,

Use centralized status setting (the reconcile method), and propagate status changes using StatusException. @carlcsaposs, @ca-scribner

Seems like this approach does not address potential status overwrite on charm re-init.

The automatically-retry-hooks model config is not always set to true. @mthaddon

I was not aware of that model config option. Setting it to false is great for a dev environment, but is it always set to true in production?

Error status is bad ux / sticky situation difficult to get out of (e.g. blocks model migration, upgrades). @carlcsaposs, @ca-scribner, @vultaire

Agreed. Would need to think carefully before intentionally choosing error state.

It may be bad ux when an error status bleeps in juju status. @benhoyt

Yes, it would also require python-libjuju itests to have wait_for_idle(raise_on_error=False) followed up with another wait_for_idle or a status assertion. But I’m thinking that if the dev decided to raise an error over committing a blocked status, then juju status is probably not a decisive factor?

OpenStack charms are still tested with automatic retry disabled to ensure that they come up without errors as part of regular behavior. I believe they aren’t run that way in the wild, but they do consider hitting Error a CI rejection failure.

I think most cases are clear, but there are some cases where it’s not possible to tell whether manual intervention is needed. Say that you can’t proceed because you need a relation - is that blocked (human needs to run “integrate”) or waiting (Juju needs to set up the integration that has already been requested)?
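One way I’d draw that line, as a collect-status fragment (a sketch; the “database” endpoint and the “endpoint” databag key are made up): no relation at all means a human has to run juju integrate, so Blocked; the relation exists but the remote app hasn’t published its data yet, so Waiting.

def _on_collect_status(self, event: ops.CollectStatusEvent):
    relation = self.model.get_relation("database")
    if relation is None or relation.app is None:
        event.add_status(ops.BlockedStatus("missing 'database' integration"))
    elif not relation.data[relation.app].get("endpoint"):
        event.add_status(ops.WaitingStatus("waiting for 'database' relation data"))
    event.add_status(ops.ActiveStatus())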

This topic came up when @benhoyt and I were discussing early analysis into existing usage of defer(). I had suggested that going into error state (perhaps in a more graceful way than ops currently provides for, e.g. no traceback in debug-log) would alleviate some of the need for “juju retry” (he convinced me otherwise, mostly with the same arguments that are in this thread).

Although either I misunderstood him then or his stance has softened over the last couple of weeks, since:

I do feel this is the most compelling case for me (one that @sed-i also mentions). With “purely” Juju events, as far as I understand, you should be able to rely on the snapshot of the model being available. That’s not really the case with Pebble - there are plenty of reasons why your Pebble call might temporarily fail, so you need code like this everywhere (not just pebble-ready, I think):

try:
    container.replan()  # or exec, or add_layer, etc
except ops.pebble.ConnectionError:  # probably also ChangeError, etc
    logger.warning(...)
    event.defer()

Or if you’re willing to have the hook take longer:

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(ops.pebble.ConnectionError),
    wait=tenacity.wait_fixed(2) + tenacity.wait_random(0, 5),
    stop=tenacity.stop_after_delay(10),
    reraise=True,
)
def pebble_thing():
    ...

try:
    pebble_thing()
except ops.pebble.ConnectionError:
    self.unit.status = ops.BlockedStatus("Can't talk to Pebble")  # WaitingStatus? Hard to know if this fixes itself.
    return

(Even then that might need to defer(), since there’s no (jhack-less) way to say “run that event again, Pebble’s ok now”. The status is also awkward to set - it’s a “pull” status in that you can call can_connect(), but that introduces a race where your main handler failed and your collect-status handler succeeded, so really you have to get that failure to your collect-status handler some other way.)

It does still seem cleaner to me in this case to let the hook exit non-zero and have Juju retry the event and show the state as in error (it is in error - something is broken with Pebble!).

Good point about the race, @tony-meyer. Since it would be difficult to have “singleton checks”, this seems like an encouragement to use collect-status as a common-exit-hook reconciler. The trouble is that it may introduce unnecessary outage (e.g. config update, restart) every hook.

Another option is for the main handler to convert “pull” statuses into “push” statuses (update _stored), so that collect-status won’t need to do a fresh check.
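Roughly like this (a sketch; the “workload” container name is made up) - the StoredState flag is the “push” record, written at the moment the failure happens, and collect-status just reads it instead of probing Pebble again:

import ops


class MyCharm(ops.CharmBase):
    _stored = ops.StoredState()

    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        self._stored.set_default(pebble_failed=False)
        framework.observe(self.on.workload_pebble_ready, self._on_pebble_ready)
        framework.observe(self.on.collect_unit_status, self._on_collect_status)

    def _on_pebble_ready(self, event: ops.PebbleReadyEvent):
        try:
            event.workload.replan()
            self._stored.pebble_failed = False
        except ops.pebble.ConnectionError:
            self._stored.pebble_failed = True  # "Push": remember the failure.

    def _on_collect_status(self, event: ops.CollectStatusEvent):
        if self._stored.pebble_failed:
            # Waiting vs Blocked is debatable here, as discussed above.
            event.add_status(ops.WaitingStatus("Pebble connection failed; will retry"))
        event.add_status(ops.ActiveStatus())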


OpenStack charms are still tested with automatic retry disabled to ensure that they come up without errors as part of regular behavior. I believe they aren’t run that way in the wild, but they do consider hitting Error a CI rejection failure.

That’s certainly true for the older machine-based OpenStack charms - we make some pretty extensive use of tools such as tenacity (mentioned by @tony-meyer) to retry calls to services the charm is managing when the service may take “some time” to start up.

For the Sunbeam charms I think we’ve relaxed this stance a bit - we see similar issues with pebble connection errors from time to time, plus some other misc issues which we think are related to the K8s substrate rather than the charm or workload specifically.

To be clear - I’m not super comfortable with this and feel this is tech-debt that we need to pay off at some point in time!

The way I think about Error state is that something is fundamentally off, such that I cannot progress the model of the world. IMO it should imply both that a human needs to get involved, and that it is a situation where the charm couldn’t reasonably predict what the right behavior is. For example, if you need to talk to the K8s API and change something, and the connection to K8s fails, that seems ok for an Error state. You can’t progress the current state of the world. As @vultaire noted, though, it takes that charm out of happy progression, and it certainly shouldn’t be used for a simple retry.

Juju doesn’t let you juju remove an application in an error state without prompting, because the charm has abdicated responsibility by going into Error, which means we have no idea whether things will clean up correctly or not. I could see a case for juju remove-X --ignore-errors, but just like --force it is highly likely to be heavily abused, and then bugs filed when “I ran juju remove-app --ignore-errors and it left behind all this crap” (because you asked us to ignore the fact that it was failing to clean up).

Real world ops means that going into error state stops you from reacting to what is happening. You might have to do so, but in doing so, you have asked someone else to take over your job. Your job as a charm is to maintain the application, and you have signaled that you are unable to do so. There are times when that is ok (you lost internet connectivity, or broke your leg, it’s ok to let someone know that you can’t work for a while and they need to shepherd what you were working on).

Things that need normal human intervention (bad config, missing relation) should certainly not be errors, and that is what Blocked is for. But missing Pebble or K8s is an interesting one, as those are likely actually transient errors rather than fatal wake-a-human errors. (Or at the very least, having a small number of retries before waking a human would be good. If you really couldn’t reach Pebble after 10 retries, you really should be failing hard and alerting someone else, because you aren’t in an understandable state.)


Thanks @jameinel! I updated with your input.

I wonder what your thoughts are about error vs blocked if we wrap main itself in try/except:

if __name__ == "__main__":
    try:
        main(AlertmanagerCharm)
    except Exception:
        logger.exception("Unhandled exception in charm code")
        # Sketch: `self` isn't in scope here, so actually setting the status needs
        # something like the status-set hook tool (and `import subprocess` up top):
        subprocess.run(["status-set", "blocked", "unhandled exception; see debug-log"], check=False)

For delta-charming this would create a mess, but for a holistic reconciler this could perhaps work? It seems like an easy way to “convert” error to blocked.

  • In both “error” and this “blocked” the workload would continue to run as it were.
  • Nice solution to schema mismatch (to be resolved by a future relation-changed).

IMHO this should live, if anywhere, in ops, because we do still want ops bugs to bubble up - those are outside of the charm domain (even the holistic one).

But at the moment I don’t see another way to tell juju ‘just for this charm, never stop the world’.