It's probably ok for a unit to go into error state

From my previous experience, when units are in an error state, that interferes with model migrations between Juju controllers. It may also interfere with Juju upgrades, although I don’t recall for sure; maybe that’s fine. But model migrations are definitely problematic, and some Juju upgrades require a model migration as well.

In other words, if there are “expected” cases where a unit is going to go into an error state, then model migrations and at least some Juju upgrades are blocked, full stop, until those issues are resolved. That seems very problematic to me, which is why I prefer not to see error states unless a charm is actually breaking due to a bug.

I think @sed-i summarizes the different types of situations well, and I like the concepts of push and pull statuses. I haven’t really used collect-status yet, but many Kubeflow charms implement something similar internally. Where it gets tricky, though, is with the push statuses @sed-i mentioned: the operations behind them might change state, so you can’t just poll for them; you have to actually attempt them.

I recall that the design discussion for collect-status included a debate on stateless status (to check status, you actually check the current state of things) vs stateful status (whenever something happens you store the resulting status, and later something checks that stored state). I wonder if the debate missed that we actually sometimes need both.

I wonder, @sed-i, have you ever tried to write a pull-status function for things you see as push statuses? I’ve done that a bit and don’t recall ever hitting a case where it wouldn’t work.


Yeah, rightly or wrongly, I see the error state as a “never get here on purpose” sort of thing. If there’s a problem that you foresee and try to handle, that handling shouldn’t put the unit into error. The reason is what @vultaire and @sed-i mention: the “this errored event must resolve before anything else” behaviour of error is not always desirable, and the side effects of going into Error are broad.

I can see the appeal of putting a charm into error intentionally, though. Error:

  • halts the charm (prevents other things from happening). There’s no built-in way to do this otherwise
  • has a retry mechanism, so you know Juju will try this event again

When you want both of those, it is actually useful (although sometimes what you really want is just one of those, and then Error is again a problem).

To me, the root cause is the lack of built-in ways to:

  • halt parts of a charm until X succeeds
  • retry X in Y seconds

If we had those two tools, many of these problems would fall away.
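In the absence of those tools, charms tend to approximate them with guard clauses plus defer(); a rough sketch, where the handler and helper names are made up:

def _reconcile(self, event):
    if not self._container.can_connect():
        # "Halt" this part of the charm until Pebble is reachable. defer() is
        # only a crude stand-in for "retry X in Y seconds": the deferred event
        # re-runs on the next dispatched event, not after a timer.
        event.defer()
        return
    self._configure_workload()  # hypothetical helper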


Regarding BlockedStatus, I’m not sure why people are so confused about that one. In the Ops API reference, the BlockedStatus docs say:

The unit requires manual intervention.

An admin has to manually intervene to unblock the unit and let it proceed.

The example we give under Application.status is clear too:

self.model.app.status = ops.BlockedStatus('I need a human to come help me')

I like @sed-i’s thoughts here, and his explanation of “pull” vs “push” statuses. Using a little stored state variable for any push statuses is a good idea to augment collect-status (though you’d have to think carefully about how and when you’d clear the push statuses).
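For illustration, here’s a minimal sketch of that idea; the relation name, the stored field, and the choice of where to clear it are all made up:

import ops

class MyCharm(ops.CharmBase):
    _stored = ops.StoredState()

    def __init__(self, framework):
        super().__init__(framework)
        self._stored.set_default(push_status=None)  # message recorded by other handlers
        framework.observe(self.on.config_changed, self._on_config_changed)
        framework.observe(self.on.collect_unit_status, self._on_collect_status)

    def _on_config_changed(self, event):
        # A successful pass through a main handler is one reasonable place to
        # clear any previously recorded push status.
        self._stored.push_status = None

    def _on_collect_status(self, event: ops.CollectStatusEvent):
        # "Pull" status: check the current state of things directly.
        if not self.model.get_relation("database"):
            event.add_status(ops.BlockedStatus("missing database relation"))
        # "Push" status: something another handler stored earlier.
        if self._stored.push_status is not None:
            event.add_status(ops.BlockedStatus(self._stored.push_status))
        event.add_status(ops.ActiveStatus())

ops picks the highest-priority status of those added, so the ActiveStatus here is just the fallback.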

I agree we should probably error more than we do. For example, in a pebble-ready handler, it’s expected that you can connect to the container (it’s Pebble ready, after all), so you should probably just let the error bubble up on failed Pebble operations. Or if you’re hitting a K8s or other API that should be active, maybe just letting it fail/error is the right thing to do. If the API is flaky, do, say, 3 retries of that operation in a simple loop (or use a retry decorator).
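For example, a simple bounded retry in that style (assuming container is an ops.Container already in scope, and "workload" is a made-up service name):

import time

import ops

last_exc = None
for attempt in range(3):
    try:
        container.restart("workload")
        break
    except ops.pebble.ConnectionError as exc:
        last_exc = exc
        time.sleep(2)  # brief pause before retrying
else:
    # All retries failed: let the exception bubble up so the hook errors out.
    raise last_exc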

But as others have pointed out, error-ing does interfere with juju remove-*. The error status is also going to show up in juju status, which – if it’s just a temporary glitch – may not be good UX. I agree that a lot of problems could be solved with a proper “defer for X seconds” or “Juju, please retry after X seconds” tool.


Edit: This is in response to @benhoyt two comments back. Struggling to format it correctly on my phone.

I think the confusion comes from what human intervention means. I think people see that and think: the charm can’t recover automatically, so a human needs to debug the issue, hence blocked status.

My understanding is that it is not actually for the situation above, but for a situation where the charm can tell the human what needs to be done. Do you agree?


Paraphrasing everyone,

Use centralized status setting (the reconcile method), and propagate status changes using StatusException. @carlcsaposs, @ca-scribner

Seems like this approach does not address potential status overwrite on charm re-init.

The automatically-retry-hooks model config is not always set to true. @mthaddon

I was not aware of that model config option. It’s great to set it to false in a dev env, but is it always true in production?

Error status is bad UX / a sticky situation that’s difficult to get out of (e.g. it blocks model migrations and upgrades). @carlcsaposs, @ca-scribner, @vultaire

Agreed. Would need to think carefully before intentionally choosing error state.

It may be bad UX when an error status blips into juju status. @benhoyt

Yes, it would also require python-libjuju itests to use wait_for_idle(raise_on_error=False) followed by another wait_for_idle or a status assertion. But I’m thinking that if the dev decided to raise an error rather than commit a blocked status, then juju status is probably not a decisive factor?
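For illustration, that pattern in a pytest-operator itest might look something like this (the app name is made up):

import pytest

@pytest.mark.abort_on_fail
async def test_charm_settles(ops_test):
    # The first wait tolerates transient hook errors while Juju retries them...
    await ops_test.model.wait_for_idle(
        apps=["alertmanager"], raise_on_error=False, timeout=600
    )
    # ...then assert that the charm eventually settles into active status.
    await ops_test.model.wait_for_idle(
        apps=["alertmanager"], status="active", timeout=600
    )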

OpenStack charms are still tested with automatic retry disabled, to ensure that the deployment comes up without errors as part of regular behavior. I believe they aren’t run that way in the wild, but they do consider hitting Error a CI rejection failure.

I think most cases are clear, but there are some cases where it’s not possible to tell whether manual intervention is needed. Say that you can’t proceed because you need a relation - is that blocked (human needs to run “integrate”) or waiting (Juju needs to set up the integration that has already been requested)?

This topic came up when @benhoyt and I were discussing early analysis into existing usage of defer(). I had suggested that going into error state (perhaps in a more graceful way than ops currently provides for, e.g. no traceback in debug-log) would alleviate some of the need for “juju retry” (he convinced me otherwise, mostly with the same arguments that are in this thread).

Although either I misunderstood him then or his stance has softened over the last couple of weeks, given his comment above that we should probably error more than we do (e.g. letting errors bubble up on failed Pebble operations).

I do feel this is the most compelling case for me (one that @sed-i also mentions). With “purely” Juju events, as far as I understand, you should be able to rely on the snapshot of the model being available. That’s not really the case with Pebble - there are plenty of reasons why your Pebble call might temporarily fail, so you need code like this everywhere (not just pebble-ready, I think):

try:
    container.replan()  # or exec, or add_layer, etc
except ops.pebble.ConnectionError:  # probably also ChangeError, etc
    logger.warning(...)
    event.defer()

Or if you’re willing to have the hook take longer:

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(ops.pebble.ConnectionError),
    wait=tenacity.wait_fixed(2) + tenacity.wait_random(0, 5),
    stop=tenacity.stop_after_delay(10),
    reraise=True,
)
def pebble_thing():
    ...

try:
    pebble_thing()
except ops.pebble.ConnectionError:
    self.unit.status = ops.BlockedStatus("Can't talk to Pebble")  # WaitingStatus? Hard to know if this fixes itself.
    return

(Even then that might need to defer(), since there’s no (jhack-less) way to say “run that event again, Pebble’s ok now”. The status is also awkward to set: it’s a “pull” status in that you can do can_connect(), but that introduces a race where your main handler failed and your collect-status handler succeeded, so really you have to get that failure to your collect-status handler some other way.)

It does still seem cleaner to me in this case to let the hook exit non-zero and have Juju retry the event and show the state as in error (it is in error - something is broken with Pebble!).

Good point about the race, @tony-meyer. Since it would be difficult to have “singleton checks”, this seems like an encouragement to use collect-status as a common-exit-hook reconciler. The trouble is that it may introduce an unnecessary outage (e.g. a config update or restart) every hook.

Another option is for the main handler to convert “pull” statuses into “push” statuses (update _stored), so that collect-status won’t need to do a fresh check.
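A small sketch of that conversion (the handler names and the _stored field are made up, and pebble_failed is assumed to have a set_default on init): the main handler records the Pebble failure, and collect-status just reports it without a fresh can_connect() check.

def _on_config_changed(self, event):
    try:
        self._container.replan()
        self._stored.pebble_failed = False
    except ops.pebble.ConnectionError:
        # Convert the "pull" check into a "push" status for collect-status.
        self._stored.pebble_failed = True

def _on_collect_status(self, event: ops.CollectStatusEvent):
    if self._stored.pebble_failed:
        event.add_status(ops.WaitingStatus("waiting for Pebble"))
    event.add_status(ops.ActiveStatus())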


OpenStack charms are still tested with automatic retry disabled, to ensure that the deployment comes up without errors as part of regular behavior. I believe they aren’t run that way in the wild, but they do consider hitting Error a CI rejection failure.

That’s certainly true for the older machine-based OpenStack charms - we make some pretty extensive use of tools such as tenacity (mentioned by @tony-meyer) to retry calls to services the charm is managing when the service may take “some time” to start up.

For the Sunbeam charms I think we’ve relaxed this stance a bit - we see similar issues with Pebble connection errors from time to time, plus some other miscellaneous issues which we think are related to the K8s substrate rather than the charm or workload specifically.

To be clear - I’m not super comfortable with this and feel this is tech-debt that we need to pay off at some point in time!

The way I consider Error state is that something is fundamentally off, such that I cannot progress the model of the world. IMO it should imply both that a human needs to get involved, and that it is a situation where the charm couldn’t reasonably predict what the right behavior is. For example, if you need to talk to the K8s API and change something, and the connection to K8s fails, that seems ok for an Error state. You can’t progress the current state of the world. As @vultaire noted, though, it takes that charm out of happy progression, and it certainly shouldn’t be used for a simple retry. Juju doesn’t let you juju remove an application in an error state without prompting, because the charm has abdicated responsibility by going into Error, which means we have no idea whether things will clean up correctly or not. I could see a case for juju remove-X --ignore-errors, but just like --force it is highly likely to be heavily abused, and then bugs filed when “I ran juju remove-app --ignore-errors and it left behind all this crap” (because you asked us to ignore the fact that it was failing to clean up).

Real world ops means that going into error state stops you from reacting to what is happening. You might have to do so, but in doing so, you have asked someone else to take over your job. Your job as a charm is to maintain the application, and you have signaled that you are unable to do so. There are times when that is ok (you lost internet connectivity, or broke your leg, it’s ok to let someone know that you can’t work for a while and they need to shepherd what you were working on).

Things that need normal human intervention (bad config, missing relation) should certainly not be errors; that is what Blocked is for. But missing Pebble or K8s is an interesting one, as those are likely transient errors rather than fatal wake-a-human errors. (Or at the very least, having a small number of retries before waking a human would be good. If you really couldn’t reach Pebble after 10 retries, you really should be failing hard and alerting someone else, because you aren’t in an understandable state.)


Thanks @jameinel! I updated with your input.

I wonder what your thoughts are about error vs blocked if we wrap main itself in a try-except:

if __name__ == "__main__":
    try:
        main(AlertmanagerCharm)
    except Exception as e:
        logger.error(...)
        # Note: there is no `self` at module scope, so in practice this would
        # need the status-set hook tool, or to be caught inside the charm.
        self.unit.status = BlockedStatus(...)

For delta-charming this would create a mess, but for a holistic reconciler this could perhaps work? It seems like an easy way to “convert” error to blocked.

  • In both “error” and this “blocked”, the workload would continue to run as before.
  • A nice solution to a schema mismatch (to be resolved by a future relation-changed).

IMHO this should live, if anywhere, in ops, because we do still want ops bugs to bubble up, as those are outside the charm domain (even the holistic one).

But at the moment I don’t see another way to tell juju ‘just for this charm, never stop the world’.

We actively discussed a BlockedStatusException in ops, as a way to “stop the rest of my processing, and just go into blocked state with this message”, rather than using self.unit.status = BlockedStatus(…) and then trying to figure out how to unwind your stack without an exception.

It doesn’t handle multiple statuses, and there are ways that collect-status is a nicer solution, but in the ‘is it easy to do the right thing when you have a simple charm’ case, having a known exception would be a pretty big win.

I wouldn’t do a generic “catch all Exceptions” because many of them really are exceptions and the data model should stop progressing.
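Something like that is already doable per charm today by catching only a specific, charm-defined exception rather than everything; a rough sketch, where the exception class and the config option are hypothetical:

import ops

class BlockedStatusError(Exception):
    """Raised anywhere in the charm to stop processing with a blocked status."""

class MyCharm(ops.CharmBase):
    def __init__(self, framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._dispatch)

    def _dispatch(self, event):
        try:
            self._reconcile(event)
        except BlockedStatusError as e:
            # Unwind the stack and surface the message as a blocked status.
            self.unit.status = ops.BlockedStatus(str(e))

    def _reconcile(self, event):
        if not self.model.config.get("required-option"):
            raise BlockedStatusError("required-option must be set")
        self.unit.status = ops.ActiveStatus()

The catch lives in one place, so a simple charm gets the “unwind and go blocked” behaviour without threading status returns through every helper, while any other exception still errors the hook.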

Here is a great example of BlockedStatus being misused where it should probably be an error.

Interesting, because IMO that one is very appropriate for Blocked status. It may be that this application cannot reach Loki, but are there other states of the model that you would like to progress in the meantime? Or is the only reason any event is progressed that it can tell Loki about it (which would be a valid Error state), since you can’t handle any event because there is nowhere to resolve it?

BlockedStatus is for when human intervention is required. What can the admin do to solve the issue in that case?

Given that the Loki endpoints are unreachable, I would expect something around either configuration or relations that indicates something is misconfigured, or possibly doing something to ensure that Loki is actually up and running.

It’s not always easy to tell:

  • If unreachable due to e.g. firewall rules, then blocked.
  • If unreachable because some fetch-lib resulted in a subtle unpredictable schema mismatch, then error.
  • If unreachable because Loki is being upgraded at exactly this point, then Waiting, but we have no way of knowing that from another charm (and we do not want to use reldata), so error again?

And do we want to prevent advancing the model in all the situations above?

  • On one hand, it’s nice that the error status will cause juju to maintain the context on retry.
  • On the other hand, apart from messaging via relation data, there is no ecosystem-level way of signaling that a piece of the puzzle is in blocked/error status, so it does not help the charm on the other side of the relation realize that there’s a problem and take some measures.