It's probably ok for a unit to go into error state

sed-i · 30 January 2024 19:18

Thanks @jameinel! I updated with your input.

I wonder what your thoughts are on error vs blocked if we wrap main itself in try-except:

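if __name__ == "__main__":
    try:
        main(AlertmanagerCharm)
    except Exception as e:
        logger.error(...)
        # Pseudocode: there is no `self` at module scope, so the status
        # would have to be set some other way.
        self.unit.status = BlockedStatus(...)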

For delta-charming this would create a mess, but for a holistic reconciler this could perhaps work? Seems like an easy way to “convert” error to blocked.

In both “error” and this “blocked” the workload would continue to run as it were.
Nice solution to schema mismatch (to be resolved by a future relation-changed).

ppasotti · 31 January 2024 08:37

IMHO this should be in ops, if anywhere, because we still want ops bugs to bubble up, as those are outside of the charm domain (even the holistic one).

But at the moment I don’t see another way to tell juju ‘just for this charm, never stop the world’.

jameinel · 31 January 2024 15:56

We actively discussed a BlockedStatusException in ops, as a way to “stop the rest of my processing, and just go into blocked state with this message”, rather than using self.status=BlockedStatus(…), and then trying to figure out how to unwind your stack without an exception.

It doesn’t handle multiple statuses, and there are ways that collect-status is a nicer solution, but in the ‘is it easy to do the right thing when you have a simple charm’, having a known exception would be a pretty big win.

I wouldn’t do a generic “catch all Exceptions” because many of them really are exceptions and the data model should stop progressing.
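A minimal sketch of that idea, using plain-Python stand-ins so it runs without ops. `BlockedStatusException` is hypothetical here (the post only says it was discussed), and `FakeUnit`, `reconcile`, and `run_hook` are names invented for illustration. Only the known exception is converted to blocked; anything else still propagates.

```python
from dataclasses import dataclass


@dataclass
class BlockedStatus:
    """Stand-in for ops.BlockedStatus, so the sketch runs without ops."""
    message: str


@dataclass
class ActiveStatus:
    """Stand-in for ops.ActiveStatus."""
    message: str = ""


class BlockedStatusException(Exception):
    """Hypothetical: 'stop processing and go into blocked with this message'."""


class FakeUnit:
    """Stand-in for the unit object; it only holds a status."""

    def __init__(self):
        self.status = None


def reconcile(config):
    """Illustrative charm logic that bails out from deep in the stack."""
    if "external_url" not in config:
        raise BlockedStatusException("missing 'external_url' config option")
    # A genuine bug raised here (e.g. an AttributeError) is NOT caught below.
    return config["external_url"].rstrip("/")


def run_hook(unit, config):
    """Convert only the known exception to blocked; let the rest bubble up."""
    try:
        reconcile(config)
    except BlockedStatusException as e:
        unit.status = BlockedStatus(str(e))
    else:
        unit.status = ActiveStatus()
```

Because the `except` clause names one exception type, an unexpected failure still surfaces as a hook error, which matches the point about not catching all exceptions.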

dylanstathis · 1 February 2024 13:18

Here is a great example of BlockedStatus being misused where it should probably be an error.

jameinel · 7 February 2024 16:44

Interesting, because IMO that one is very appropriate for Blocked status. It may be that this application cannot reach Loki, but are there other states of the model that you would like to progress in the meantime? Or is the only reason to process any event that it can tell Loki about it (which would be a valid Error state), since you can’t handle any event when there is nowhere to resolve it?

dylanstathis · 12 February 2024 12:34

BlockedStatus is for when human intervention is required. What can the admin do to solve the issue in that case?

jameinel · 12 February 2024 17:40

Given that the loki endpoints are unreachable, I would expect something around either configuration, or relations that indicate something misconfigured. Or possibly do something to ensure that loki is actually up and running.

sed-i · 12 February 2024 19:08

It’s not always easy to tell:

If unreachable due to e.g. firewall rules, then blocked.
If unreachable because some fetch-lib resulted in a subtle unpredictable schema mismatch, then error.
If unreachable because loki is being upgraded at exactly this point, then Waiting, but we have no way of knowing it from another charm (and we do not want to use reldata), so error again?
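The three cases above can be written down as a small lookup. This is only a sketch: `Cause` and `suggested_outcome` are hypothetical names, and it assumes the charm already knows the cause, which is exactly the part that is hard in practice.

```python
from enum import Enum


class Cause(Enum):
    """Hypothetical labels for why the remote endpoint is unreachable."""
    FIREWALL = "firewall rules"
    SCHEMA_MISMATCH = "schema mismatch after fetch-lib"
    REMOTE_UPGRADING = "remote is being upgraded"


def suggested_outcome(cause, can_detect_upgrade=False):
    """Map a known cause to the outcome suggested in the list above."""
    if cause is Cause.FIREWALL:
        # An admin can fix firewall rules, so blocked fits.
        return "blocked"
    if cause is Cause.SCHEMA_MISMATCH:
        # Unpredictable and not something the admin can act on.
        return "error"
    if cause is Cause.REMOTE_UPGRADING:
        # Waiting would be right, but only if the upgrade is observable.
        return "waiting" if can_detect_upgrade else "error"
    raise ValueError(f"unknown cause: {cause!r}")
```

For example, `suggested_outcome(Cause.REMOTE_UPGRADING)` falls back to `"error"` because, by default, the sketch assumes the upgrade cannot be detected from the other charm.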

And do we want to prevent advancing the model in all the situations above?

On one hand, it’s nice that the error status will cause juju to maintain the context on retry.
On the other hand, apart from messaging via relation data, there is no ecosystem-level way of signaling that a piece of the puzzle is in blocked/error status, so it does not help the charm on the other side of the relation in realizing that there’s a problem and that it should take some measures.

dylanstathis · 12 February 2024 19:29

It was perhaps a bad example. We ended up removing the code because of the reasons outlined by @sed-i above.

beliaev-maksim · 5 March 2024 20:59

I was triggered by the title, so, have to put my 2 cents anyway.

I do not think that going into error state is good practice: it basically means we do not know how to handle the situation and thus we fail.
Blocked status is not a solution either. The documentation is explicit that Blocked requires an admin to go and unblock.

The best behaviour (I admit not always easy to achieve) is that the charm code catches all the errors and knows how to react to them to overcome the issue, or else reports to the admin what needs to be done to improve the situation.

+100 to Tom’s message: in development you should even disable hook retries and develop the charm for a zero-error scenario, and when an error happens, find the root cause and the solution. A charm should be a plug-and-play solution for the user: zero ops, zero pain. All this burden is on the charm developer.

ppasotti · 6 March 2024 06:52

Spoken like one who’s not a charm developer

Jokes aside, I have a lot of sympathy for the no-blocked/no-error policy. The promise of charms is that they lift operator knowledge. An operator would know how to fix <insert error>. We can’t just say it’s OK to tell the end user (who’s not an operator) to go figure it out.

Of course there’ll always be edge cases that slip through our fingers but those we call bugs, and will be fixed in the next revision of the charm.

jameinel · 6 March 2024 18:02

As an interesting anecdote in this space:

Is specifically because the charm cannot progress past the “django-app-pebble-ready” phase because it is failing with an exception trying to connect to the loki_push_api. We’re trying to work through whether you could run actions while a hook is failing (theoretically you should), but because it went into error that app is unable to move forward on anything else.

I don’t know this for sure, but I would guess that ‘django-app’ could probably be doing a lot of things that don’t involve forwarding its logs to loki in the meantime.
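A rough sketch of what “doing other things in the meantime” could look like, in plain Python rather than real ops or charm code. Every name here (`PushEndpointUnreachable`, `configure_workload`, `configure_log_forwarding`, `on_pebble_ready`) is hypothetical, and “blocked” is just one possible choice of non-error status; the thread above debates which status fits.

```python
class PushEndpointUnreachable(Exception):
    """Hypothetical error: the log push endpoint cannot be reached."""


def configure_workload(state):
    """Illustrative step that does not depend on log forwarding."""
    state["workload_configured"] = True


def configure_log_forwarding(state, endpoint_reachable):
    """Illustrative step that fails when the endpoint is unreachable."""
    if not endpoint_reachable:
        raise PushEndpointUnreachable("cannot reach log push endpoint")
    state["log_forwarding"] = True


def on_pebble_ready(endpoint_reachable):
    """Run independent steps first; isolate the log-forwarding failure."""
    state = {"workload_configured": False, "log_forwarding": False, "status": None}
    configure_workload(state)
    try:
        configure_log_forwarding(state, endpoint_reachable)
    except PushEndpointUnreachable as e:
        # Record the problem without raising, so the hook does not fail.
        state["status"] = ("blocked", str(e))
    else:
        state["status"] = ("active", "")
    return state
```

With the endpoint unreachable, the workload is still configured and the handler returns normally, so later events are not held up by a failing hook.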