Warning - opinion ahead!
The error state is useful for charm authors, but not helpful in production. If I’m someone who deploys charms (but doesn’t write them), I don’t have the expertise to fix the problem.
Here’s a relatively common pattern for resolving charm errors in a live model:
- run `juju debug-log --include <unit-id>` (but how did I learn about this command? It’s not included in any of the charm’s documentation)
- run `juju debug-hook`, which sends me into a tmux session (what if I’ve never used tmux?)
- edit the charm (but then I need to know where to find its source code. How do people learn to look in `/var/lib/juju/...`?)
- cross fingers
- then run `juju resolve`
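For anyone who wants the command part of that list in one place, here is a rough sketch. It only prints the commands for a given unit and runs nothing against a model. The function name `plan_recovery` and the unit name `myapp/0` are made up for illustration, passing the unit name to `juju debug-hook` and `juju resolve` is my assumption, and the command spellings are the ones used above (check `juju help <command>` for the exact form in your release).

```shell
#!/bin/sh
# Hypothetical helper: print the recovery commands from the list above for one unit.
# It only echoes the commands; nothing is executed against a live model.
plan_recovery() {
    unit="$1"   # placeholder unit name, e.g. myapp/0
    echo "juju debug-log --include $unit"   # read the logs for the failing unit
    echo "juju debug-hook $unit"            # open the tmux debugging session (unit argument assumed)
    echo "juju resolve $unit"               # clear the error state once the charm is fixed (unit argument assumed)
}

plan_recovery "myapp/0"
```

The manual steps (editing the charm and crossing fingers) sit between the second and third commands, which is exactly the part a script cannot do for you.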
It can be simpler to remove an application and re-deploy. But this can impose significant downtime in production.
The error state is a very powerful hammer. It suspends the charm and limits the ability of a poorly coded charm to do further damage. But that doesn’t really help people who are running live workloads. @zicklag has a very relevant anecdote:
Juju charms, for safety purposes have a rather conservative behavior when any charm hook fails. The charm will go into an error state and essentially freeze all operations ( other than retrying the previous operation ) to prevent data loss. The issue with this is that it tends to result in cascading irrecoverable error states when faced with a bug in the charm.
For example, I ran into a situation where I had an HTTP proxy charm that had a bug that caused a hook failure under certain conditions. I related this HTTP proxy charm ( not knowing about the bug ) to a Grafana charm. When the proxy charm went into an errored state the only way to fix it was to force remove the charm. I force removed the proxy charm, but then Grafana’s hook errored out, not because of a bug in Grafana, but because the charm’s Juju agent was eternally trying to respond to a relation hook event on a relation that no longer existed, since the force removal of the proxy.
My only recourse was to force remove Grafana. If that Grafana charm had been related to a database, the database charm probably would have errored and had to be force removed as well, and so on.
That only happened in a dev environment but that is a very scary thing to put into production. One unforeseen situation in charm code could necessitate the removal of my entire production stack. And that’s not all.
Is this something that’s fixable? What are your suggestions for improving the situation?