A short TLDR:
Charms can benefit from a self-health test mechanism that periodically monitors conditions which cannot be reliably expressed as alert rules, giving administrators early warnings and time to take preventive action.
Story time 
A Christmas Tale of Prometheus
In a land far away, there lived a Prometheus K8s operator that appeared to be doing just fine. Day and night, it watched over the kingdom’s workloads, collecting metrics, counting requests, and faithfully recording every spike, dip, and wobble.
All seemed peaceful.
But the Prometheus charm had a dark secret it couldn’t share with anyone: its disk space was about to run out.
Quietly, invisibly, its disk filled up. Day after day, Prometheus scraped happily, right up until the moment it couldn’t anymore. Without anyone noticing, the disk ran out.
The engineers and admins of the kingdom had but one question: why had there been no indication that the benevolent Prometheus charm was running out of space? Why had they not seen any warnings when looking at juju status?
Okay, back to the real world
This scenario isn’t too far-fetched. Imagine you have a charm that manages a workload needing persistent storage, e.g. Prometheus. As with almost all charms and workloads, you need to monitor the health of this Prometheus so that it keeps operating successfully, both now and in the future. For example, you need to ensure that there is plenty of disk space at deploy time and, more importantly, that you know when disk space is about to run out so you can take preventive action.
Take the bedtime story above as an example. The essence of the problem was that as Prometheus’ available disk space kept falling, there were no alerts or other indications to attract an admin’s attention. This particular charm doesn’t come with Node Exporter and, depending on the charm, you may not want to add Node Exporter anyway. Although Prometheus exposes its own metrics, none of them tracks available disk space. That means you cannot rely on alert rules.
Clearly, there are several charm/workload health checks that are best performed by the charm itself. Now, from the perspective of the charm author, the question becomes:
Why can’t the charm run health tests on itself and make the result known, possibly through juju status?
Going back to our Prometheus example, in the absence of Node Exporter, we could implement logic in the Prometheus charm code that, on update-status, checks its available disk space. If available space is less than e.g. 2 GiB, we can set status=blocked and display a message in juju status.
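A minimal sketch of such a check might look like the following. It is plain Python so that it runs anywhere; `check_disk_space`, `MIN_FREE_BYTES` and the returned `(status, message)` pair are hypothetical names for illustration, not part of any charm library. In a real charm, the update-status handler would translate the result into a unit status, and for a Kubernetes charm the free space would have to be measured in the workload container rather than in the charm container.

```python
import shutil

GIB = 1024 ** 3
MIN_FREE_BYTES = 2 * GIB  # threshold from the example above; tune per workload


def check_disk_space(path, min_free=MIN_FREE_BYTES, disk_usage=shutil.disk_usage):
    """Return a (status, message) pair describing free space at `path`.

    `disk_usage` is injectable so the check can be exercised without a real disk.
    """
    free = disk_usage(path).free
    if free < min_free:
        message = f"disk space <{min_free / GIB:g}GiB ({free / GIB:.1f}GiB free on {path})"
        return "blocked", message
    return "active", ""
```

Calling `check_disk_space("/")` on update-status and mapping `"blocked"` to a blocked unit status would be enough to surface the problem in juju status.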
And we needn’t stop at disk space. You can periodically (on update-status) check CPU usage, memory usage, or anything else relevant, and set blocked to capture the attention of an admin. The admin should then be able to determine exactly what needs to be done to restore the health of the charm. For example, if the charm goes into blocked with the status message “disk space <1GiB”, the action they need to take is easy to figure out.
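One way to combine several such checks is sketched below. This is an illustration under assumptions, not an existing API: `run_self_tests` and the convention that a check returns `None` when healthy (or a short message otherwise) are made up for this example.

```python
from typing import Callable, List, Optional, Tuple

# A check returns None when healthy, or a short message describing the problem.
Check = Callable[[], Optional[str]]


def run_self_tests(checks: List[Check]) -> Tuple[str, str]:
    """Run every check and fold the results into one (status, message) pair."""
    problems = []
    for check in checks:
        try:
            problem = check()
        except Exception as exc:  # a crashing check should not crash the hook
            problem = f"{getattr(check, '__name__', 'check')} failed: {exc}"
        if problem:
            problems.append(problem)
    if problems:
        return "blocked", "; ".join(problems)
    return "active", ""
```

Joining all failure messages keeps the status line actionable: the admin sees every failing check at once instead of only the first one.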
Some considerations
Of course, there are some considerations here. First, these tests should be lightweight so that running them doesn’t take a significant amount of time. You probably don’t want to run a large number of tests during self-health checks. Rather, run tests that cover blind spots in your alert rules or other forms of monitoring.
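To keep the checks from slowing down the hook, one option is to give them a time budget. The sketch below is only an assumption about how that could be done; `run_within_budget` and its parameters are hypothetical. Note that it stops starting new checks once the budget is spent, but it does not interrupt a check that is already running.

```python
import time


def run_within_budget(checks, budget_seconds=1.0, clock=time.monotonic):
    """Run (name, check) pairs in order until the time budget is spent.

    Returns (results, skipped): `results` maps each check name to its return
    value, `skipped` lists the names of checks that were never started.
    """
    deadline = clock() + budget_seconds
    results, skipped = {}, []
    for name, check in checks:
        if clock() >= deadline:
            skipped.append(name)
            continue
        results[name] = check()
    return results, skipped
```

Ordering the list so that the most critical checks come first means they still run even when the budget is tight.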
In addition, self tests as I have envisioned them here can lead to a few issues. First of all, many charmers have strong opinions about what should go in juju status. Many believe that juju status is not an observability dashboard and that you shouldn’t look to it to know whether your charm is running out of disk space, for example. This is reasonable. Ideally, you should have an observability layer that scrapes metrics from your charms/workloads and alerts you when things aren’t right. The keyword here is: ideally. In the case of the “Prometheus that looked healthy”, for example, we didn’t have any disk-related metrics to begin with. Again, self tests should help cover blind spots in health monitoring for things that are absolutely critical to a charm’s health.
Also, consider again our Prometheus instance from before. The idea of exposing issues through juju status is to capture the attention of an admin who can intervene. However, the admin needs to actually look at the status to realize something needs their attention, unlike alerts that trigger and send an email, for example. Imagine an admin who doesn’t look at juju status for a few days at a time because they have no reason to; everything seems to be working fine. In that case, even if the status is set to blocked, the admin won’t know that something has gone wrong until they look, and so they can’t take action in time.
If you’d like to see a simple implementation of this, have a look at this example from cos-lib.
I’d love some feedback and/or thoughts on the charm self tests! Does anyone think they can benefit from something like this? Any objections to the idea?