Charm self tests: the story of the Prometheus that looked healthy

The TL;DR:

Charms can benefit from a self-health test mechanism that periodically monitors conditions which cannot be reliably expressed as alert rules, giving administrators early warnings and time to take preventive action.

Story time :tada:

A Christmas Tale of Prometheus

In a land far away, there lived a Prometheus K8s operator that appeared to be doing just fine. Day and night, it watched over the kingdom’s workloads, collecting metrics, counting requests, and faithfully recording every spike, dip, and wobble.

All seemed peaceful.

But the Prometheus charm had a dark secret it couldn’t share with anyone: its disk space was about to run out.

Quietly, invisibly, the disk kept filling up. Day after day, Prometheus scraped happily, right up until the moment it couldn’t anymore. Without anyone noticing, the disk had run out.

The engineers and admins of the kingdom were left asking the same questions: why had there been no indication that the benevolent Prometheus charm was running out of space? Why had they not seen any warnings in juju status?

Okay, back to the real world

This scenario isn’t too far-fetched. Imagine you have a charm which manages a workload that needs persistent storage, e.g. Prometheus. As with almost all charms and workloads, you need to monitor the health of this particular Prometheus so that you can ensure it operates successfully, both now and in the future. For example, you need to ensure that there is plenty of disk space at deploy time and, more importantly, that you know when disk space is about to run out so you can take preventive action.

Take the bedtime story above as an example. The essence of the problem was that as Prometheus’ available disk space kept falling, there were no alerts or any other indications to attract the attention of an admin. This particular charm doesn’t come with Node Exporter, and depending on the charm, you may not want to add Node Exporter anyway. Although Prometheus exposes its own metrics, there is no metric that tracks available disk space. That means you cannot rely on alert rules.

Clearly, there are several charm/workload health checks that are best performed by the charm itself. Now, from the perspective of the charm author, the question becomes:

Why can’t the charm itself perform self-health tests and make the result of those tests known, possibly through juju status?

Going back to our Prometheus example, in the absence of Node Exporter, we could implement logic in the Prometheus charm code that, on update-status, checks its available disk space. If available space drops below e.g. 2 GiB, we can set status=blocked and display a message in juju status. And we don’t need to stop at disk space at all: you can periodically (on update-status) check CPU, memory usage, or anything else relevant and set blocked to capture the attention of an admin. The admin should then be able to determine what exactly needs to be done to restore the health of the charm. For example, if the charm goes into blocked along with the status message “disk space <1GiB”, the action they need to take is quite easy to figure out.
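
To make that concrete, here is a minimal sketch of what such a self test could look like in charm code. It is not the actual Prometheus charm: the data path and the 2 GiB threshold are illustrative, and it assumes the data directory is visible from where the charm code runs (a sidecar Kubernetes charm might instead inspect the workload container via Pebble).

```python
import shutil

import ops

DATA_PATH = "/var/lib/prometheus"  # illustrative data mount
MIN_FREE_BYTES = 2 * 1024**3       # the 2 GiB threshold from above


class SelfTestingCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, event: ops.UpdateStatusEvent) -> None:
        # The self test: a lightweight disk-space check on every update-status.
        free = shutil.disk_usage(DATA_PATH).free
        if free < MIN_FREE_BYTES:
            # Surface the failed self test in `juju status`.
            self.unit.status = ops.BlockedStatus(
                f"disk space <{MIN_FREE_BYTES / 1024**3:.0f}GiB "
                f"({free / 1024**3:.1f}GiB free)"
            )
        else:
            self.unit.status = ops.ActiveStatus()


if __name__ == "__main__":
    ops.main(SelfTestingCharm)
```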

Some considerations

Of course, there are some considerations here. Namely, these tests should be lightweight so that running them doesn’t take a significant amount of time. You probably don’t want to run a large number of tests during self-health checks; rather, run tests that cover blind spots in your alert rules or other forms of monitoring.

In addition, the way I have envisioned these self tests here can lead to a few issues. First of all, many charmers have strong opinions about what should go in juju status, and many believe that juju status is not an observability dashboard and hence you shouldn’t look to it to know whether your charm is running out of disk space, for example. This is reasonable. Ideally, you should have an observability layer which scrapes metrics from your charms/workloads and alerts you when things aren’t right. The keyword here is: ideally. In the case of the “Prometheus that looked healthy”, for example, we didn’t have any disk-related metrics to begin with. Again, self tests should probably cover blind spots in health monitoring for things that are absolutely critical to a charm’s health.

Also, consider again our Prometheus instance from before. The idea of exposing issues that need intervention through juju status is to capture the attention of an admin who can intervene. However, the admin needs to actually look at the status to realize something needs their attention, unlike the scenario where an alert fires and sends you an email, for example. Imagine an admin who doesn’t look at juju status for a few days at a time, perhaps because they have no reason to; everything seems to be working fine. In that case, even if the status is set to blocked, as long as the admin doesn’t look at it, they won’t know that something has gone wrong. In other words, they can’t take action in time.

If you’d like to see a simple implementation of this, have a look at this example from cos-lib.

I’d love some feedback and/or thoughts on the charm self tests! Does anyone think they can benefit from something like this? Any objections to the idea?

5 Likes

Hey @sinap, nice write-up!

As you already mentioned, there are probably a lot of opinions on using juju status for displaying this kind of operationally relevant message. As the general idea is to display blocked statuses when some kind of user/admin intervention is required, I can understand the desire to also surface issues like the one you described.

Still, from both an operational perspective and my personal experience, the only reliable way to prevent a system from failing is to have automated alerts that summon support staff, wherever they are, to come to the machine and fix the issue. These must also work on weekends or at night, when nobody is there to watch juju status and disk space might fill up to the point where the system is no longer operational.

This does not mean that I’m lobbying against charm self tests in general. If there is a reliable way of identifying issues from operator code, this can only help improve or speed up the troubleshooting process. I’d rather say that displaying the self-test result in juju status should not be the only way of transmitting the message to users/admins. Instead, logging a clear yet detailed-enough message can be a way of supporting both: display to the user and trigger alerting (e.g. via Loki → Alertmanager).
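
To sketch what I mean (the names here are illustrative, not from any existing charm): report a failed self test through a log line as well as through the unit status. If the charm’s logs end up in Loki (for example via a COS logging integration), a Loki alerting rule matching the “self-test failed” prefix can notify someone even when nobody is watching juju status.

```python
import logging

import ops

logger = logging.getLogger(__name__)


def report_self_test_failure(unit: ops.Unit, check: str, detail: str) -> None:
    # One consistent, machine-matchable prefix for every failed self test,
    # so a single Loki/Alertmanager rule can cover all of them.
    logger.warning("self-test failed: %s: %s", check, detail)
    # Still set the status, but it is no longer the only signal.
    unit.status = ops.BlockedStatus(f"{check}: {detail}")
```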

A few more things to consider:

  • Self tests on update-status should not be the only observability tooling for critical system metrics; remember that the update-status interval is configurable and might not be appropriate for observing system metrics.
  • If possible, operator code should not just alert but also self-heal, as the goal is to automate operational tasks. For example, if disk space is eaten up, how about log compaction? (See the sketch after this list.)
  • (My own) rule of thumb for determining whether to display the issue in juju status or not: did the issue occur because something was initiated by a user/admin (e.g. creating a relation, running an action)? If yes, display a status. If not, the status will probably be displayed into the void.
  • Make sure you also have observability for the observability tooling. As in your case: how do people get alerted if the observability tooling itself stops working?
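
For illustration, here is a rough sketch of what “self-heal first” could look like for the disk-space case. It is not taken from any existing charm, it assumes Prometheus was started with --web.enable-admin-api, and cleaning tombstones may reclaim little or nothing, so the self test should still run afterwards:

```python
import urllib.request


def try_reclaim_disk_space(base_url: str = "http://localhost:9090") -> bool:
    """Best-effort cleanup via the TSDB admin API; True if Prometheus accepted it."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/admin/tsdb/clean_tombstones", method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False
```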

Nevertheless, this is an interesting discussion and we all can only benefit from other people’s experience and perspective on operational practices.

2 Likes

Totally agree with what Rene writes.

  • a charm isn’t an exporter: we can’t rely on charm-driven checks to report issues because event frequency isn’t stable or reliable
  • juju status isn’t an observability dashboard (in many ways)
  • we’re drifting away from the paradigm “juju could go away and your stack would remain intact”. Say your juju controller is down or has connectivity issues. In theory that shouldn’t have any impact on your prometheus deployment. But if you rely on juju status for monitoring, now it does.

If prometheus isn’t offering disk usage metrics, that looks to me like a reason to always include a node-exporter with it, or a stripped-down custom exporter that offers the metrics we need to write a proper alert, or to contribute such a metric upstream. Either way, solve the issue in workload-land, not in operator town.
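
For illustration, such a stripped-down exporter could be as small as this (a sketch, not an existing project; the path and port are made up), using the prometheus_client library:

```python
import shutil
import time

from prometheus_client import Gauge, start_http_server

DATA_PATH = "/var/lib/prometheus"  # hypothetical data mount
PORT = 9101                        # hypothetical scrape port

free_bytes = Gauge(
    "data_filesystem_free_bytes",
    "Free bytes on the Prometheus data filesystem",
)

if __name__ == "__main__":
    start_http_server(PORT)
    while True:
        free_bytes.set(shutil.disk_usage(DATA_PATH).free)
        time.sleep(15)
```

With a metric like that scraped alongside Prometheus’ own metrics, the disk-space alert the original post was missing becomes a straightforward rule to write.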

I’m afraid of conflating the juju (operator, infra layer) status with workload status. That can’t possibly scale well, and it’s guaranteed to backfire.

1 Like