Most charmers have at some point had to help admins, users, or any charm enthusiasts debug a charm-related problem. Consider the good ol’ Prometheus charm for K8s. Throughout its long and fruitful life as one of the key charms in the Lite brand of the Canonical Observability Stack (COS), it has seen its fair share of questions, issues, and enhancement proposals. Needless to say, from time to time, people reach out on the repo for possible tips to work around bugs or to report errors. And usually, when people reach out to report bugs, errors, or general issues they have run into when running the charm, one of the first lines of investigation from a maintainer’s point of view is to ask the reporter to supply some or all of:
- The `juju status --relations` results.
- The `juju config` options set.
- The command-line arguments that Prometheus is running with.
- The pebble layer.
- Relation data.
- The output of `juju show-unit`.
- Aaaand perhaps most importantly, the workload config file. In Prometheus’ case, that’s located at `/etc/prometheus/prometheus.yml`.
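Gathering the items above by hand usually means running several juju commands and pasting the outputs together. As a minimal sketch of automating that (the application and unit names such as `prometheus-k8s/0`, and the exact command set, are assumptions — adjust them for your deployment):

```python
import subprocess

# Commands a maintainer typically asks a reporter to run. The application,
# unit, and container names below are hypothetical placeholders.
COMMANDS = {
    "juju status": ["juju", "status", "--relations"],
    "juju config": ["juju", "config", "prometheus-k8s"],
    "juju show-unit": ["juju", "show-unit", "prometheus-k8s/0"],
    "workload config": [
        "juju", "ssh", "--container", "prometheus", "prometheus-k8s/0",
        "cat", "/etc/prometheus/prometheus.yml",
    ],
}


def run_command(cmd):
    """Run a command and return its stdout (fails loudly if juju is absent)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def collect_report(run=run_command):
    """Join the output of every diagnostic command into one pasteable report."""
    sections = [f"## {title}\n{run(cmd)}" for title, cmd in COMMANDS.items()]
    return "\n\n".join(sections)
```

Passing `run` in as a parameter keeps the collection logic testable without a live controller.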
Of course, sometimes the issue can be completely different, and none of these may even come close to helping you. Regardless, you see an issue… right? In a nutshell, the charm devs certainly know (in most cases) roughly what they need to start diagnosing the issue in search of a root cause and fix. However, there is no guarantee that the reporter does as well. Furthermore, how many rounds of back-and-forth async communication will it take until a maintainer feels like they have all the info and config needed to look for a solution? Not to mention, the reporter of the issue shouldn’t feel like they need to shell into the container, grab the config file, the pebble layer, and a whole bunch of other configured values, paste them into a file, format them, and only then supply them to a maintainer.
Problems
The point is, there seem to be two problems at hand:
- When an issue is created on GitHub, we need to make it well known what exactly should be supplied before we can effectively approach the issue. Think of - almost - all the repos you’ve ever opened an issue on. They ask you to include the environment and workload version when you’re opening an issue, for example. Similarly, charm repos need to make it known all that is needed from `juju status`, the workload config, and the pebble layer when an issue is created.
- Even if the reporter knows what they need to provide, they may not know exactly where to look, and/or if they think it’s too much, they may give up on it.
The state to get to is: the charm should provide an admin with an easy way of collecting vital configuration from it.
The question now is: what is the best way of doing this?
Let’s explore some of the options that may immediately come to mind. Needless to say, I appreciate your feedback on these!
Put your requirements directly in the issue template
Fixes problem 1
Doesn’t fix problem 2
This one seems easy enough: you define a couple of mandatory fields in your issue template which the user has to fill in and you mandate them to provide, for instance, the workload config when creating an issue. And you clarify where exactly the reporter needs to go to fetch exactly what you need from them.
Of course, while this seemingly gives you all you need when you want to troubleshoot, it still forces the reporter to perform multiple manual and possibly time-consuming steps.
Another drawback of this approach is that you need to place knowledge about your charm and workload in a GitHub issue template of all places, and when the charm or workload changes, it’s easy to forget to update the template.
Alternatively, you could have landing/documentation/reference/explanation pages on readthedocs for your charms where you tell people where they need to look to gather “troubleshoot-worthy” configs, but again, much of the burden is still on the person experiencing the issue to first educate themselves on what they need to do to help you help them.
This brings us to another possible solution. The charm knows all about the configs applied to both itself and its workload. Shouldn’t it tell us what we need to look at to diagnose an issue?
Supply the necessary configs through a juju action
Consider this example from the K8s charm for Blackbox Exporter. Also consider the wording used to describe this action:

> To verify Blackbox Exporter is using the expected configuration you can use the `show-config` option
For most workload-managing charms, the operator relies on juju config options, relation data, and the charm author’s understanding of the workload to write a workload config file, configure envvars, and do a whole bunch of other things. All of these can impact the manner in which the charm or workload behaves, which may lead to unexpected issues for users. So, why can’t the charm author have an action, similar to the `show-config` action in Blackbox Exporter K8s, which verifies that both the charm and workload are indeed configured correctly?
The Blackbox Exporter example is actually a rather simple one, because all the action does is return what’s at `/etc/blackbox_exporter/config.yml`. For most charms and workloads, that config file is a good start, but there are many more things that can help gain a better understanding. So consider the example below.
I have a shiny workload operated by a shiny charm. I configured a few Juju options, added a couple of relations, granted a few secrets, yet I now notice that my workload is still not behaving as expected. Fortunately, the shiny charm has a `show-config` action which helps me get a detailed look into all the configs that the workload and charm rely on.
I run `juju run shiny-operator/0 show-config` and I get an output similar to:

```
## Workload config in /etc/shiny/shiny.yaml
<the rest of my workload config>

## Juju config options
<my Juju config options>

## Workload CLI args
<my cli args, e.g. max-retention=30d>

## Path to cert files, etc.
<path to cert files>

## Relation data
<relation 1 data>
<relation 2 data>

## Possibly some logs, e.g. the 50 most recent error logs?
```
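On the charm side, the handler behind such an action could assemble those sections along these lines. This is a standalone sketch: in a real charm the inputs would come from the model, the relation databags, and the workload container, and the path `/etc/shiny/shiny.yaml` is this post’s made-up example.

```python
def build_show_config_report(workload_config, juju_config, cli_args, relation_data):
    """Assemble the sectioned report returned by a show-config action.

    All arguments are plain Python values the charm has already collected;
    how it collects them is workload-specific.
    """
    lines = ["## Workload config in /etc/shiny/shiny.yaml", workload_config, ""]
    lines += ["## Juju config options"]
    lines += [f"{key}: {value}" for key, value in juju_config.items()]
    lines += ["", "## Workload CLI args", " ".join(cli_args), ""]
    lines += ["## Relation data"]
    for relation, data in relation_data.items():
        lines.append(f"{relation}: {data}")
    return "\n".join(lines)
```

In an ops-based charm, the action handler would hand this string to the action event’s results rather than printing it.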
I could then very easily provide this to the team that maintains the shiny operator and ask them to have a look. Of course, when the charm author implements this action, based on their knowledge of the charm and its workload, they should ensure all sensitive data, including tokens, secret keys, cert contents, etc., is redacted unless there is a secure and logical reason to keep it.
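As a sketch of what that redaction could look like (the key patterns below are illustrative, not an exhaustive list of what counts as sensitive):

```python
import re

# Matches "key: value" or "key = value" lines whose key looks sensitive,
# keeping the key visible and hiding only the value. The pattern list is
# a hypothetical starting point, not a complete inventory.
SENSITIVE_LINE = re.compile(
    r"^(\s*)(password|token|secret[-_]?key|private[-_]?key)(\s*[:=]\s*).+$",
    re.IGNORECASE | re.MULTILINE,
)


def redact(text):
    """Replace the values of sensitive-looking keys with a placeholder."""
    return SENSITIVE_LINE.sub(r"\1\2\3<redacted>", text)
```

Keeping the keys visible (and only masking the values) still lets a maintainer see that, say, a token was configured at all, which is often half the diagnosis.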
This is possible because a charm author has defined an action which gives us one detailed look into all that is important to the health of both the charm and the workload.
This method:
- Fixes problem 1: with one command, we grab everything we need in order to understand whether the workload and charm are working as expected.
- Fixes problem 2 for the same reason: with one command, we get everything we need.
- Is potentially time-consuming to implement if your stack already has a relatively large number of charms.

The question now can be: should all charms show this to some extent? Does this make sense for integrator charms, for example?
To summarize: most charms could benefit from having a `show-config` action which provides all the workload- and charm-related configs that may explain the current behaviour of the charm and its workload.
I’d like to gather some opinions around this. I’d appreciate any feedback!