There seems to be a growing need for a “self-test” / diagnostic capability for charms and bundles that admins could run in production.
`diagnose` charm action
Some checks can be localized to the charm. For example:
- A summary tailored to the workload, such as disk utilization and cardinality for prometheus.
- Generate a logs archive from an air-gapped charm, to be sent to support or include `diagnose` output in issue reports.
- Give insights into things that are not instrumented yet upstream. For example: “Charm is happily ‘active/idle’ even when workload cannot reach any datasources”.
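To make the first two bullets a bit more concrete, here is a minimal sketch of what a `diagnose` handler could compute, using only the Python standard library. The function names, the `*.log` layout and the `diagnose.txt` file name are all made up for illustration; a real charm would wire something like this into an action handler and decide for itself what belongs in the summary.

```python
import io
import shutil
import tarfile
from pathlib import Path


def disk_summary(path: str) -> dict:
    """Return disk utilization for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "path": path,
        "total-bytes": usage.total,
        "used-bytes": usage.used,
        "used-percent": percent,
    }


def build_logs_archive(log_dir: str, summary: dict, dest: str) -> str:
    """Bundle *.log files plus a plain-text summary into one .tar.gz.

    The archive layout (logs/<name>, diagnose.txt) is hypothetical.
    """
    with tarfile.open(dest, "w:gz") as tar:
        for log_file in sorted(Path(log_dir).glob("*.log")):
            tar.add(log_file, arcname=f"logs/{log_file.name}")
        # Write the summary straight from memory, without a temp file.
        body = "\n".join(f"{k}: {v}" for k, v in summary.items()).encode()
        info = tarfile.TarInfo("diagnose.txt")
        info.size = len(body)
        tar.addfile(info, io.BytesIO(body))
    return dest
```

The resulting archive is a single file that could be copied out of an air-gapped environment and attached to a support ticket or issue report.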
Solution-level checks
Some checks are not available from within a charm. For example:
- All the dashboards of all related charms appear in grafana’s UI (`juju ssh ls` in one charm, and compare against a `curl` of the workload of another).
- Redundant circular relations between charms, for example both telegraf and grafana agent in the same VM are related to prometheus.
- Sometimes charms are stuck in maintenance status, and when you `k -n test-bundle-urai describe pod/grafana-0` you discover that `1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }`. Similar for resource limits. Finding this out is currently possible only by pushing k8s telemetry into an o11y tool.
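As a rough illustration of the last bullet, a solution-level check could scan pod objects for failed scheduling instead of relying on someone to run `describe` by hand. The sketch below assumes the input is the parsed JSON of `kubectl get pods -n <namespace> -o json`; the function name is hypothetical, and it only looks at the `PodScheduled` condition, so resource-limit problems that surface elsewhere would need their own check.

```python
def scheduling_problems(pod_list: dict) -> list[str]:
    """Return one line per pod whose PodScheduled condition is False.

    `pod_list` is expected to be the parsed JSON of a Kubernetes pod list
    (a dict with an "items" key).
    """
    problems = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        for cond in pod.get("status", {}).get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                reason = cond.get("reason", "")
                message = cond.get("message", "")
                problems.append(f"{name}: {reason}: {message}")
    return problems
```

A caller might feed it with `json.loads(subprocess.check_output([...]))` and surface any returned lines next to the charm's maintenance status.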
General observations
- There is some overlap between the concerns of integration tests, charm status, update-status hook, charm alert rules, and this proposed “diagnose” capability.
- It could be handy if integration tests could reuse a charm’s `diagnose` action.
- I also wonder how this might look with a `juju diagnose` command and scriptlets.
Current approaches
- Pebble health check with `on-check-failure: <check name>: ignore`.
- goss - server validation tool that can also serve healthcheck results.
- sosreport plugins for kafka, zookeeper
- Kafka e2e tests (pytest operator + pylibjuju) against live deployment (guide)
- mysql `get_cluster_status` charm action (actions.yaml, code)
- Prometheus diagnostics tox env (PoC PR)
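For the Pebble bullet above, this is roughly what a layer with a "report only" check could look like when written as a Python dict, the way charms often define layers. It is a sketch under assumptions: the helper names, the check period and the `alive` level are illustrative choices, not taken from any of the charms listed.

```python
def layer_with_ignored_check(service: str, command: str, check: str, url: str) -> dict:
    """Build a Pebble layer whose service ignores failures of `check`.

    The check still runs and reports its status; the service is just not
    restarted or shut down when the check fails.
    """
    return {
        "summary": f"{service} layer",
        "services": {
            service: {
                "override": "replace",
                "command": command,
                "startup": "enabled",
                "on-check-failure": {check: "ignore"},
            }
        },
        "checks": {
            check: {
                "override": "replace",
                "level": "alive",  # illustrative choice
                "period": "30s",   # illustrative choice
                "http": {"url": url},
            }
        },
    }


def undefined_check_refs(layer: dict) -> list[str]:
    """List "service -> check" pairs that reference a check the layer lacks."""
    defined = set(layer.get("checks", {}))
    missing = []
    for svc_name, svc in layer.get("services", {}).items():
        for check_name in svc.get("on-check-failure", {}):
            if check_name not in defined:
                missing.append(f"{svc_name} -> {check_name}")
    return missing
```

The second helper is the kind of small consistency test that a `diagnose` action or an integration test could share.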
Ideas
I once heard that the branches of design – architecture, interior design, graphical design, fashion and jewelry – each deal with a different scale, each overlaps another, and each is at a different distance from the human body (e.g. a person is inside a building, and an earring/tattoo is inside a person).
If we try to divide potential self-tests into categories by distance from the workload, we get something like:
| | Jewelry | Fashion | Graphical design | Interior design | Architecture |
|---|---|---|---|---|---|
| K8s/Juju analogy | Sidecar | Charm | Model, bundle | Multi-model/controller solution | |
| Potential self-test | Pebble checks, goss | Charm ‘diagnose’ action, collect-status, built-in alert rules | pylibjuju live test, sosreport | pylibjuju e2e test | |