Self-tests for charms?

sed-i · 28 May 2024 16:30

There seems to be a growing need for a “self-test” / diagnostic capability for charms and bundles that admins could run in production.

`diagnose` charm action

Some checks can be localized to the charm. For example:

A summary tailored to the workload, for example disk utilization and cardinality for prometheus.
Generate a log archive from an air-gapped charm to send to support, or to include diagnose output in issue reports.
Give insights into things that are not instrumented yet upstream. For example: “Charm is happily ‘active/idle’ even when workload cannot reach any datasources”.
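
As a rough illustration of the first item, the body of a `diagnose` action could be little more than a function that returns a flat summary. The sketch below uses only the Python standard library. The name `collect_diagnostics`, its result keys and the 90% threshold are made up for illustration, and wiring it to an actual charm action is left out.

```python
import shutil
import tempfile


def collect_diagnostics(data_dir: str, warn_percent: float = 90.0) -> dict:
    """Return a flat summary for one storage path.

    Hypothetical helper: the key names and the threshold are assumptions,
    not part of any existing charm.
    """
    usage = shutil.disk_usage(data_dir)  # named tuple: total, used, free (bytes)
    used_percent = round(100.0 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "data-dir": data_dir,
        "disk-total-bytes": usage.total,
        "disk-used-percent": used_percent,
        "disk-ok": used_percent < warn_percent,
    }


if __name__ == "__main__":
    # Any existing directory works here; a charm would pass its storage mount.
    print(collect_diagnostics(tempfile.gettempdir()))
```

Keeping the result a flat dict of simple values should make it easy to return from an action and to paste into an issue report.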

Solution-level checks

Some checks are not available from within a charm. For example:

Whether all the dashboards of all related charms appear in grafana’s UI (`juju ssh` and `ls` in one charm, then compare against a `curl` of the workload of another).
Redundant circular relations between charms, for example both telegraf and grafana agent in the same VM are related to prometheus.
Sometimes charms are stuck in maintenance status, and when you run `k -n test-bundle-urai describe pod/grafana-0` you discover that `1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }`. The same goes for resource limits. Finding this out is currently possible only by pushing k8s telemetry into an o11y tool.
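
For the last item, a solution-level check could scan the pod description for scheduling messages like the one quoted above. This is only a sketch: `find_untolerated_taints` is a hypothetical helper, the sample line is illustrative, and the regular expression assumes the message keeps exactly the wording shown in this post.

```python
import re

# Matches scheduler messages such as:
#   "1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }"
_TAINT_RE = re.compile(
    r"(\d+) node\(s\) had untolerated taint \{([^:}]+):\s*([^}]*)\}"
)


def find_untolerated_taints(describe_output: str) -> list[tuple[int, str, str]]:
    """Extract (node_count, taint_key, taint_value) from pod event text."""
    return [
        (int(count), key.strip(), value.strip())
        for count, key, value in _TAINT_RE.findall(describe_output)
    ]


if __name__ == "__main__":
    # Illustrative input; real text would come from the describe command.
    sample = "1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }"
    print(find_untolerated_taints(sample))
```

An empty list means no such message was found, which a diagnostic tool could treat as "scheduling is not blocked by taints".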

General observations

There is some overlap between the concerns of integration tests, charm status, update-status hook, charm alert rules, and this proposed “diagnose” capability.
It could be handy if integration tests could reuse a charm’s diagnose action.
I also wonder how this might look with a juju diagnose command and scriptlets.

COS doctor Venn

Current approaches

Pebble health check with on-check-failure: <check name>: ignore.
goss - server validation tool that can also serve healthcheck results.
sosreport plugins for kafka, zookeeper
Kafka e2e tests (pytest operator + pylibjuju) against live deployment (guide)
mysql get_cluster_status charm action (actions.yaml, code)
Prometheus diagnostics tox env (PoC PR)
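
To make the first item in the list above more concrete, here is a sketch of a Pebble layer, written as a plain Python dict, in which a failing check is recorded but does not restart the service. The service name, check name, command and URL are placeholders, and `undefined_check_refs` is a hypothetical helper that only verifies that every `on-check-failure` entry points at a check defined in the same layer.

```python
# A Pebble layer expressed as a plain dict. The check name, service name,
# command and URL are placeholders chosen for illustration.
LAYER = {
    "services": {
        "workload": {
            "override": "replace",
            "command": "/bin/workload",
            "startup": "enabled",
            # Record the check result, but do not restart the service on failure.
            "on-check-failure": {"workload-up": "ignore"},
        }
    },
    "checks": {
        "workload-up": {
            "override": "replace",
            "level": "alive",
            "http": {"url": "http://localhost:8080/health"},
        }
    },
}


def undefined_check_refs(layer: dict) -> list[str]:
    """Return 'service:check' pairs whose on-check-failure names no defined check."""
    defined = set(layer.get("checks", {}))
    missing = []
    for svc_name, svc in layer.get("services", {}).items():
        for check_name in svc.get("on-check-failure", {}):
            if check_name not in defined:
                missing.append(f"{svc_name}:{check_name}")
    return missing


if __name__ == "__main__":
    print(undefined_check_refs(LAYER))
```

With `ignore`, the check status stays visible for a self-test to read while the decision about what to do is left to the charm or the admin.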

Ideas

I once heard that the branches of design – architecture, interior design, graphical design, fashion and jewelry – each deal with a different scale, each overlaps another, and each is at a different distance from the human body (e.g. a person is inside a building, and an earring/tattoo is inside a person).

If we try to divide potential self-tests into categories by distance from the workload, we get something like this:

	Jewelry	Fashion	Graphical design	Interior design	Architecture
K8s/Juju analogy	Sidecar	Charm		Model, bundle	Multi-model/controller solution
Potential self-test	Pebble checks, goss	Charm ‘diagnose’ action, collect-status, built-in alert rules	pylibjuju live test, sosreport	pylibjuju e2e test

taurus · 5 June 2024 07:10

+1. A WIP solution on the Data Team side is “goss validate” (pending packaging). The long read is in the ticket; TL;DR: it is an extremely fast and flexible self-check tool. Check it out on GitHub - goss-org/goss: Quick and Easy server testing/validation

sed-i · 5 June 2024 16:57

Looks great @taurus, thanks! I updated the list.

ppasotti · 10 June 2024 09:56

Something like this could be used to tap the charm’s own collect-status facilities to gather charm-side info: Compound status tree representation: a deep dive into a little `jhack` utility