Self-tests for charms?

There seems to be a growing need for a “self-test” / diagnostic capability for charms and bundles that admins could run in production.

diagnose charm action

Some checks can be localized to the charm. For example:

  • A summary tailored to the workload, for example disk utilization and cardinality for prometheus.
  • Generate a logs archive from an air-gapped charm to send to support, or include diagnose output in issue reports.
  • Give insights into things that are not instrumented yet upstream. For example: “Charm is happily ‘active/idle’ even when workload cannot reach any datasources”.
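As a rough illustration of the first two bullets, here is a minimal stdlib-only sketch of what the body of a diagnose action might compute. The function names, result keys, and directory arguments are all hypothetical; how this would be wired into an actual charm action handler is left out.

```python
import shutil
import tarfile
from pathlib import Path


def disk_summary(path: str) -> dict:
    """Return disk utilization for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "path": path,
        "total-bytes": usage.total,
        "used-bytes": usage.used,
        "used-percent": percent,
    }


def build_logs_archive(log_dir: str, archive_path: str) -> list:
    """Pack every regular file under `log_dir` into a gzip tarball.

    Returns the archived names, relative to `log_dir`, in sorted order.
    `archive_path` should live outside `log_dir`, otherwise the archive
    could end up trying to include itself.
    """
    root = Path(log_dir)
    names = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for file in sorted(root.rglob("*")):
            if file.is_file():
                rel = file.relative_to(root).as_posix()
                tar.add(file, arcname=rel)
                names.append(rel)
    return names


def diagnose(data_dir: str, log_dir: str, archive_path: str) -> dict:
    """Hypothetical 'diagnose' payload: disk summary plus a logs archive."""
    result = disk_summary(data_dir)
    result["archived-logs"] = build_logs_archive(log_dir, archive_path)
    result["archive"] = archive_path
    return result
```

An admin (or an integration test) could then attach the resulting archive and summary to an issue report.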

Solution-level checks

Some checks are not available from within a charm. For example:

  • Whether all the dashboards of all related charms appear in grafana’s UI (juju ssh and ls in one charm, then compare against a curl of another charm’s workload).
  • Redundant circular relations between charms, for example both telegraf and grafana agent in the same VM are related to prometheus.
  • Sometimes charms are stuck in maintenance status, and only when you run k -n test-bundle-urai describe pod/grafana-0 do you discover that 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. The same goes for resource limits. Finding this out is currently possible only by pushing k8s telemetry into an o11y tool.
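Two of the checks above can be sketched outside any charm. This is a minimal sketch, assuming the pod description is available as parsed JSON (for example from kubectl get pod ... -o json) and that the dashboard names have already been collected from both sides. The function names and the sample pod are hypothetical; the sample message is modeled on the one quoted above.

```python
def unschedulable_reasons(pod: dict) -> list:
    """Return scheduler messages for a pod that could not be scheduled.

    Looks for a 'PodScheduled' condition with status 'False' in the
    pod's status, which is where Kubernetes reports scheduling failures.
    """
    conditions = pod.get("status", {}).get("conditions", [])
    return [
        cond.get("message", cond.get("reason", "unknown"))
        for cond in conditions
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False"
    ]


def missing_dashboards(expected: set, found: set) -> set:
    """Dashboards provided by related charms but absent from grafana."""
    return expected - found


# Illustrative sample only, not captured from a real cluster.
SAMPLE_POD = {
    "metadata": {"name": "grafana-0"},
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "1 node(s) had untolerated taint "
                "{node.kubernetes.io/disk-pressure: }",
            }
        ],
    },
}

if __name__ == "__main__":
    for reason in unschedulable_reasons(SAMPLE_POD):
        print(f"grafana-0 is unschedulable: {reason}")
```

A solution-level doctor could run such checks across every pod in the model's namespace and surface the result next to the charm's maintenance status.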

General observations

  • There is some overlap between the concerns of integration tests, charm status, update-status hook, charm alert rules, and this proposed “diagnose” capability.
  • It could be handy if integration tests could reuse a charm’s diagnose action.
  • I also wonder how this might look with a juju diagnose command and scriptlets.

COS doctor Venn

Current approaches

Ideas

I once heard that the branches of design – architecture, interior design, graphical design, fashion and jewelry – each deal with a different scale, each overlaps another, and each is at a different distance from the human body (e.g. a person is inside a building, and an earring/tattoo is inside a person).

If we try to divide potential self-tests into categories by distance from the workload, we get something like this:

Jewelry | Fashion | Graphical design | Interior design | Architecture
K8s/Juju analogy: Sidecar | Charm | Model, bundle | Multi-model/controller solution
Potential self-test: Pebble checks, goss | Charm ‘diagnose’ action, collect-status, built-in alert rules | pylibjuju live test, sosreport | pylibjuju e2e test

+1. There is a WIP solution on the Data Team side: “goss validate” (pending packaging). The long read is in the ticket; TL;DR: it is an extremely fast and flexible self-check tool. Check it out on GitHub: goss-org/goss (Quick and Easy server testing/validation).


Looks great, @taurus, thanks! I updated the list.


Something like this could use the charm’s own collect-status facilities to gather charm-side info: Compound status tree representation: a deep dive into a little `jhack` utility