Observability Team Updates - Week 21-24 (2023)

Hi everyone!

Below are the team’s updates for weeks 21 to 24. First, as always, let me introduce the fantastic team and what we’re building.

The Team

The observability team at Canonical consists of Dylan, Jose, Leon, Luca, Pietro, and Simme. Our goal is to provide you with the best open-source observability stack possible, turning your day-2 operations into smooth sailing.

TLS everywhere

We’ve started working on bringing TLS to all COS Lite charms. As a first proof of concept, we created a cloud-init script that bootstraps a “local root ca” with openssl and deploys alertmanager and prometheus with TLS (charm-dev-utils/7). The immediate outcome was adding TLS to the alertmanager charm (alertmanager/164), relying on the tls-certificates library and the self-signed-certificates charm (thank you Telco team!). To be able to reuse much of the code in other charms, we’re working on a wrapper around tls-certificates (o11y-libs/49). One of the ideas there is to decouple workload concerns from charm code (related: mimir-coordinator/9) by having a library place the key and the cert in the workload’s filesystem.
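
To make the "library places the key and the cert" idea a bit more concrete, here is a minimal sketch using only the Python standard library. The function name, file names, and directory layout are hypothetical and not taken from o11y-libs/49; a real charm would most likely write into the workload container rather than the local filesystem.

```python
import os
from pathlib import Path


def place_tls_material(dest_dir, private_key, certificate,
                       key_name="server.key", cert_name="server.cert"):
    """Write a PEM key and certificate under dest_dir and return both paths.

    All names here are illustrative assumptions, not an existing library API.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    key_path = dest / key_name
    cert_path = dest / cert_name

    # Create the key file with owner-only permissions from the start.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(private_key)
    # Tighten permissions even if the file already existed with wider ones.
    os.chmod(key_path, 0o600)

    # The certificate is public, so default permissions are acceptable.
    cert_path.write_text(certificate)
    return key_path, cert_path
```

With something like this in a shared library, the charm only decides when certificates change, while the library owns where and how they land on disk.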

Self-monitoring

Tangential to the TLS effort, we’re revisiting our approach to network reachability between COS components. One outcome is using FQDNs for self-monitoring (traefik/186, grafana/222).
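
As a rough illustration of FQDN-based self-monitoring, the sketch below builds a scrape URL from the host's fully qualified domain name instead of an IP address. The helper name, the default path, and the example port are assumptions made for this post; the linked PRs may do this differently.

```python
import socket


def self_monitoring_url(port, path="/metrics", scheme="http", host=None):
    """Return a URL for scraping this unit's own metrics.

    `host` defaults to the machine's FQDN as reported by the standard
    library; pass it explicitly to override (e.g. in tests).
    """
    fqdn = host or socket.getfqdn()
    return f"{scheme}://{fqdn}:{port}{path}"
```

An FQDN also tends to fit TLS better than a bare IP, since certificates are usually issued for hostnames.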

New kind of ingress: ingress per leader (IPL)

Work is underway to fix ingress-per-app and to introduce a new kind of ingress, “ingress per leader”. This is an edge case which, as far as we know, is currently needed only for grafana (traefik/180). If you have a use case for it, we’d be happy to hear about it!

Grafana agent machine charm

The grafana-agent machine charm is gaining traction and we’ve been working on improving it (grafana-agent/198, 201, 205). Some of you may be excited about the collectors that are now enabled by default (grafana-agent/202).

Decouple charm code

Inspired by sunbeam and compound-status, we have an open PR with a proposal for decoupling “config building” and “workload management” from charm code (mimir-coordinator/9).

Testing

  • We’re steadily introducing more and more scenario tests, and we love it! (traefik/172, grafana-agent/115)

  • In integration tests, use an auto-use fixture that sets the update status interval to an effectively “infinite” value (e.g. 60m). The default is 5m, so the update-status hook may fire while wait_for_idle is waiting, resulting in flaky tests (traefik/184).

  • Various scenario testing fixes (ops-scenario/35, 37)
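
The update status interval tip above could look roughly like the sketch below. It assumes a model object with an async `set_config(dict)` method, as python-libjuju's `Model` provides; the helper name and the restore value are our own choices for illustration, and the fixture wiring in the trailing comment assumes pytest-operator's `ops_test` fixture with pytest-asyncio in auto mode.

```python
import asyncio
from contextlib import asynccontextmanager

# Juju model-config key that controls how often update-status fires.
UPDATE_STATUS_KEY = "update-status-hook-interval"


@asynccontextmanager
async def slow_update_status(model, interval="60m", restore="5m"):
    """Temporarily raise the update-status interval on `model`.

    `model` is assumed to expose an async `set_config(dict)` method.
    The interval is set back to `restore` on exit, even on errors.
    """
    await model.set_config({UPDATE_STATUS_KEY: interval})
    try:
        yield model
    finally:
        await model.set_config({UPDATE_STATUS_KEY: restore})


# Hypothetical fixture wiring (not executed here):
#
#   @pytest.fixture(autouse=True)
#   async def _slow_update_status(ops_test):
#       async with slow_update_status(ops_test.model):
#           yield
```

Restoring the interval afterwards keeps the model usable for tests that deliberately rely on update-status.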

CI

  • We continue to migrate our charms’ CI from mypy to pyright (e.g. grafana/218).

  • We’ve started integrating our image building process with oci-factory (o11y/72).

Feedback welcome

As always, feedback is very welcome! Feel free to let us know your thoughts, questions, or suggestions either here or on the CharmHub Mattermost.

That’s all for this time! See you again in two weeks! :sunny: :sunglasses:
