Observability Team Updates - Week #13 2022

Hi everyone, below you can find the updates from the Observability Platform team for the previous week starting on 28/03/2022.

cc @leon-mintz @jose @michele-mancioppi @bthomas @rbarry @dylanstathis

  1. All the GitHub repositories maintained by the team now include both issue templates and pull request templates, making it easier for both external and internal developers to make quality contributions.

  2. Edge releases for the following charms are now automated on merge to main:

  3. Injecting Juju Topology label matchers into an alert rule expression at runtime

    • Will no longer add superfluous line breaks (#1)
    • Will now skip topology labels already present to avoid creating unmatchable expressions (#2)
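
To illustrate why skipping labels that are already present matters, here is a minimal Python sketch. It is not the promql-transform implementation: `inject_topology` is a hypothetical helper that only handles a single vector selector such as `up{job="x"}`, not arbitrary PromQL. Injecting a second, conflicting `juju_model` matcher would yield a selector that can never match, so the sketch leaves existing matchers untouched.

```python
import re

# Hypothetical helper for illustration only; real expressions need a PromQL parser.
# One label matcher: name, operator, double-quoted value (escapes allowed).
MATCHER = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*(=~|!~|!=|=)\s*"((?:[^"\\]|\\.)*)"')
# A bare metric name, optionally followed by a {...} matcher list.
SELECTOR = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\s*(?:\{(.*)\})?$')


def inject_topology(selector: str, topology: dict) -> str:
    """Add topology matchers to a selector, skipping labels already present."""
    parsed = SELECTOR.match(selector.strip())
    if parsed is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    metric = parsed.group(1)
    body = (parsed.group(2) or "").strip().rstrip(", ")
    # Label names the rule author already constrained.
    present = {m.group(1) for m in MATCHER.finditer(body)}
    extra = [f'{name}="{value}"' for name, value in topology.items()
             if name not in present]
    parts = ([body] if body else []) + extra
    return metric + "{" + ",".join(parts) + "}" if parts else metric


print(inject_topology('up{juju_model="prod"}',
                      {"juju_model": "dev", "juju_application": "loki"}))
```

In this sketch the existing `juju_model="prod"` matcher wins and only `juju_application` is appended.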

Besides the extensive build and release automation introduced for most of the COS charms (see this post from @0x12b), we also did some serious work on automating the build and release of file-based resources used by the charms:

  1. promql-transform, the secret sauce that allows COS Lite to automatically apply Juju topology to PromQL alert queries, got a snazzy new build automation that, on merges to main, automatically uploads new revisions to CharmHub for amd64 and arm64 architectures.

  2. promtail, the heart of the LogProxyConsumer facility of the loki_push_api library, is now automatically built and released to the canonical/loki-k8s-operator repository with static linking whenever the upstream grafana/loki repository creates a GitHub release (see, e.g., this loki-k8s-operator GitHub release). This will enable the LogProxyConsumer, as soon as we update the loki_push_api library, to set up promtail running on all sorts of containers, irrespective of their base image (and we automatically test a bunch of them :slight_smile: )

Also, if you are a Git geek, you may find it interesting how we keep shallow tags in canonical/loki-k8s-operator, built from the upstream grafana/loki repository, so that in the future we can patch and rebuild promtail versions as needed.

The traefik-k8s charm has had a couple of bugs squashed: spurious data was left in the relation when a proxied application was scaled to zero (e.g., juju scale-application prometheus-k8s 0), and a race condition, in which the proxied application did not write some relation data quickly enough, led to transient errors in the juju debug-log.

Also, we are making progress in re-implementing the ingress_per_unit library to remove the dependency on the serialized-data-interface package for better understandability, testability, debuggability and ease of use (no need to add stuff to your requirements.txt, charmcraft fetch-lib will suffice).

In Loki Charmed Operator:

  1. Work in progress: Integration between Loki and Traefik ingress. See PR #120

  2. The LokiPushApiProvider no longer expects planned units to follow a sequential pattern, resolving a bug that surfaced whenever there either was a gap in the sequence or the sequence did not start from 0 (#117)

  3. Alert rules integration tests were added. See Issue #103 and PR #122
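
The unit-numbering bug from item 2 above can be sketched as follows. This is an illustration under assumptions, not the actual change in #117: both function names are hypothetical, and the sketch only contrasts guessing unit numbers from a count with reading them from the unit names.

```python
# Hypothetical helpers for illustration; the real fix in the library may differ.

def assumed_unit_numbers(planned_units: int) -> list:
    """The fragile assumption: units are numbered 0..n-1 with no gaps."""
    return list(range(planned_units))


def actual_unit_numbers(unit_names: list) -> list:
    """Read each number from the Juju unit name, e.g. 'loki-k8s/4' -> 4."""
    return sorted(int(name.rsplit("/", 1)[1]) for name in unit_names)


units = ["loki-k8s/4", "loki-k8s/1"]     # gap in the sequence, does not start at 0
print(assumed_unit_numbers(len(units)))  # numbers guessed from the unit count
print(actual_unit_numbers(units))        # numbers actually present
```

With the two units above, the guessed numbers point at units that do not exist, which is the failure mode described in item 2.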

More in the Loki charmed operator:

  1. The loki_push_api library now correctly handles empty rule files (#98)
  2. Fixed the handling of exceptions raised during relation events (#102).

Also in the Loki Operator:

  • Integration with AlertManager is ready to merge (PR #129); once it lands, you should be able to send Loki alerts to receivers.

In the Grafana Operator:

  1. A small bug was squashed in the GrafanaSourceProvider library, which previously assumed that all datasources would be available at the HTTP root. With ingress support rapidly making its way into Observability charms, this needed an update (PR #85), and the Grafana Operator now happily talks to ingress-enabled Prometheus and Loki.
    • Note that this change adds an additional argument to the GrafanaSourceProvider constructor. If your charm uses this library and provides a datasource endpoint somewhere other than /, you should use the source_url parameter when instantiating it.
  2. GrafanaDashboardConsumer|Provider try to be smart about tweaking values in incoming dashboards so the datasources will automatically work with Juju Topology templated variables. In doing so, any datasource which already had the name we expected to change it to (such as prometheusds or lokids) was accidentally squashed. For the most part, dashboards from the Grafana Marketplace and hand-built dashboards were unlikely to encounter this, but our internal integration testing did.
    • This was resolved in PR #73, which also ensures that dashboards with “spacer” panels, which have a null value for the datasource, work as expected.
  3. Finally, PR #81 loosens the restrictions on acceptable dashboard templates and updates the docs to match. Previously, we expected all dashboards sent from charms to end in .tmpl, as early implementations used Jinja templating to substitute variables instead of adding dropdowns in the template. Now that Jinja templating is no longer used, any *.tmpl or *.json file can be used.
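
The loosened file-name rule from item 3 can be sketched in a few lines of Python. `select_dashboards` is a hypothetical helper written for this post, not the library's own code; it keeps any file name whose final extension is .tmpl or .json.

```python
from pathlib import Path

# Extensions accepted after the change described above.
ACCEPTED_SUFFIXES = {".tmpl", ".json"}


def select_dashboards(filenames: list) -> list:
    """Keep file names ending in .tmpl or .json, preserving input order."""
    return [name for name in filenames
            if Path(name).suffix in ACCEPTED_SUFFIXES]


print(select_dashboards(["loki.json", "legacy.tmpl", "README.md"]))
```

Only the last extension is checked here, so a name such as dashboard.json.tmpl is still accepted.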

In the Loki Charmed Operator:

  1. Fixed a critical bug where alerts would not fire on any unit other than the leader.

In the Prometheus Charmed Operator:

  1. Updated the prometheus_remote_write library to properly fix the same bug as in the Loki Charmed Operator above (alerts not firing on units other than the leader).
  2. Updated the prometheus_remote_write library to not overwrite previously added topology in alert rules. This allows rules to be forwarded through grafana-agent.