Observability Team Updates - Week 9-10 (2023)

0x12b · 14 March 2023 13:10

Hi everyone!

Below you’ll find the updates from the team for weeks 9 and 10. This sprint has been a little different for us, which I’ll go into in more detail later in this post. But first, as always, let me introduce the fantastic team and what we’re building.

This week, I’ll try a somewhat different format where we’ll go a little bit more in-depth into the things we’ve been working on. Let me know what you think below!

The Team

The observability team at Canonical consists of Dylan, Jose, Leon, Luca, Pietro, Ryan, and Simme. Our goal is to provide you with the best open-source observability stack possible, turning your day-2 operations into smooth sailing.

COS Lite

COS Lite is a lightweight, highly integrated observability suite, powered by Python operators and running on Juju. Find more information on Charmhub or go straight to GitHub.

Some love for machine charms

During the last two weeks, the entire team has worked exclusively on improving our story for monitoring machine charms. The outcome? The Grafana Agent subordinate charm!

One of the great things about the prior charms used for fetching logs and metrics from machine charms is that they require few (if any) changes to the principal charm they’re attached to. We wanted to preserve that behavior as much as possible, while correcting some of the issues we’ve identified.

Limitations of `juju-info`

Leveraging the juju-info interface to make subordinate charms “universally compatible” is neat. However, because the principal cannot use this relation to pass any relation data to the subordinate, what the subordinate can learn through it is fairly limited.

In the Grafana Agent subordinate we decided to use juju-info as a way of providing some baseline observability, namely:

all log files in /var/log
everything in the syslog journal
node_exporter metrics, meaning all metrics about the machine itself.

Moving the ownership to the charm authors

Due to the limitations of juju-info, the older charms relied on subordinate charms to also do the instrumentation, for instance through exporter charms and the filebeat charm.

We firmly believe that the team most suited to own the observability of a charm is not the observability team, or your ops teams, but the team that is developing and maintaining said charm.

Hence, we also introduce another relation for actively integrating with Grafana Agent: cos_agent. This relation comes with a library meant to make it as low effort as possible for anyone wishing to integrate with the observability stack. Below, you’ll find a diagram that explains how this looks for an alert rule, and where the boundaries between the libraries have been placed.
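
To make the idea of charm-owned observability a little more concrete, here is a minimal sketch of a principal charm describing its own metrics endpoints, log files and alert rules, and serializing that description so it could travel over a relation. Everything in it (`ObservabilitySpec`, `to_relation_data`, the `config` key, the `my-app` paths and the alert rule) is hypothetical and only illustrates the ownership model. It is not the API of the cos_agent library.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ObservabilitySpec:
    """Hypothetical description of what a principal charm wants observed."""

    metrics_endpoints: list = field(default_factory=list)
    log_files: list = field(default_factory=list)
    alert_rules: list = field(default_factory=list)


def to_relation_data(spec: ObservabilitySpec) -> dict:
    """Serialize the spec into a string-to-string mapping.

    Juju relation data holds string keys and string values, so the
    structured spec is JSON-encoded under a single (hypothetical) key.
    """
    return {"config": json.dumps(asdict(spec), sort_keys=True)}


# The charm author, not the observability team, decides what matters.
spec = ObservabilitySpec(
    metrics_endpoints=[{"path": "/metrics", "port": 8080}],
    log_files=["/var/log/my-app/app.log"],
    alert_rules=[{"alert": "MyAppDown", "expr": "up == 0", "for": "5m"}],
)
relation_data = to_relation_data(spec)
```

The point of the sketch is the direction of ownership: the alert rule lives next to the charm code it describes, and the subordinate only has to consume what it is handed.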

Confinement

While the prior solutions were very permissive, in the sense that each charm was able to get full access across the entire machine, the Grafana Agent machine charm runs on top of a strictly confined snap.

This poses a unique set of problems, as the snaps need to be able to hand over log files and the like to each other. Fortunately, snapd ships with several really good interfaces that address this out of the box. For logs, for instance, we use a combination of the log-observe interface and a content interface. Below, you’ll find a basic diagram of how the content interface works to make logs available from one strictly confined snap to another.
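
As a rough illustration of the content interface in code form, the sketch below models the two `snapcraft.yaml` fragments involved as Python dictionaries, plus the `snap connect` invocation that wires them together. The `interface`, `content`, `source`/`read` and `target` keys are the standard content interface attributes. The snap name `my-app`, the slot and plug names, and the directory paths are assumptions made up for this example, and the actual Grafana Agent snap may declare its plugs differently.

```python
# Provider side: a hypothetical application snap exposes its log directory
# read-only through a content slot (this would live under `slots:`).
provider_slots = {
    "logs": {
        "interface": "content",
        "content": "logs",
        "source": {"read": ["$SNAP_COMMON/logs"]},
    }
}

# Consumer side: the agent snap declares a matching content plug and the
# location where the shared directory should appear (under `plugs:`).
consumer_plugs = {
    "app-logs": {
        "interface": "content",
        "content": "logs",
        "target": "$SNAP/shared-logs",
    }
}


def connect_command(consumer: str, plug: str, provider: str, slot: str) -> list:
    """Build the argv for `snap connect <snap>:<plug> <snap>:<slot>`."""
    return ["snap", "connect", f"{consumer}:{plug}", f"{provider}:{slot}"]


# The plug and slot only pair up when their `content` labels match.
labels_match = (
    consumer_plugs["app-logs"]["content"] == provider_slots["logs"]["content"]
)
command = connect_command("grafana-agent", "app-logs", "my-app", "logs")
```

Once the connection is made, the consumer sees the provider’s log directory at its `target` path without either snap leaving strict confinement.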

Feedback welcome

As always, feedback is very welcome! Feel free to let us know your thoughts, questions, or suggestions either here or on the Charmhub Mattermost.

That’s all for this time! See you again in two weeks!