Observability Team Updates - Week 11-12

Hi everyone!

Below are the team’s updates for weeks 11 and 12. First, as always, let me introduce the fantastic team and what we’re building.

The Team

The observability team at Canonical consists of Dylan, Jose, Leon, Luca, Pietro, Ryan, and Simme. Our goal is to provide you with the best open-source observability stack possible, turning your day-2 operations into smooth sailing.

COS Lite

COS Lite is a lightweight, highly integrated observability suite, powered by Python operators and running on Juju. Find more information on Charmhub or go straight to GitHub.

The Work

Who is watching the watchmen?

As some of you may remember, I posted an article on my blog last September about monitoring your observability platform. It received quite a lot of attention, making it all the way onto the Hacker News front page. In the article, I discussed the need to monitor the health of the alerting pipeline itself, so that we are notified as soon as it has any hiccups.

Let me point out that I’m aware that there are other projects which fill similar needs, like the commercial Dead Man’s Snitch and the open source Dead Man’s Switch. However, there really weren’t any projects available that were both:

  • Fully open source
  • Actively maintained

After discussing internally and making sure we completed work-in-progress of even higher priority, we’ve finally had an opportunity to start working on bridging this gap for the COS Lite stack. The result of this effort is something we’ve decided to call the COS Alerter. The approach itself is not novel in any way and is made up of a small, portable watchdog service, which expects to be pinged continuously by an always-firing alert rule.
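To make the watchdog idea concrete, here is a minimal sketch of the general dead man's switch pattern in Python. It is not the COS Alerter code; the class name, method names, and the timeout value are hypothetical and chosen only for illustration. The always-firing alert rule would call ping() each time it is delivered, and a separate check would raise the alarm once pings stop arriving.

```python
import time
from typing import Callable


class Watchdog:
    """Hypothetical dead man's switch: expects regular pings and
    reports when they stop arriving."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        # Start the countdown immediately, so a pipeline that never
        # pings at all is flagged too.
        self._last_ping = clock()

    def ping(self) -> None:
        """Record a heartbeat from the always-firing alert rule."""
        self._last_ping = self._clock()

    def is_silent(self) -> bool:
        """Return True if no ping arrived within the timeout window."""
        return self._clock() - self._last_ping > self.timeout_seconds


if __name__ == "__main__":
    # Assumed value: allow five minutes between heartbeats.
    watchdog = Watchdog(timeout_seconds=300)
    watchdog.ping()
    if watchdog.is_silent():
        print("Alerting pipeline appears to be down")
```

The clock is injectable so that the timeout logic can be tested without actually waiting. A real service would also need an HTTP endpoint to receive the pings and an independent channel for sending the notification.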

While this project is not really ready for use yet, we hope to have it available for initial trials in a couple of weeks. If you have thoughts or opinions about how this is best accomplished - let us know in the project repository!

Observing the Data Platform

Together with the Data Platform team, we’re making strides in how we efficiently and effortlessly monitor Data Platforms. The first step on this journey was the Grafana Agent machine charm, making it possible to observe machine charms in general. The next step has been to integrate this work with the Kafka machine charm.

Our findings so far suggest that this combination is both powerful and convenient to use. Integrating your own charms with the machine charm is fairly straightforward following this how-to guide written by José.

Dashing LMA Dashboards

Last but not least, we’ve been hard at work making sure that the existing LMA Dashboards, to the extent possible, are cross-compatible with the COS stack when related over the COS Proxy. If you encounter dashboards that do not work as intended, raise an issue in the proxy repo and we’ll either fix it or try to provide you with options.

Feedback welcome

As always, feedback is very welcome! Feel free to let us know your thoughts, questions, or suggestions either here or on the CharmHub Mattermost.

That’s all for this time! See you again in two weeks!

Great write-up @0x12b !

So, over the past few months, https://dwellir.com has been investing in working with the COS stack, using it in exploration and labs with both physical servers and multiple LXD clouds. This work has been driven by @awnns and @marcus and we have fought hard to get it in place.

It’s been challenging and we recognize that there are still rough edges, but we totally feel the potential and will continue working with you at Canonical and the Juju community to get it all polished!

Today, @marcus made a breakthrough: he managed to place a k8s cluster (MicroK8s) within a proper LXD controller and set up a cross-model integration with an LXD-based charm/model ( @stgraber ), using the grafana-agent mentioned above. Perhaps you attended the demo in the Juju community last week, hosted by @awnns .

This means that our production infrastructure can soon be monitored with COS Lite, regardless of whether it’s k8s, LXD, or physical.

We walk this extra mile to reduce complexity, increase efficiency and add a level of automation we need to maintain quality. We shall now work up confidence and experience with the new tools and we’ll share our experiences in this forum and others.

On that note, next Friday we will host a workshop here to go through what we have done to get this working and explain in more detail how we plan to use it. We will also make ourselves available to others we know are exploring the same track ( @jamesbeedy ).

Below are screenshots from the working COS Lite stack, plus a test integration with the ubuntu VM charm running in LXD.

Awesome!

Looking forward to it!

It is wonderful to see that after almost 3 years of hard work our COS can start to be used in production! :heart_eyes: :heart_eyes: :heart_eyes:

As you said, @erik-lonroth, we know that there are still rough edges that need to be polished, but we’re working on it!

Unfortunately, I was not able to attend the demo. If there is a video of it, please let me know!

@jose we have been following your progress from a distance and we are also keen on getting down to business about this.

But I will also stress that we need to see this running for a while before we place it in the heart of the prodstack.

We will put it to the test in a test environment with some partial real load, to see that we get it right. This will involve testing the push and pull of all the metrics, logs, etc., but also going through upgrades, backups, dashboards, alert rules, etc. to see where the edges are.

We plan to live in that situation for about a month or so at least.

Sorry for the late reply, but the workshop has been canceled.

Last time, @awnns showed us how to deploy MicroK8s inside an LXD container, deploy COS Lite inside the Kubernetes cluster, and then relate it with the grafana-agent machine charm. The planned workshop session was supposed to be almost an exact replica, except that this time we would use the MicroK8s snap installation instead.

Even though this was a big step for us, we felt that it was a bit unnecessary to arrange a workshop for it. It would be fun to show off some more next time… we’ll post when this is decided. Have a great weekend!

Please note that the workshop has been removed from the Juju calendar for today, as @marcus is still in the process of getting it ready.

I just removed it, so I hope there is no misunderstanding about it. @ppasotti @tmihoc @0x12b

We are making progress =)

A small update here:

We have explored further how to place and integrate COS into our Juju environment and are about to consolidate it with our VM charms.

This is still on our map.
