A few questions about observability and COS Lite-related stuff

We have been working hard over the last year to understand more and more about the COS Lite stack, slowly working our way towards being able to deploy it into a production-like environment within our company.

We have worked with and learned from @0x12b to deploy it in a MicroK8s stack, watched the "Deploy Grafana Agent" how-to on Charmhub mature, and have now also developed a few example charms to learn how to work with it.

At this point, as we take the last few steps into production, we have some remaining questions that we would appreciate some help getting clarity on:

  1. What are the technical details of how cross-controller cross-model integrations work?
  • What prerequisites are there for a user to offer and consume integrations across two separate controllers, for example?
  • What are the benefits/drawbacks of using separate controllers for Kubernetes clouds and VM clouds, as opposed to adding "lxd-cloud -> k8s-controller" or "k8s-cloud -> lxd-controller"?
  2. Loki: We haven't yet been able to figure out how to get the COSAgentProvider to point to specific log files. How do we do that? Perhaps there are some docs on this?

  3. COS Lite uses an old version of Grafana. Will the charms in the COS Lite bundle always handle this, or how do upgrades of Grafana and the other components in the stack work?

  4. We would like to go to production on the stable channel, but Traefik currently has an issue that causes it to lose its network address after a system reboot. Is there a hotfix that would allow us to go into production on a stable channel?

  5. Backup/restore procedures: is there anything documented on this for the COS Lite stack?

  6. TLS: how can we set up certificates that allow us to expose Grafana/Prometheus/Traefik externally? Is there anything written on this? Guides to follow?

  7. Are there any examples or information on how an alert rule needs to be constructed for Prometheus and Loki? We have looked into the repo https://github.com/canonical/cos-configuration-k8s-operator/tree/main/tests/samples, but it doesn't explain the general method/process for creating your own dashboards, rules, etc. It's hard to create your own.

  8. Are there any descriptions of how we create integrations with external tools such as Zammad, Node-RED, and PagerDuty?

  9. How do you actually manage to UPDATE the Grafana dashboards as part of upgrading a charm using the COSAgentProvider, since this is not supported by the Grafana API? We are super interested in learning the details of this, since it would help us in the development process of our own dashboards.

We'll work hard from our end to figure out the above, but I figure some of you have already covered this ground, or some of it.



Great to hear you’re making progress! :fire: :fire_engine:

With multiple substrates, it generally makes sense to also have one Juju controller per substrate. This reduces the risk of the controller becoming a bottleneck, as well as the blast radius in case anything goes wrong in the substrate hosting the controller.

The prerequisite is that both controllers need to be able to contact each other. For Kubernetes this means that the controller needs to be bootstrapped with the following configuration options explicitly set:

  controller-external-ips:
    type: list
    description: Specifies a comma separated list of external IPs for a k8s controller of type external
  controller-external-name:
    type: string
    description: Sets the external name for a k8s controller of type external
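As a sketch, the end-to-end flow might look like the following. The controller names, domain, IP, and offer names are illustrative; the endpoint and exact syntax may differ depending on your Juju version, so check `juju help offer` and `juju help consume`.

```shell
# Hypothetical setup: a MicroK8s controller "k8s" hosting COS,
# and a machine controller "lxd" hosting the workloads.

# Bootstrap the k8s controller so the machine controller can reach it:
juju bootstrap microk8s k8s \
  --config controller-external-name=k8s-ctrl.example.com \
  --config controller-external-ips='[10.64.140.43]'

# On the k8s controller, offer an endpoint from the COS model:
juju offer cos.prometheus:receive-remote-write prometheus

# On the machine controller, consume the offer and relate to it:
juju switch lxd
juju consume k8s:admin/cos.prometheus prometheus
juju relate grafana-agent prometheus
```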

The Grafana Agent machine charm will by default slurp up all logs either in journalctl or in /var/log automatically. As the agent itself is a strictly confined snap, it does not have access to arbitrary locations on disk.

If your workload is also running as a strictly confined snap, you can use the log_slots argument of the COSAgentProvider together with the snap's content interface to give the agent access to the logs.
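As a rough sketch (the charm name and the "my-app:logs" content-interface slot are illustrative; check the cos_agent library documentation for the exact signature), wiring this up in a machine charm might look like:

```python
# Sketch of a machine charm passing snap log slots to the Grafana Agent.
import ops
from charms.grafana_agent.v0.cos_agent import COSAgentProvider


class MyAppCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self._cos_agent = COSAgentProvider(
            self,
            # Content-interface slots exposing log locations,
            # in "snap-name:slot-name" form (illustrative values):
            log_slots=["my-app:logs"],
        )
```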

Yes, we will take care of the upgrade paths. If there is ever a point where we can't do this automatically, we will make sure to point it out clearly, and make sure it happens in a major release rather than in a patch or minor release.

We are looking to promote all current edges of our charms to stable as soon as possible. I appreciate that you restate the importance of this and apologize for it not happening earlier.

Currently, no. Backup and restore procedures are limited to "as you do with the rest of your charms". This is definitely an area we need to improve going forward, however.

@sed-i has done work on allowing for traefik to terminate TLS at the ingress, which should allow you to do this. Mind assisting with some details here, Leon?

Happy to arrange a session to show you this! There is a how-to here on Charmhub that covers the basics.
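For reference, a minimal rule file in the standard upstream format (group name, expression, and threshold here are illustrative) might look like this; the same structure works for Loki rules, with a LogQL expression instead of PromQL:

```yaml
groups:
  - name: my-app-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP 5xx rate on {{ $labels.instance }}"
```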

For Zammad and Node-RED we do not have any descriptions or documentation, as these are not tools we work with ourselves. For PagerDuty, it's the same as for upstream Alertmanager, meaning that you should be able to follow this guide, and then attach the config file to Alertmanager through the config option.
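A minimal Alertmanager configuration with a PagerDuty receiver, for illustration (the integration key is a placeholder; consult the upstream Alertmanager documentation for the full set of `pagerduty_configs` fields):

```yaml
route:
  receiver: pagerduty
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: "<your-pagerduty-integration-key>"
```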

Our procedure for doing this is available in the GrafanaDashboardConsumer class of the grafana_dashboard library.
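In broad strokes, dashboards travel over relation data rather than through the Grafana HTTP API: the provider side serializes, compresses, and base64-encodes the dashboard JSON, and the consumer side decodes it and writes it to disk for Grafana to provision. A minimal sketch of that encoding round-trip (the function names here are illustrative, not the library's API):

```python
import base64
import json
import lzma


def encode_dashboard(dashboard: dict) -> str:
    """Serialize, LZMA-compress, and base64-encode a dashboard for relation data."""
    raw = json.dumps(dashboard).encode("utf-8")
    return base64.b64encode(lzma.compress(raw)).decode("ascii")


def decode_dashboard(blob: str) -> dict:
    """Reverse of encode_dashboard: base64-decode, decompress, deserialize."""
    raw = lzma.decompress(base64.b64decode(blob))
    return json.loads(raw.decode("utf-8"))


dash = {"title": "My Service", "panels": []}
assert decode_dashboard(encode_dashboard(dash)) == dash
```

Since the dashboard is re-sent whenever the relation data changes, upgrading the charm with a new dashboard naturally propagates the update without any Grafana API calls.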

Hope that helps,



This is an issue for sure, since many applications we might want to "observe" may have their logs pretty much anywhere. Placing a constraint on the log file location is a major problem; some of the software we run doesn't allow changing the location to fit this requirement. Is there any way we can get around this?

Thanks! We can wait for it, and will happily test out more things before we put it into production.

At first glance I'm not able to tell how you actually interact with Grafana to put the dashboard in place, but apparently you are doing this on the Kubernetes side using the relation data containing the dashboard. We thought we would be able to use the Grafana API for updating dashboards, but that seems not to be possible. Thanks for the pointer, though.

This is super helpful for us going forward! :pray: :mechanical_arm: :brain: