COS-Lite docs - Managing deployments of COS-Lite HA addons

:warning: Some of the HA addons discussed in this article are under development and, at the time of writing, not suitable for production use :warning:

Certain HA COS-Lite addons, such as Tempo, Mimir and Loki, can be managed in similar ways by virtue of their common “Coordinator/Workers” architecture.

In a nutshell, the coordinator charm acts as the single integration point for all relations to and from cos-lite and external charms, while independently scalable worker applications run the various components of the distributed application.

The coordinator charm also deploys an nginx server that acts as a load balancer for the worker units, and can therefore itself be scaled according to your needs.

Finally, the coordinator charm generates the configuration that each worker node runs with and sends it to the workers over the cluster relation.

All worker nodes run exactly the same binary with the same configuration file; each is started with a different command-line flag that selects which component(s) it runs. We refer to this selection as the ‘role’ or ‘roles’ that a worker runs with.

This architecture, which is shared by several Grafana applications such as Tempo, Mimir and Loki, is what enabled us to write three charm bundles following a very similar pattern.

Deploying Coordinator/Workers charmed applications

When you deploy a Coordinator/Workers application such as Tempo, you need to deploy a single instance of the coordinator charm and at least one instance of the worker charm.

By convention, we drop the ‘coordinator’ part of the charm name when naming the coordinator app (tempo-coordinator-k8s is deployed as tempo, for instance). It feels more natural to refer to the application without mentioning its role in the deployment, since the coordinator is its single point of contact: you will very rarely need to interact with the worker charms manually, other than to scale them up or down or to change their roles (see below). Also, the only integration supported by worker charms is the one with their respective coordinator. All external integrations go through the coordinator.

For example:

juju deploy tempo-coordinator-k8s tempo --trust
juju deploy tempo-worker-k8s tempo-worker --trust
juju relate tempo tempo-worker

Integrating with s3

Tempo, Mimir and Loki all require an s3 integration for object storage. Without that integration, the charm will set a blocked status and the application will not start.

As such, if you don’t already have an application providing an s3 endpoint to integrate with the coordinator, you should deploy one. For testing purposes, we recommend following this guide to deploy minio and the s3-integrator.

:warning: Note that, as of rev 41 of s3-integrator, each HA solution (e.g. Tempo HA) requires its own s3-integrator application instance. Each s3-integrator application can hold only one bucket configuration, and different HA solutions require distinct buckets, so multiple s3-integrator applications must be deployed.

Understanding roles

Each worker charm, depending on the application, can take one or multiple roles.

For the deployment as a whole to be consistent, every role in a certain set of required roles needs to be assigned to at least one worker node. The coordinator charm collects the roles of each worker application and, if any required role is missing, sets a blocked status, stops any services that are running, and does not start them again until the deployment is brought back to a consistent state.

Additionally, some roles are “optional”. If all required roles are assigned, but not all optional ones, the coordinator charm will notify the user by setting an active status with a message signalling that the deployment is degraded.

Finally, some roles are ‘meta’, effectively providing shortcuts to common groupings of individual atomic roles (typically, separating read/write/backend application functionality).

| application | can a worker run multiple roles at once? | required roles | optional roles | meta roles |
|---|---|---|---|---|
| tempo | no | querier, query-frontend, ingester, distributor, compactor | metrics-generator | all |
| mimir | yes | query-scheduler, query-frontend, querier, store-gateway, ingester, distributor, compactor, ruler, alertmanager | overrides-exporter, flusher | read, write, backend, all |
| loki | yes | read, write, backend | - | all |
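
As a rough illustration of the consistency rules above, the following shell sketch classifies a Tempo deployment from the set of roles enabled across its workers. The function name deployment_status is hypothetical and the role lists are copied from the Tempo row of the table. Treating all as covering every required and optional role is an assumption made for illustration. The coordinator charm's real logic may differ.

```shell
# Hypothetical sketch, not the coordinator charm's actual code.
# Role lists are taken from the Tempo row of the roles table.
REQUIRED_ROLES="querier query-frontend ingester distributor compactor"
OPTIONAL_ROLES="metrics-generator"

# Usage: deployment_status ROLE...
# Prints the status a coordinator might report for the given enabled roles.
deployment_status() {
  enabled=" $* "
  # Assumption: the "all" meta-role stands for every required and optional role.
  case "$enabled" in
    *" all "*) echo "active"; return 0 ;;
  esac
  # A single missing required role makes the deployment inconsistent.
  for role in $REQUIRED_ROLES; do
    case "$enabled" in
      *" $role "*) ;;
      *) echo "blocked: missing $role"; return 0 ;;
    esac
  done
  # Missing optional roles only degrade the deployment.
  for role in $OPTIONAL_ROLES; do
    case "$enabled" in
      *" $role "*) ;;
      *) echo "active (degraded): missing $role"; return 0 ;;
    esac
  done
  echo "active"
}
```

For example, deployment_status querier ingester reports a blocked state, because query-frontend is not covered by any worker.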

You can control what roles a worker charm is taking by using Juju config options called role-<ROLE_NAME>. The pattern is shared between all Coordinator/Workers charms. For example, in tempo, you can:

juju config tempo-worker role-ingester=true # enable ingester role
juju config tempo-worker role-distributor=false # disable distributor role

“Monolithic” deployment

All worker charms support the “all” meta-role, which runs the application enabling all components. This is the default configuration of the worker charm.

This means that a deployment consisting of a single worker charm with the “all” role enabled is consistent. This is what we call the “monolithic” deployment.

“Distributed” deployment

You can configure a worker application to enable only the role(s) you like. This allows you to flexibly control your deployment topology and fine-tune the placement of the different application paths in your infrastructure.

tempo-worker only supports running a single role per node, so if you want to configure a worker app to run, say, as a querier, you will need to disable all other roles first.
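
One way to avoid leaving a stray role enabled is to set every role option explicitly in a single juju config call. The helper below is a minimal sketch of that idea: single_role_args is a hypothetical name, and the role list is taken from the Tempo row of the roles table.

```shell
# Hypothetical helper: build role-<ROLE_NAME>=<bool> arguments that enable
# exactly one Tempo role and disable all the others.
TEMPO_ROLES="all querier query-frontend ingester distributor compactor metrics-generator"

# Usage: single_role_args ROLE
single_role_args() {
  wanted=$1
  args=""
  for role in $TEMPO_ROLES; do
    if [ "$role" = "$wanted" ]; then
      args="$args role-$role=true"
    else
      args="$args role-$role=false"
    fi
  done
  # Drop the leading space before printing.
  echo "${args# }"
}
```

Assuming the worker application is named tempo-worker, as in the earlier deployment example, this could be used as juju config tempo-worker $(single_role_args querier).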

To deploy Coordinator/Workers applications in a distributed manner, you will need to deploy multiple instances of the worker charm instead of a single one, and configure them to take different (possibly overlapping) subsets of the required roles. Note that the union of the enabled roles must cover the full set of required roles for the application at all times, or the coordinator charm will set a blocked status and the application will go down.

For example, this will deploy a mimir cluster with individual read/write/backend path workers:

juju deploy mimir-worker-k8s --config role-read=true --config role-all=false mimir-read
juju deploy mimir-worker-k8s --config role-write=true --config role-all=false mimir-write
juju deploy mimir-worker-k8s --config role-backend=true --config role-all=false mimir-backend

Then deploy a coordinator charm and integrate all the workers with it over the cluster relation and you should have a working mimir cluster!
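
The integration step could look like the sketch below. relate_workers is a hypothetical helper that integrates a coordinator with each worker application in turn, using the same juju relate form as the Tempo example earlier. Setting JUJU=echo prints the commands instead of running them.

```shell
# Hypothetical helper: integrate one coordinator with several worker apps.
# Usage: relate_workers COORDINATOR WORKER...
# Set JUJU=echo for a dry run that only prints the commands.
relate_workers() {
  coordinator=$1
  shift
  for worker in "$@"; do
    "${JUJU:-juju}" relate "$coordinator" "$worker"
  done
}
```

For example, once the coordinator is deployed as mimir (the charm name mimir-coordinator-k8s is assumed here by analogy with tempo-coordinator-k8s), relate_workers mimir mimir-read mimir-write mimir-backend integrates all three workers with it.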

Transitioning between deployment modes and reassigning roles

It is possible to turn a monolithic deployment into a distributed one and the other way around. So long as the storage backend (s3) remains alive, you will have no data loss. To minimize downtime, however, you will need to take precautions so that the cluster remains consistent throughout the transition.

The easiest way to ensure this while you move, remove or redistribute worker roles is to deploy a temporary worker node with the default ‘all’ role enabled and integrate it with the coordinator. The application will then remain alive while you reconfigure the other workers.
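
A sketch of that approach for Tempo follows. The application name tempo-temp and both function names are hypothetical; the charm and coordinator names come from the deployment example earlier. Setting JUJU=echo gives a dry run that only prints the commands.

```shell
# Hypothetical helpers for keeping the cluster consistent during a transition.
# TEMP_APP is an illustrative name for the temporary worker application.
TEMP_APP="${TEMP_APP:-tempo-temp}"

# Deploy a temporary worker (default role: all) and integrate it with the
# coordinator application, assumed to be named "tempo".
add_temporary_worker() {
  "${JUJU:-juju}" deploy tempo-worker-k8s "$TEMP_APP" --trust
  "${JUJU:-juju}" relate tempo "$TEMP_APP"
}

# Remove the temporary worker once the other workers cover all required roles.
remove_temporary_worker() {
  "${JUJU:-juju}" remove-application "$TEMP_APP"
}
```

Run add_temporary_worker, wait until juju status shows the new worker as active, reassign roles on the other workers, and only then run remove_temporary_worker.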

In the “Understanding roles” section, which includes the table with sample meta-role combinations, it is unclear what the purpose of the “multiple” column is. Is this multiple applications, meta-roles, …? I assume it refers to multiple meta-roles, but it could be clearer.

It’s meant to indicate whether you can enable multiple roles at the same time: for Mimir and Loki, you can run juju config mimir role-read=true role-write=true, but in Tempo multiple roles can’t be enabled at the same time. We should probably word it better!
