One grafana-agent charm to rule them all.

When we deploy the grafana-agent charm in our models, it is crucial to relate only one instance of this charm to each of our charmed applications.

If we allow our charms to relate to more than one instance of grafana-agent (by not setting a limit like this one), we may end up in inconsistent states that are hard to recover from.

Conceptually, it also makes no sense to have more than one agent sending the same telemetry (metrics, logs, traces, alert rules and dashboards) to COS.
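The rule above can be checked mechanically. The following is a minimal sketch, not part of any Juju tooling: the function name and both input structures (an application-to-charm mapping and a list of related application pairs) are hypothetical, and would have to be built from whatever view of the model you have. The application names are the ones used in the scenario in this post, and the charm name for ose is assumed to be openstack-exporter.

```python
from collections import defaultdict


def apps_with_multiple_agents(charm_of, relations, agent_charm="grafana-agent"):
    """Return {app: [agent apps]} for apps related to more than one agent app.

    charm_of:  hypothetical mapping of application name -> charm name.
    relations: hypothetical list of (app_a, app_b) pairs that are related.
    """
    agents = defaultdict(set)
    for a, b in relations:
        # A relation is unordered, so look at it from both sides.
        for app, other in ((a, b), (b, a)):
            if charm_of.get(other) == agent_charm and charm_of.get(app) != agent_charm:
                agents[app].add(other)
    return {app: sorted(names) for app, names in agents.items() if len(names) > 1}


# Application names taken from the scenario in this post.
charm_of = {
    "ose": "openstack-exporter",
    "ga-one": "grafana-agent",
    "ga-two": "grafana-agent",
}
relations = [("ose", "ga-one"), ("ose", "ga-two")]
print(apps_with_multiple_agents(charm_of, relations))
```

An empty result means every application is related to at most one grafana-agent application.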

Scenario

Let’s imagine we have deployed:

and we have created relations between ose (openstack-exporter) and both ga-one and ga-two (two grafana-agent applications), so our deployment looks like this:

Only one service and one config file for 2 charms

By ssh-ing into the machine where ose is running, we can verify that two grafana-agent units are running:

ubuntu@juju-292e9c-0:~$ ls -1 /var/lib/juju/agents/ | grep "unit-ga"
unit-ga-one-0
unit-ga-two-0

But, as both charm instances install the grafana-agent snap and write the same config file, we have only one service (managed by both charms) and only one config file (written by both charms):

ubuntu@juju-292e9c-0:~$ ps ax | grep "/etc/grafana-agent.yaml"
  20651 ?        Ssl    0:01 /snap/grafana-agent/94/agent -config.expand-env -config.file /etc/grafana-agent.yaml

Classic confinement vs strict confinement

When a grafana-agent charm (ga-one) is related to any charm (ose in our example), it installs the grafana-agent snap on the machine where that charm is running:

If we relate another grafana-agent charm (ga-two) to ose, the grafana-agent snap is refreshed, so there is still only one snap installed:

ubuntu@juju-292e9c-0:~$ snap list | grep "grafana-agent"
grafana-agent               0.40.4             95     latest/stable  0x12b        classic,held

By default, the grafana-agent charm installs the snap in classic confinement, but we can configure the charm to use strict confinement.

So let’s configure only ga-one to not use classic confinement:

$ juju config ga-one classic_snap=false 

We can verify this new config value is now false by running:

$ juju config ga-one | grep -A7 classic_snap
  classic_snap:
    default: true
    description: |
      Choose whether to use the classic snap over the strictly confined
      one. Defaults to "true".
    source: user
    type: boolean
    value: false

But what about the config value of the other grafana-agent charm?

$ juju config ga-two | grep -A7 classic_snap
  classic_snap:
    default: true
    description: |
      Choose whether to use the classic snap over the strictly confined
      one. Defaults to "true".
    source: default
    type: boolean
    value: true

It is still true, so ga-one and ga-two now disagree about which confinement the shared snap should use.

Now an interesting question may arise: What kind of confinement does the snap end up having?

ubuntu@juju-292e9c-0:~$ snap list | grep "grafana-agent"
grafana-agent               0.40.4             94     latest/stable  0x12b        held

It’s strict confinement. If it weren’t, we would see ‘classic’ in the Notes column.
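The same check can be scripted. This is a small sketch that only looks at the last column (Notes) of a `snap list` row, as was done by eye above; the function name is made up for illustration, and a row without ‘classic’ is reported simply as "not classic" rather than assumed to be any particular confinement.

```python
def is_classic(snap_list_row):
    """Return True if the Notes column of a `snap list` row contains 'classic'."""
    fields = snap_list_row.split()
    # Notes is the last column; several notes are separated by commas.
    notes = fields[-1].split(",") if fields else []
    return "classic" in notes


# The two rows shown earlier in this post.
before = "grafana-agent               0.40.4             95     latest/stable  0x12b        classic,held"
after = "grafana-agent               0.40.4             94     latest/stable  0x12b        held"

print(is_classic(before))
print(is_classic(after))
```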

Relation between ga-one and ose removed

Now, let’s remove one of the relations between openstack-exporter and grafana-agent:

$ juju remove-relation ga-one ose

Now our model looks like this:

We can verify that only one grafana-agent unit (ga-two/0) is running on the machine:

ubuntu@juju-292e9c-0:~$ ls -1 /var/lib/juju/agents/ | grep "unit-ga"
unit-ga-two-0

Removing the relation between ga-one and ose also uninstalls the grafana-agent snap, even though ga-two is still related to ose. We can see that no grafana-agent service is running:

ubuntu@juju-292e9c-0:~$ systemctl --type=service --state=running | grep grafana-agent
ubuntu@juju-292e9c-0:~$ 

So, at this point we have one grafana-agent charm (ga-two) still deployed, but no grafana-agent snap installed or running.

Relation between ga-two and ose removed after the previous removal

Now, let’s remove the relation to the last grafana-agent charm (ga-two):

$ juju remove-relation ga-two ose

The relation is not removed, and ga-two ends up in an error state because it cannot stop a service that is not running (or uninstall a snap that was already removed).

We can verify this by reading the Juju logs:

unit-ga-two-0: 10:38:51.521 DEBUG unit.ga-two/0.juju-log Emitting Juju event stop.
unit-ga-two-0: 10:38:51.990 ERROR unit.ga-two/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/operator_libs_linux/v2/snap.py", line 309, in _snap_daemons
    return subprocess.run(args, universal_newlines=True, check=True, capture_output=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['snap', 'stop', '--disable', 'grafana-agent']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ga-two-0/charm/./src/charm.py", line 279, in _on_stop
    self.snap.stop(disable=True)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/operator_libs_linux/v2/snap.py", line 375, in stop
    self._snap_daemons(args, services)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/operator_libs_linux/v2/snap.py", line 311, in _snap_daemons
    raise SnapError("Could not {} for snap [{}]: {}".format(args, self._name, e.stderr))
charms.operator_libs_linux.v2.snap.SnapError: Could not ['snap', 'stop', '--disable', 'grafana-agent'] for snap [grafana-agent]: error: snap "grafana-agent" not found


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ga-two-0/charm/./src/charm.py", line 701, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/__init__.py", line 343, in __call__
    return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 543, in main
    manager.run()
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 529, in run
    self._emit()
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 518, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 134, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1064, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-ga-two-0/charm/./src/charm.py", line 281, in _on_stop
    raise GrafanaAgentServiceError("Failed to stop grafana-agent") from e
GrafanaAgentServiceError: Failed to stop grafana-agent
unit-ga-two-0: 10:38:52.248 ERROR juju.worker.uniter.operation hook "stop" (via hook dispatching script: dispatch) failed: exit status 1
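The traceback shows the stop hook failing because `snap stop` reports that the snap is not found. As an illustration only, a stop routine could treat that specific error as "already stopped" instead of failing the hook. The sketch below is not the charm's actual code: `SnapError` and `GrafanaAgentServiceError` are local stand-ins named after the classes in the traceback, and `stop_agent` and `MissingSnap` are hypothetical.

```python
class SnapError(Exception):
    """Local stand-in for the snap library's error type seen in the traceback."""


class GrafanaAgentServiceError(Exception):
    """Local stand-in for the charm's error type seen in the traceback."""


def stop_agent(snap):
    """Stop the snap's services; return False if the snap was already gone."""
    try:
        snap.stop(disable=True)
    except SnapError as e:
        # Assumption: matching on the message text is good enough for a sketch.
        if "not found" in str(e):
            return False  # nothing left to stop
        raise GrafanaAgentServiceError("Failed to stop grafana-agent") from e
    return True


class MissingSnap:
    """Hypothetical fake that behaves like a snap that was already removed."""

    def stop(self, disable=False):
        raise SnapError('error: snap "grafana-agent" not found')


print(stop_agent(MissingSnap()))
```

Whether swallowing this error is the right behaviour for the real charm is a design decision for its maintainers; the safer fix remains relating only one grafana-agent charm.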

How to solve this situation?

One way to solve this is:

  • Install the grafana-agent snap on the machine where ga-two/0 is running:

    $ juju ssh ga-two/0 sudo snap install grafana-agent
    grafana-agent 0.40.4 from Simon Aronsson (0x12b) installed
    Connection to 10.51.132.165 closed.
    
  • And resolve the unit:

    $ juju resolve ga-two/0
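The two steps above can be wrapped in a small script. This is a sketch that simply shells out to the same two commands shown in the steps; the function names are made up, and the `run` parameter exists only so the commands can be inspected or faked without a live Juju model.

```python
import subprocess


def recovery_commands(unit, snap="grafana-agent"):
    """Build the two commands from the steps above for the given unit."""
    return [
        ["juju", "ssh", unit, "sudo", "snap", "install", snap],
        ["juju", "resolve", unit],
    ]


def recover(unit, run=subprocess.run):
    """Run the recovery commands in order, stopping at the first failure."""
    for cmd in recovery_commands(unit):
        run(cmd, check=True)


# Print the commands instead of running them; call recover("ga-two/0") to execute.
for cmd in recovery_commands("ga-two/0"):
    print(" ".join(cmd))
```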
    
As for "One grafana-agent charm to rule them all", it would be excellent to also provide a way to resolve the situation you’ve gotten yourself into. One way of doing this would be to ssh into the machine, reinstall the snap, and then run juju resolve.

Good catch! I’ll add it to the post! Thanks Simme!