When we deploy the grafana-agent charm in our models, it is crucial to relate only one instance of this charm to each of our charmed applications.
If we allow our charms to relate to more than one instance of grafana-agent (by not setting a limit like this one), we may end up in weird situations.
Also, conceptually it makes no sense to have more than one agent sending the same telemetry (metrics, logs, traces, alert rules and dashboards) to COS.
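One way to spot a missing limit before deploying is to inspect the charm's relation metadata. The following is a minimal sketch, assuming the metadata has already been parsed into a dictionary. The helper name `endpoints_without_limit`, the endpoint name `cos-agent` and the interface name `cos_agent` are illustrative assumptions, not taken from any specific charm revision.

```python
from typing import Any, Dict, List


def endpoints_without_limit(metadata: Dict[str, Any], interface: str) -> List[str]:
    """Return the endpoints using `interface` that do not declare `limit: 1`."""
    unlimited = []
    for role in ("provides", "requires"):
        for name, spec in (metadata.get(role) or {}).items():
            if spec.get("interface") == interface and spec.get("limit") != 1:
                unlimited.append(name)
    return unlimited


# Hypothetical metadata for a charm that can be related to grafana-agent.
unsafe = {"provides": {"cos-agent": {"interface": "cos_agent"}}}
safe = {"provides": {"cos-agent": {"interface": "cos_agent", "limit": 1}}}

print(endpoints_without_limit(unsafe, "cos_agent"))  # ['cos-agent']
print(endpoints_without_limit(safe, "cos_agent"))    # []
```

A check like this could run in a charm's CI, so that an endpoint accepting several grafana-agent relations is caught before it reaches a model.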
Scenario
Let’s imagine we have deployed:
- openstack-exporter (a charm that in latest/stable rev. 31 does not limit the number of grafana-agent relations), named ose-one (called ose in the rest of this post)
- 2 applications of grafana-agent: say ga-one and ga-two
and we have created relations between ose and both ga-one and ga-two.
Only one service and one config file for 2 charms
By ssh-ing into the machine where ose is running, we can verify that we have 2 grafana-agent charms running:
ubuntu@juju-292e9c-0:~$ ls -1 /var/lib/juju/agents/ | grep "unit-ga"
unit-ga-one-0
unit-ga-two-0
But, as both charm instances install the grafana-agent snap and write the same config file, we have only one service (managed by both charms) and only one config file (written by both charms):
ubuntu@juju-292e9c-0:~$ ps ax | grep "/etc/grafana-agent.yaml"
20651 ? Ssl 0:01 /snap/grafana-agent/94/agent -config.expand-env -config.file /etc/grafana-agent.yaml
classic confinement vs strict confinement
When a grafana-agent charm (ga-one) is related to any charm (in our example ose), it installs the grafana-agent snap on the machine where that charm is running.
If we relate another grafana-agent charm (ga-two) to ose, the grafana-agent snap is refreshed rather than installed a second time, so there is still only one snap on the machine:
ubuntu@juju-292e9c-0:~$ snap list | grep "grafana-agent"
grafana-agent 0.40.4 95 latest/stable 0x12b classic,held
By default, the grafana-agent charm installs the snap in classic confinement, but we can configure the charm to use strict confinement.
So let’s configure only ga-one to not use classic confinement:
$ juju config ga-one classic_snap=false
We can verify that this config value is now false by running:
$ juju config ga-one | grep -A7 classic_snap
classic_snap:
default: true
description: |
Choose whether to use the classic snap over the strictly confined
one. Defaults to "true".
source: user
type: boolean
value: false
But what happens to the config value of the other grafana-agent charm?
$ juju config ga-two | grep -A7 classic_snap
classic_snap:
default: true
description: |
Choose whether to use the classic snap over the strictly confined
one. Defaults to "true".
source: default
type: boolean
value: true
It is still true.
Now an interesting question may arise: What kind of confinement does the snap end up having?
ubuntu@juju-292e9c-0:~$ snap list | grep "grafana-agent"
grafana-agent 0.40.4 94 latest/stable 0x12b held
It’s strict confinement. If it weren’t, we would see ‘classic’ in the last column, as in the earlier output.
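The same check can be scripted. Below is a minimal sketch that reads one data row of `snap list` output and looks at the last column (Notes) for the `classic` or `devmode` markers. The function name `confinement_from_snap_list` is hypothetical, and the sketch assumes the six-column layout seen in the outputs above.

```python
def confinement_from_snap_list(row: str) -> str:
    """Infer the confinement of a snap from one data row of `snap list`."""
    # Assumed columns: Name, Version, Rev, Tracking, Publisher, Notes
    columns = row.split()
    notes = columns[5].split(",") if len(columns) > 5 else []
    for marker in ("classic", "devmode"):
        if marker in notes:
            return marker
    # No marker in the Notes column means the snap is strictly confined.
    return "strict"


before = "grafana-agent 0.40.4 95 latest/stable 0x12b classic,held"
after = "grafana-agent 0.40.4 94 latest/stable 0x12b held"

print(confinement_from_snap_list(before))  # classic
print(confinement_from_snap_list(after))   # strict
```

The two sample rows are the ones shown in this post, before and after changing the config of ga-one.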
Relation between ga-one and ose removed
Now, let’s remove one of the relations between openstack-exporter and grafana-agent:
$ juju remove-relation ga-one ose
We can verify that only one grafana-agent charm (ga-two) is now running on the machine:
ubuntu@juju-292e9c-0:~$ ls -1 /var/lib/juju/agents/ | grep "unit-ga"
unit-ga-two-0
Removing the relation between ga-one and ose also uninstalled the grafana-agent snap, even though ga-two is still related to ose. We can see this by checking that no grafana-agent service is running:
ubuntu@juju-292e9c-0:~$ systemctl --type=service --state=running | grep grafana-agent
ubuntu@juju-292e9c-0:~$
So, at this point we have one grafana-agent charm deployed but no grafana-agent snap running.
Relation between ga-two and ose removed after the previous removal
Now, let’s remove the relation to the last grafana-agent charm (ga-two):
$ juju remove-relation ga-two ose
The relation is not removed, and ga-two ends up in error state, since it is not able to stop a service that is not running (or uninstall a snap that was already removed).
We can verify this by reading the juju logs:
unit-ga-two-0: 10:38:51.521 DEBUG unit.ga-two/0.juju-log Emitting Juju event stop.
unit-ga-two-0: 10:38:51.990 ERROR unit.ga-two/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/operator_libs_linux/v2/snap.py", line 309, in _snap_daemons
return subprocess.run(args, universal_newlines=True, check=True, capture_output=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['snap', 'stop', '--disable', 'grafana-agent']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-ga-two-0/charm/./src/charm.py", line 279, in _on_stop
self.snap.stop(disable=True)
File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/operator_libs_linux/v2/snap.py", line 375, in stop
self._snap_daemons(args, services)
File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/operator_libs_linux/v2/snap.py", line 311, in _snap_daemons
raise SnapError("Could not {} for snap [{}]: {}".format(args, self._name, e.stderr))
charms.operator_libs_linux.v2.snap.SnapError: Could not ['snap', 'stop', '--disable', 'grafana-agent'] for snap [grafana-agent]: error: snap "grafana-agent" not found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-ga-two-0/charm/./src/charm.py", line 701, in <module>
main(GrafanaAgentMachineCharm)
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/__init__.py", line 343, in __call__
return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 543, in main
manager.run()
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 529, in run
self._emit()
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 518, in _emit
_emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/_main.py", line 134, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/framework.py", line 347, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/framework.py", line 857, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-ga-two-0/charm/venv/ops/framework.py", line 947, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-ga-two-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1064, in wrapped_function
return callable(*args, **kwargs) # type: ignore
File "/var/lib/juju/agents/unit-ga-two-0/charm/./src/charm.py", line 281, in _on_stop
raise GrafanaAgentServiceError("Failed to stop grafana-agent") from e
GrafanaAgentServiceError: Failed to stop grafana-agent
unit-ga-two-0: 10:38:52.248 ERROR juju.worker.uniter.operation hook "stop" (via hook dispatching script: dispatch) failed: exit status 1
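The log shows the stop hook failing because `snap stop` reports that the snap is not found. Purely as an illustration of the failure mode, and not as the charm's actual code or fix, the sketch below treats a "not found" error as "already stopped" and re-raises anything else. `SnapError` here is a local stand-in for the library's exception, and `stop_if_installed` is a hypothetical helper. Matching on the error message is brittle and is only done to keep the sketch short.

```python
class SnapError(Exception):
    """Local stand-in for the snap library's error, so the sketch runs on its own."""


def stop_if_installed(stop_service) -> bool:
    """Call `stop_service`; return False if the snap was already gone.

    Any other failure is re-raised unchanged.
    """
    try:
        stop_service()
    except SnapError as e:
        if "not found" in str(e):
            return False
        raise
    return True


def stop_missing_snap():
    # Mimics the error message seen in the log above.
    raise SnapError('error: snap "grafana-agent" not found')


print(stop_if_installed(stop_missing_snap))  # False
print(stop_if_installed(lambda: None))       # True
```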
How to solve this situation?
One way to solve this is:
- Install grafana-agent on the machine where ga-two/0 is running:
$ juju ssh ga-two/0 sudo snap install grafana-agent
grafana-agent 0.40.4 from Simon Aronsson (0x12b) installed
Connection to 10.51.132.165 closed.
- And resolve the unit:
$ juju resolve ga-two/0