The purpose of this document is to observe the behaviour of Juju when a charm takes too long to finish handling an event and return control to Juju.
This document will address the following:
- What is “too long”?
- What happens to a charm’s leadership?
- What happens to the charm’s hooks lifecycle?
Context
Often, a charm is blocked on a synchronous operation, retries an API call, or, less commonly, just sleeps. That raises concerns, especially around Juju's leadership election, in which a leader unit is guaranteed to hold leadership for approximately 30s from the moment the leader-elected event is received.
The same holds for the current LEASE_RENEWAL_PERIOD: when the unit obtains its current leadership status through is_leader(), the value is cached for the duration of a lease, which is 30s in Juju. That, in turn, guarantees that the unit will have leadership for the next 30s (i.e. calling is_leader() in effect renews the guarantee for an additional 30s).
After that time, Juju might elect a different leader. So it seems that “too long” could mean anything beyond 30 seconds, after which bad things will happen. Right? The setup below tests exactly that scenario.
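To make the 30s window concrete, here is a minimal sketch of the arithmetic only. The names `LEASE` and `guarantee_lapsed` are hypothetical and exist just for this illustration; the 30s value is copied from the description above, not read from Juju or ops.

```python
import datetime

# Assumption: 30s mirrors the lease duration described above; it is not read from Juju.
LEASE = datetime.timedelta(seconds=30)


def guarantee_lapsed(confirmed_at, now, lease=LEASE):
    """Return True once the window that started at confirmed_at has fully elapsed."""
    return now - confirmed_at >= lease


# Leadership was last confirmed at this (arbitrary) moment.
confirmed = datetime.datetime(2024, 1, 1, 12, 0, 0)
after_short_call = confirmed + datetime.timedelta(seconds=10)
after_long_sleep = confirmed + datetime.timedelta(minutes=2)

print(guarantee_lapsed(confirmed, after_short_call))
print(guarantee_lapsed(confirmed, after_long_sleep))
```

A handler that blocks for 2 minutes is well past the window by the time it resumes, which is the situation the test below sets up.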
Test
We’ll use Loki as our test charm which we’ve modified to sleep for 2 mins at the end of every event if this charm unit is the leader unit.
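import logging
import time
from ops.charm import CharmBase, CollectStatusEvent

logger = logging.getLogger(__name__)

class LokiOperatorCharm(CharmBase):
    def __init__(self, *args):
        ...
        self.framework.observe(self.on.collect_unit_status, self._on_collect_unit_status)

    def _on_collect_unit_status(self, event: CollectStatusEvent):
        # Runs at the end of every event; only the leader unit blocks.
        if self.unit.is_leader():
            logger.debug("Sleeping for 2 mins")
            time.sleep(2 * 60)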
Then we deployed the charm with 2 units to see how leadership would change. Initially, loki/0 is the leader unit, so it is the one that sleeps at the end of every event.
After some time:
Although the loki/0 unit exceeded the LEASE_RENEWAL_PERIOD, the charm container was not killed and the unit remained leader for the entire lifecycle of events. You can see the 2-minute delay between fired events, which indicates that neither the leadership nor the event queue of the blocking unit was affected.
Conclusions
After the guaranteed 30s of leadership, leadership can in principle change at any time, even while a hook is running. However, Juju will only elect a new leader if it detects that the leader's unit agent is “down”. In the case above, the agent is not down; the charm is just blocking a hook for some time.
A new leader election will, however, happen if you kill the unit agent, for example with juju ssh loki/0 'kill -9 $(pgrep -f "/charm/bin/containeragent unit")'. If the agent remains down for some time (i.e. >30s), Juju will move on to another leader.
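Because leadership can still change once the guaranteed window has passed, a handler that blocks for a long time may want to re-check leadership before any leader-only write. The following is a minimal sketch of that pattern, not how the Loki charm does it. `run_leader_only` and `FakeUnit` are hypothetical names; `FakeUnit` only imitates the `is_leader()` call so the example runs without a Juju model.

```python
class FakeUnit:
    """Hypothetical stand-in for a unit; answers is_leader() from a scripted list."""

    def __init__(self, answers):
        self._answers = list(answers)

    def is_leader(self):
        return self._answers.pop(0)


def run_leader_only(unit, blocking_step, leader_write):
    """Run a slow step, then a leader-only write if leadership is still held."""
    if not unit.is_leader():
        return "skipped"
    blocking_step()  # e.g. a slow API call, a retry loop, or a sleep
    # The earlier answer may be stale by now, so ask again before writing.
    if not unit.is_leader():
        return "lost-leadership"
    leader_write()
    return "done"


writes = []
kept = run_leader_only(FakeUnit([True, True]), lambda: None, lambda: writes.append("ok"))
lost = run_leader_only(FakeUnit([True, False]), lambda: None, lambda: writes.append("bad"))
print(kept, lost, writes)
```

In a real charm the `unit` argument would be `self.unit`. The second check narrows the window in which a stale leader could write; it does not remove it entirely.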