Charm lifecycle

ppasotti · 22 March 2022 09:52

See also: Exploring event emission sequences with jhack tail

This document is about the lifecycle of a charm, specifically the Juju events that are used to keep track of it. These events are relayed to charm code by the Operator Framework in specific sequences depending on what’s going on in the Juju model.

It is common wisdom that, for the sake of resilience, charm code should not generally rely on event ordering. It can nonetheless be useful to understand the logic behind the timing of events, so as to avoid common mistakes and get a better picture of what is happening in your charm. In this document we’ll learn that:

A charm’s lifecycle can be seen to consist of three phases, each with characteristic events and sequences thereof. The fuzziest of the three is the Operation phase, where pretty much anything can happen short of setup events.
Not all events can reliably be assumed to occur in a specific temporal order, but some can.

In this document we will not learn:

What each event means or is typically used to represent about a workload status. For that see the SDK docs.
What event cascades are triggered by a human administrator running commands through the Juju CLI. For that see this other doc.

The graphs are screenshots of mermaid sources currently available here, until mermaid support becomes available on Discourse.

Contents:

The graph
- Legend
Other events
Notes on the setup phase
Notes on the operation phase
Notes on the teardown phase
Caveats
Deprecation notices
Event semantics and data
Appendices
- Appendix 1: scenario example
- Appendix 2: deferring an event

The graph

Legend

(start) and (end) are ‘meta’ nodes and represent the beginning and end of the lifecycle of a Charm/juju unit. All other nodes represent hooks (events) that can occur during said lifecycle.
Hard arrows represent strict temporal ordering which is enforced by the Juju state machine and respected by the Operator Framework, which mediates between the Juju controller and the Charm code.
Dotted arrows represent a 1:1 relationship between relation events, explained in more detail down in the Operation section.
The large yellow boxes represent broad phases in the lifecycle. You can read the graph as follows: when you fire up a unit, there is first a setup phase; when that is done the unit enters an operation phase; and when the unit goes away there is a sequence of teardown events. Generally speaking, this guarantees some ordering of the events: events that are unique to the teardown phase are guaranteed not to fire during the setup phase. So a stop will never be fired before a start.
The colours of the event nodes represent a logical but practically meaningless grouping of the events.
- green for leadership events
- red for storage events
- purple for relation events
- blue for generic lifecycle events

Workload and substrate-specific events

Note the [workload events] (k8s only) node in the operation phase. That represents all events meant to communicate information about the workload container on kubernetes charms. At the time of writing the only such events are:

All of these can fire at any time whatsoever during the lifecycle of a charm.

Similarly, the [pre/post]-series-upgrade (lxd only) events occur only on machine charms, and can do so at any time during the operation phase.

Notes on the setup phase

The only events that are guaranteed to always occur during Setup are start, config-changed and install. The other events only happen if the charm happens to have (peer) relations at install time (e.g. if a charm that already is related to another gets scaled up) or it has storage. Same goes for leadership events. For that reason they are styled with dashed borders.
config-changed occurs between install and start regardless of whether any leadership (or relation) event fires.
Any *-relation-created event can occur at Setup time. If X is a peer relation, X-relation-created can only occur at Setup; for non-peer relations, *-relation-created can also occur during Operation. The reason is that a peer relation cannot be created or destroyed ‘manually’ at arbitrary times: it either exists or it does not, and if it does exist, we know it from the start.
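The ordering described in these notes can be checked mechanically against a recorded sequence of event names. The following is a minimal sketch, not part of Juju or the Operator Framework: the function name check_setup_order and the example traces are made up for illustration, and a trace is assumed to be a plain list of event names as seen by a single unit.

```python
def check_setup_order(trace):
    """Return True if a config-changed occurs between the first install and the first start.

    `trace` is a hypothetical list of event names for one unit, oldest first.
    """
    try:
        install = trace.index("install")
        start = trace.index("start")
    except ValueError:
        # install or start is missing: this is not a complete setup phase.
        return False
    # install must come first, with at least one config-changed before start.
    return install < start and "config-changed" in trace[install + 1:start]


# Made-up traces: optional leadership events may appear in between.
good_trace = ["install", "leader-elected", "config-changed", "start"]
bad_trace = ["install", "start", "config-changed"]
```

A check like this is more useful in a test that inspects recorded events than in charm code, which should still avoid depending on ordering where it can.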

Notes on the operation phase

update-status is fired automatically and periodically, at a regular interval (5m by default) that is configured per model via juju model-config update-status-hook-interval.
collect-metrics is fired automatically and periodically in older juju versions, at a regular interval of 5m, AND whenever the user runs juju collect-metrics.
leader-elected and leader-settings-changed only fire on the leader unit and the non-leader unit(s) respectively, just like at startup.
There is a square of symmetries between the *-relation-[joined/departed/created/broken] events:
- Temporal ordering: an X-relation-joined cannot follow an X-relation-departed for the same relation ID. Same goes for *-relation-created and *-relation-broken, as well as *-relation-created and *-relation-changed.
- Ownership: joined/departed are unit-level events: they fire when an application has a (peer) relation and a unit joins or leaves. All units (including the newly created or leaving unit) will receive the event. created/broken are relation-level events, in that they fire when two applications become related or a relation is removed (e.g. via juju remove-relation or because an application is destroyed).
- Number: there is a 1:1 relationship between joined/departed and created/broken: when a unit joins a relation with X other units, X *-relation-joined events will be fired. When a unit leaves, all units will receive a *-relation-departed event (so X of them are fired). Same goes for created/broken when two applications are related or a relationship is broken. Find in appendix 1 a somewhat more elaborate example.
Technically speaking all events in this box are optional, but I did not style them with dashed borders to avoid clutter. If the charm shuts down immediately after start, it could happen that no operation event is fired.
An X-relation-joined event is always immediately followed by an X-relation-changed event. However, any number of *-relation-changed events can be fired at any time during operation, and they need not be preceded by a *-relation-joined event.
There are more temporal orderings than the ones displayed here; event chains can be initiated by human operation as detailed in the SDK docs and the leadership docs. For example, it is guaranteed that a leader-elected is always followed by a [settings-changed], and that if you remove the leader unit, you should get *-relation-departed and a leader-settings-changed on the remaining units (although no specific ordering can be guaranteed, cf. this bug…).
Secret events (in purple) can technically occur at any time, provided your charm either has created a secret, or observes a secret that some other charm has created. Only the owner of a secret can receive secret-rotate and secret-expire for that secret, and only an observer of a secret can receive secret-changed and secret-removed.
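One of the few orderings in this phase that can be relied upon is that a joined event is immediately followed by a changed event for the same endpoint. As a rough illustration, the sketch below checks that property on a recorded list of event names. The helper joined_followed_by_changed and the endpoint name db are hypothetical, and a trace that ends right after a joined event is treated as a violation, which may simply mean the trace was cut short.

```python
JOINED_SUFFIX = "-relation-joined"


def joined_followed_by_changed(trace):
    """Check that every X-relation-joined is immediately followed by X-relation-changed.

    `trace` is a hypothetical list of event names for one unit, oldest first.
    """
    for i, name in enumerate(trace):
        if not name.endswith(JOINED_SUFFIX):
            continue
        endpoint = name[: -len(JOINED_SUFFIX)]
        following = trace[i + 1] if i + 1 < len(trace) else None
        if following != f"{endpoint}-relation-changed":
            return False
    return True


# Made-up trace: extra changed events without a preceding joined are allowed.
ok_trace = [
    "db-relation-created",
    "db-relation-joined",
    "db-relation-changed",
    "update-status",
    "db-relation-changed",
]
broken_trace = ["db-relation-joined", "update-status"]
```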

Notes on the teardown phase

Both relation and storage events are guaranteed to fire before stop/remove if they will fire at all. They are optional, in that a departing unit (or application) might have no storage or relations.
*-relation-broken events in the Teardown phase are fired in case an application is being torn down. These events can also occur at Operation time, if the relation is removed by e.g. a charm or a controller.
The entire teardown phase is skipped if the cloud is killed. The next event the charm will see in this case would be a start event. This would happen, for example, on microk8s stop; microk8s start.

Caveats

Events can be deferred by charm code by calling Event.defer(). The event is then put in a queue of deferred events, which the operator framework flushes as soon as the next event comes in, before firing that new event in turn. In practice, this means that deferring an event can break the temporal ordering outlined in this graph: defer()ring an event twice will break the ordering guarantees described here. See Appendix 2 for a UML-y representation, and cf. this document on defer for more.
The events in the Operation phase can interleave in arbitrary ways. For this reason it’s essential that hook handlers make no assumptions about each other – each handler should check its preconditions independently and operate under the assumption that the relative ordering is totally arbitrary – except relation events, which have some partial ordering as explained above.
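To make the last point concrete, here is a toy sketch of handlers that do not assume anything about each other. It deliberately does not use the ops library: ToyCharm, its method names and the example URL are invented for illustration. Each handler records what it learned and then calls a shared reconcile step that checks every precondition itself, so the outcome is the same whichever event arrives first.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyCharm:
    """Hypothetical stand-in for a charm; not an ops.CharmBase subclass."""

    config_ready: bool = False
    db_url: Optional[str] = None
    log: List[str] = field(default_factory=list)

    def on_config_changed(self):
        self.config_ready = True
        self._reconcile()

    def on_db_relation_changed(self, url):
        self.db_url = url
        self._reconcile()

    def _reconcile(self):
        # Check every precondition here instead of trusting event order.
        if not self.config_ready:
            self.log.append("waiting for config")
            return
        if self.db_url is None:
            self.log.append("waiting for db")
            return
        self.log.append("workload configured")


# The same two events, delivered in both possible orders.
config_first = ToyCharm()
config_first.on_config_changed()
config_first.on_db_relation_changed("postgres://example")

relation_first = ToyCharm()
relation_first.on_db_relation_changed("postgres://example")
relation_first.on_config_changed()
```

Both instances end up in the same final state; only the intermediate "waiting" message differs.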

Deprecation notices

leader-deposed is a juju hook that was planned but never actually implemented. You may see a WARNING mentioning it in the juju debug-log but you can ignore it.
collect-metrics is no longer being fired in recent juju versions.

Event semantics and data

This document is only about the timing of the events; for the ‘meaning’ of the events, other sources are more appropriate; e.g. juju-events. For the data attached to an event, one should refer to the docstrings in the ops.charm.HookEvent subclass that the event you’re expecting in your handler inherits from.

Appendices

Appendix 1: scenario example

This is a representation of the relation events a deployment will receive in a simple scenario that goes as follows:

We start with two unrelated applications, applicationA and applicationB, with one unit each.
applicationA and applicationB become related via a relation called R.
applicationA is scaled up to 2 units.
applicationA is scaled down to 1 unit.
applicationA touches the R databag (e.g. during an update-status hook, or as a result of a config-changed, an action, a custom event…).
The relation R is removed.

Note that many event sequences are marked as ‘par’ for parallel, which means that the events can be dispatched to the units arbitrarily interleaved.

Appendix 2: deferring an event

jhack tail offers functionality to visualize the deferral status of events in real time.

This is the ‘normal’ way of using defer(): an event event1 comes in but we are not ready to process it; we defer() it; when event2 comes in, the operator framework will first flush the queue and fire event1, then fire event2. The ordering is preserved: event1 is consumed before event2 by the charm.

Suppose now that the charm defers event1 again; then event2 will be processed by the charm before event1 is. event1 will only be fired again once another event, event3, comes in in turn. The result is that the events are consumed in the order: 2-1-3. Beware.
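The two scenarios above can be reproduced with a small simulation. This is a toy model written for this appendix, not the operator framework's actual implementation: ToyFramework, run and the defer_counts argument (how many times the charm defers a given event before consuming it) are all hypothetical names. It assumes only what is described above, namely that deferred events are re-fired, in order, right before each newly arriving event.

```python
from collections import deque


class ToyFramework:
    """Toy model of the deferral queue; not the real ops implementation."""

    def __init__(self, defer_counts):
        # Number of times the charm will defer each event before consuming it.
        self.defer_counts = dict(defer_counts)
        self.deferred = deque()
        self.consumed = []

    def _handle(self, event):
        if self.defer_counts.get(event, 0) > 0:
            self.defer_counts[event] -= 1
            self.deferred.append(event)  # the charm called defer()
        else:
            self.consumed.append(event)  # the charm processed the event

    def emit(self, event):
        # Flush previously deferred events first, then fire the new one.
        pending = list(self.deferred)
        self.deferred.clear()
        for old in pending:
            self._handle(old)
        self._handle(event)


def run(defer_counts, events=("event1", "event2", "event3")):
    framework = ToyFramework(defer_counts)
    for event in events:
        framework.emit(event)
    return framework.consumed


deferred_once = run({"event1": 1})   # ordering preserved
deferred_twice = run({"event1": 2})  # event2 overtakes event1
```

With a single deferral the consumption order is still 1-2-3; with two deferrals it becomes 2-1-3, as described above.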

Contributors: @ppasotti

wallyworld · 23 March 2022 07:07

This is a nice write up, thank you.

Some comments:

the relation-broken hook is shown in the tear down phase of the unit but any time a relation is removed, the departed and broken hooks run, so IMO this belongs in the operation phase… [edit] I see this is mentioned later in the text, but it seems confusing to have it in the teardown phase in the diagram?

There’s mention the update-status interval is configurable; might be worth mentioning the config is per model and is changed by setting update-status-hook-interval?

FYI (in case interested), there’s a bug which is about to be fixed where leader-settings-changed runs during unit teardown on leader unit. This bug has been around for a long time.

ppasotti · 23 March 2022 07:51

Thanks for your comments! relation-broken is shown both in Teardown and in Operation, because of the reason you mention. It is my understanding that if the unit is killed, the first events it will receive are relation-broken hooks for all existing relations. For that reason it is in Teardown (as well as in Operation). By the same logic, relation-created is also in Setup and not only in Operation while obviously it will be fired any time a relation is added to a model, also during operation. Should in your opinion relation-created be only in Operation as well? Or is there an asymmetry between the two that should be reflected in the graph?

About the configuration of status-update, I’m reluctant to add too much detail on all events as this document is meant to be a reference about their timing only, but as more detailed pages are added about the individual events, I’ll hyperlink them instead.

wallyworld · 24 March 2022 10:14

Looking again, I think your representation works well - setup/teardown do run relation-created/broken, but these can also occur during the operation phase as in the diagram and so the diagram seems like a good representation of that aspect. So I think the diagrams are good and can always be refined if needed if people pose questions etc.

pedroleaoc · 7 April 2022 08:31

ppasotti · 25 May 2022 08:14

I added the collect-metrics event; previously forgotten and ignored by so many.

pedroleaoc · 14 October 2022 11:30

danielarndt · 27 January 2023 14:21

The link here is to a comment, instead of the document (I think the /3 on the end of the link just needs to be removed).

ppasotti · 27 January 2023 14:24

Should be fixed! Thanks a bunch

ppasotti · 24 November 2023 14:07

@tmihoc I think it’s time to update this title. Even though I’m quite attached to it, it’s hard to guess what’s in here when you see it come up in the navigation bar.

How about:

“Charm events lifecycle”
“Charm lifecycle”

?

tmihoc · 24 November 2023 14:22

I’d go with “charm lifecycle” because we’re speaking of the lifecycle of a charm, not of an event. I’ll update it.

ppasotti · 24 November 2023 14:25

yeah but it’s the lifecycle of a charm through the lens of the events it gets? It’s hard to put a finger on what it is. When I started we thought of this as “the juju state machine”, but it’s not quite that.

It’s more of the points in time at which juju tells the charm of a state transition that has happened in juju, to give the charm a chance to do its own state transition.

But that’s an awful title

tmihoc · 24 November 2023 14:26

I know. I don’t know what the right solution there is. FWIW: “A charm’s life” did have a much better ring to it. I’ll do whatever you say.

mthaddon · 4 January 2024 13:33

I’m not sure it is obvious that this is an omission from the graph above to everyone reading this doc - they’re reading this to understand what events fire when

I think it’d be good to explain why this can fire at any time, otherwise it sounds chaotic. It fires each time a workload container is restarted, which can happen if a pod is rescheduled by Kubernetes, or if the liveness check defined by an individual container fails. Also, we don’t mention here that this is specific to k8s charms. I think it might also be worth mentioning that other events may also be fired if the pod is rescheduled to a different k8s worker but if only an individual workload container is restarted the pebble-ready event is the only one that will fire.

bartz · 9 April 2024 19:37

@ppasotti Could you please update the diagram to include a hard arrow between upgrade-charm and config-changed? According to Event 'upgrade-charm' (and https://github.com/juju/juju/blob/3.5/doc/charms-in-action.txt#L157-L158), an upgrade-charm event is followed by a config-changed event.

ppasotti · 10 April 2024 07:08

done! thanks