Deltas vs holistic charming

sed-i · 6 July 2023 18:02

It seems that a frequent point of confusion in charming is reconciling the holistic and delta approaches.

Relation events give us deltas: unit/x joined or unit/x departed, and now we need to append/pop a section to/from an existing config file.
Idempotency and robustness call for holistic: for example, on config-changed after upgrade we need to iterate over all relations to construct a full config from scratch.
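A minimal sketch of the difference, assuming a made-up config format in which each related unit contributes one `unit=address` line. The function names are hypothetical and not part of any charm library.

```python
def apply_joined(config: list[str], unit: str, address: str) -> list[str]:
    """Delta style: append a line for the unit that just joined."""
    return config + [f"{unit}={address}"]


def apply_departed(config: list[str], unit: str) -> list[str]:
    """Delta style: pop every line that belongs to the departed unit."""
    return [line for line in config if not line.startswith(f"{unit}=")]


def render_all(relation_data: dict[str, str]) -> list[str]:
    """Holistic style: rebuild the whole config from everything currently known."""
    return [f"{unit}={address}" for unit, address in sorted(relation_data.items())]
```

If the same joined event is handled twice, the delta version ends up with a duplicate line, while `render_all` gives the same result no matter how many times it runs.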

Some juju events are broadly scoped:

upgrade-charm (no relation events fire so we need to rebuild the files manifest from scratch)
config-changed (we don’t know which option changed or what the previous config was)

Some juju events are narrowly scoped:

pebble-ready for container x (now we can push the manifest and add the layer to container x)
unit x departed from relation y (now we need to pop unit x from a config file)

But even the seemingly narrowly scoped events may have broad implications:

Pebble-ready means we are ready to add a pebble layer for the first time, but we need to construct it from several relations (if they exist). Since this is the first time we’re doing this, we need to iterate over all the relevant relations, etc. (I guess pebble-ready could have been considered a narrowly-scoped event if relation events were held off until after pebble-ready.)
A departing tls-certificates relation may result in modifying multiple sections of a config file and replanning the layer to revert the workload back to HTTP instead of HTTPS, as well as update dependent libs with the change of endpoint scheme.

Charming with deltas

Perhaps we would be able to better stick to the deltas approach if we had some new juju events at our disposal.

Still, it seems to fall short when faced with inter-lib dependencies: how useful is a tls-certificates-departed event, if a dependent library was already instantiated in MyCharm’s constructor without taking into account the departing unit?

Charm upgrade

On upgrade, no relation events are emitted (unless you modify relation data as part of the upgrade sequence). So if, for example, you get certificates over relation data, they would all be lost after an upgrade. And you do not want persistent storage because then you would need to figure out which ones you need to delete after an upgrade.

So on upgrade you’d need to iterate over all relations and write those certs to disk. But not quite on upgrade-charm. We need the container to be there, so this must happen on pebble-ready (the path is short from here to common exit hook and manifest driven charming).
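A rough sketch of that idea, assuming each remote unit publishes a `cert` key in its relation data. `FakeContainer` is a hypothetical stand-in for the workload container and the `/certs/...` paths are made up.

```python
from dataclasses import dataclass, field


@dataclass
class FakeContainer:
    """Stand-in for a workload container whose filesystem is empty after an upgrade."""

    files: dict[str, str] = field(default_factory=dict)

    def push(self, path: str, content: str) -> None:
        self.files[path] = content


def certs_from_relations(relations: dict[str, dict[str, str]]) -> dict[str, str]:
    """Map every cert found in relation data to the file it should live in."""
    return {
        f"/certs/{unit.replace('/', '-')}.pem": data["cert"]
        for unit, data in relations.items()
        if "cert" in data
    }


def on_pebble_ready(container: FakeContainer, relations: dict[str, dict[str, str]]) -> None:
    """Write all certs from scratch; safe to run again on any later event."""
    for path, cert in certs_from_relations(relations).items():
        container.push(path, cert)
```

Because nothing is kept from before the upgrade, there is nothing stale to clean up: the files are always derived from the current relation data.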

Charming holistically

“Holistically” may mean “take everything into consideration, all the time”.

Inevitably this introduces a common-hook/reconciler/funnel pattern. This already seems to be practiced widely.
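One possible shape for such a funnel, sketched in plain Python rather than against the real charm framework. The `Reconciler` class and its callbacks are hypothetical.

```python
from typing import Callable, Optional


class Reconciler:
    """Hypothetical funnel: every event ends up in the same reconcile step."""

    def __init__(self, read_inputs: Callable[[], dict], apply: Callable[[dict], None]):
        self._read_inputs = read_inputs  # gathers config, relation data, etc.
        self._apply = apply              # pushes the desired state to the workload
        self.applied: Optional[dict] = None  # last state that was applied

    def handle(self, event_name: str) -> bool:
        """Ignore which event fired; recompute, and apply only if something changed."""
        desired = self._read_inputs()
        if desired == self.applied:
            return False
        self._apply(desired)
        self.applied = desired
        return True
```

The event name is deliberately unused: config-changed, upgrade-charm and relation events all take the same path, which is what makes the handler idempotent.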

To further improve the holistic charming experience, it seems that “relation-departed” needs to be followed by “relation-changed”, where no stale data from the departing unit is present (juju/2026302).
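Until something like that exists, a holistic handler may have to filter out the departing unit itself. A small sketch, assuming the departing unit's data can still be visible while relation-departed is being handled; the function and the `unit=address` format are hypothetical.

```python
from typing import Optional


def render_targets(relation_units: dict[str, str], departing: Optional[str] = None) -> list[str]:
    """Rebuild the full target list, skipping the unit that is on its way out."""
    return [
        f"{unit}={address}"
        for unit, address in sorted(relation_units.items())
        if unit != departing
    ]
```

Every other event calls `render_targets(units)` with no second argument, so the special case stays confined to the departed handler.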

At ~observability we often opt for the holistic approach. How about you?

ghibourg · 6 July 2023 20:20

In the ~telco team, most charms naturally end up with a holistic approach. In most cases, a changed configuration or a new relation means updating the same configuration file and restarting the same service. We have a lot of cases in our code bases where we have a single event handler, looking like:

def _configure_workload(self, event: EventBase) -> None:
    ...

I think the status quo of initializing libraries in __init__ is a bit problematic, because a lot of libraries execute work on instantiation. Some recent examples I have used are KubernetesServicePatch and KubernetesMultusCharmLib. Those libraries will create and patch K8s resources before any hook is run. They usually do critical work for the charm, and are usually not tested at all (the code is covered by the unit tests, but no assertions are usually made on the side-effects they have).

I am thinking that maybe we should extract all the logic from the charm class, and have it only handle incoming events, then dispatch to one or more other classes handling the logic. Are any teams doing something along those lines?

marcoppenheimer · 6 July 2023 21:34

For Data Platform too it’s mostly holistic.

For example, the Kafka charm sets properties for various objects instantiated by the main charm, which get applied on every relation event through a generic config-changed handler. There, we build a diff between the config the charm has currently set (written to a file for the workload service to read) and the ‘config’ the charm expects to have given all the various relations (built in memory at runtime). If there’s a diff, we write the diff and restart the service.

Every time we get a relation event, we call the generic config-changed handler to update ALL config.
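In rough terms the handler could look like the sketch below. This is not the kafka-operator code: the function names, the `key=value` rendering and the example values are assumptions made for illustration.

```python
def render_properties(props: dict[str, str]) -> str:
    """Render the expected config (built in memory) as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items())) + "\n"


def update_config(current_file: str, expected: dict[str, str]) -> tuple[str, bool]:
    """Return (new file content, restart_needed) for the workload config."""
    rendered = render_properties(expected)
    if rendered == current_file:
        return current_file, False  # nothing changed, leave the service alone
    return rendered, True           # content differs: write it and restart


# Example: the set of super users grew because a new application related.
content, restart = update_config("", {"super.users": "User:admin;User:app1"})
```

Calling `update_config` again with the content it just returned reports that no restart is needed, so it can be invoked from every relation event.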

For example, getting the list of applications marked as super.users:

Property - https://github.com/canonical/kafka-operator/blob/e48c0b8a92d099474d7f5637fe2c49fef882c96e/src/config.py#L419-L443

Reference - https://github.com/canonical/kafka-operator/blob/e48c0b8a92d099474d7f5637fe2c49fef882c96e/src/config.py#L491-L527

Call - https://github.com/canonical/kafka-operator/blob/e48c0b8a92d099474d7f5637fe2c49fef882c96e/src/charm.py#L319-L355

Origin event - https://github.com/canonical/kafka-operator/blob/e48c0b8a92d099474d7f5637fe2c49fef882c96e/src/provider.py#L44-L51

marcoppenheimer · 6 July 2023 21:40

@ghibourg - It’s different for different products, but with our products, generally the pattern we’ve settled on is handling the ‘internal’ events in charm.py (install, start, config-changed etc), and handling any ‘external’ events in some other class file (tls.py, provider.py etc).

Originally that was to save LoC on charm.py as it was getting unwieldy.

Some of these other classes are self-contained, but they also (mostly) end up funneling logic through config-changed anyway. The main reason for that is that if you want to coordinate events between multiple different relations/applications in a particular order, you need to be able to block/defer them if they come in too soon, e.g. ‘set up TLS before giving requirers credentials. If there is a TLS relation but you haven’t got certs yet, defer requirers’. You need some central place to coordinate that, which ends up being charm.py.
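The TLS rule quoted above could be expressed as a small decision function. This is a sketch with hypothetical names (`ClusterState`, `requirer_decision`), not code from any of the Data Platform charms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterState:
    """The bits of charm state needed to decide whether requirers can be served."""

    tls_related: bool           # is there a TLS relation at all?
    certificate: Optional[str]  # None until the certs have arrived


def requirer_decision(state: ClusterState) -> str:
    """Return "defer" while TLS is expected but not ready, otherwise "grant"."""
    if state.tls_related and state.certificate is None:
        return "defer"
    return "grant"
```

Keeping the check in one place means every relation handler asks the same question instead of re-implementing the ordering rule.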

erik-lonroth · 6 July 2023 21:52

At Dwellir (https://dwellir.com) we are leaning towards holistic.

It’s not intentional, but rather a consequence of lacking the capability to know, directly from the event, which config elements are actually affected (by both relations and config-changed). Nor are there any sanity checks available at the input level of config options, which makes life hard.

E.g., it’s perfectly OK to enter any value for any config option, as juju will happily accept any values for all configs.

This all adds up to deltas being very difficult to implement.
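One way to compensate is to validate values inside the charm before acting on them. A minimal sketch, assuming two made-up options (`port` and `log-level`); the names and rules are hypothetical.

```python
ALLOWED_LOG_LEVELS = ("debug", "info", "warning", "error")


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config is usable."""
    errors = []
    port = config.get("port")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append(f"port must be an integer in 1..65535, got {port!r}")
    level = config.get("log-level")
    if level not in ALLOWED_LOG_LEVELS:
        errors.append(f"log-level must be one of {', '.join(ALLOWED_LOG_LEVELS)}, got {level!r}")
    return errors
```

A holistic handler can run this first and refuse to touch the workload while the list is non-empty.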

ppasotti · 7 July 2023 06:36

and that’s 4-0 so far

gschiano · 7 July 2023 13:00

Same at IS DevOps: we tend to use the holistic approach, except for actions that are very narrowly scoped. Basically, we have a concept of WorkloadState, which gathers all the information (config, relation, …) the workload needs or shares with the charm.

Then this “State” is translated into either file configuration, or environment variable (depending on what the workload expects) and compared with the current Workload configuration. In case of changes we then reload or restart.
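A simplified sketch of that translation step. The fields and environment variable names are invented for illustration and do not mirror the actual flask-k8s-operator code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkloadState:
    """Everything the workload needs, gathered from config and relations."""

    secret_key: str
    database_url: Optional[str] = None  # only set once a database is related

    def to_env(self) -> dict[str, str]:
        """Translate the state into the environment the workload expects."""
        env = {"APP_SECRET_KEY": self.secret_key}
        if self.database_url:
            env["APP_DATABASE_URL"] = self.database_url
        return env


def has_changed(state: WorkloadState, current_env: dict[str, str]) -> bool:
    """Compare the desired environment with what the workload currently runs with."""
    return state.to_env() != current_env
```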

It’s a bit more complex than that. We mostly deal with web applications that are always composed of a webserver and the application. Sometimes a change to the application doesn’t require a restart, or a change to the webserver can be reloaded without a restart (or any combination of these). Therefore we tend to have a dedicated Webserver service that knows how to handle its changes and a Workload service that manages its changes too. And we reconcile at the end of the processing to know if a restart or reload is needed.
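The final reconciliation could be as small as picking the strongest action that any service asked for. This sketch assumes a single combined decision and that a restart subsumes a reload; the `Action` enum and `reconcile` function are hypothetical.

```python
from enum import IntEnum


class Action(IntEnum):
    """What a service needs after its change, ordered from weakest to strongest."""

    NONE = 0
    RELOAD = 1
    RESTART = 2


def reconcile(*requested: Action) -> Action:
    """Combine the requests of all services into the one action to perform."""
    return max(requested, default=Action.NONE)
```

Each service (webserver, application) reports what its own change requires, and only the end of the hook decides what actually happens.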

But despite taking the holistic approach, we use, as a convention, 1 observer = 1 hook method, even if multiple hook methods call the same reconciliation method in the end.

Example: On the Flask K8S charm we have:

The Workload state: https://github.com/canonical/flask-k8s-operator/blob/main/src/charm_state.py
The webserver service: https://github.com/canonical/flask-k8s-operator/blob/main/src/webserver.py
The app service: https://github.com/canonical/flask-k8s-operator/blob/main/src/flask_app.py
Relations services: https://github.com/canonical/flask-k8s-operator/blob/main/src/databases.py and https://github.com/canonical/flask-k8s-operator/blob/main/src/observability.py

ca-scribner · 7 July 2023 15:30

…and in Kubeflow, too

We have done deltas and it can work well when it works. Maybe a nicely isolated relation or pebble container. But the cognitive load of how a delta affects the rest of the charm gets tough as complexity increases. We tried staying with deltas, but were burned by event-sequencing edge cases we hadn’t thought of, etc., that led to charms getting stuck. It was really hard to know that RelationA and RelationB together affect containerC, which might now affect … The dependencies were in the code, but stretched across a lot of code, so they were hard to spot.

Where I’d love to see improvements is around tooling that lets someone code like deltas, but get the holistic “recompute everything” for free. It would be good to have nicely encapsulated pieces of charm logic and then define dependencies between them.
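As a thought experiment, such dependencies could be declared as a graph and the recompute order derived from it. No such tooling is implied to exist; the piece names and the `recompute_order` helper below are hypothetical, and the sketch only uses the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical pieces of charm logic, each mapped to the pieces it depends on.
DEPENDS_ON = {
    "tls": set(),
    "database": set(),
    "config_file": {"tls", "database"},
    "pebble_layer": {"config_file"},
}


def recompute_order(changed: str) -> list[str]:
    """Return `changed` plus everything downstream of it, dependencies first."""
    affected = {changed}
    grew = True
    while grew:  # keep adding pieces that depend on something already affected
        grew = False
        for piece, deps in DEPENDS_ON.items():
            if piece not in affected and deps & affected:
                affected.add(piece)
                grew = True
    order = TopologicalSorter(DEPENDS_ON).static_order()
    return [piece for piece in order if piece in affected]
```

A change to `tls` would then recompute `tls`, `config_file` and `pebble_layer` in that order, while a change to `pebble_layer` alone touches nothing else.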