Charming without any observers

sed-i · 7 February 2024 06:23

Juju is all about events. Juju commands such as config and relate subscribe a given charm to some Juju events. And that has nothing to do with ops.

Charms are short-lived processes. A charm is “woken up” (actually, started) by a Juju event. Charms themselves don’t really subscribe to anything: a framework.observe call never leaves the charm; it is Juju (and perhaps to some degree pebble, when pebble “notices” become available?) that really “subscribes” and “wakes up” everything.

It is the world that has been pulled over your eyes to blind you from the truth. (The Matrix)

When ops-based charms enter their __init__, it’s already after ops has parsed some envvars and figured out the context. The familiar framework.observe(..., ...) calls in charm code are just an abstraction for switch-case. For trivial charms, it’s brilliant; for what we’re trying to accomplish, it doesn’t seem to scale:

Hooks are processed in the order of the observe statements, but it’s quite implicit and bears unexpected behavior when someone refactors and changes some previously assumed order.
Emission of custom events often introduces unexpected ordering issues.
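To make the ordering point concrete, here is a toy observer registry. It is a minimal sketch and not the ops implementation: `ToyFramework`, `emit`, and the handler names are invented for illustration. It only shows that when dispatch follows registration order, reordering two `observe` lines silently changes runtime behavior.

```python
# Toy registry (NOT ops): handlers fire in the order they were registered.
class ToyFramework:
    def __init__(self):
        self._observers = []  # (event_name, handler) pairs, in registration order

    def observe(self, event_name, handler):
        self._observers.append((event_name, handler))

    def emit(self, event_name):
        for name, handler in self._observers:
            if name == event_name:
                handler()


calls = []
fw = ToyFramework()
# Swapping the next two lines would swap the execution order as well.
fw.observe("config-changed", lambda: calls.append("write_config"))
fw.observe("config-changed", lambda: calls.append("restart_service"))
fw.emit("config-changed")
```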

It seems that charming could be greatly simplified if charm.py could be as simple as:

from ops.juju_context import (
    context_from_environ,
    PebbleReadyContext,
    RelationChangedContext,
)


def update_cert(context):
    def _update_sys():
        context.container("workload_name").exec(
            ["update-ca-certificates", "--fresh"]
        ).wait()

    if context.is_a(PebbleReadyContext.kind("workload_name")):
        _update_sys()

    elif context.is_a(RelationChangedContext.kind("tls-certificates")):
        context.container("workload_name").push(
            "/some/where.cert",
            context.relation.data["foo"],
        )
        _update_sys()

def main():
    context = context_from_environ()
    if context.is_in([PebbleReadyContext, RelationChangedContext]):
        update_cert(context)


if __name__ == "__main__":
    main()

This way,

Execution order is explicit.
Hook context can be easily propagated downstream, so hook filtering (“observe”) can be done both upstream and downstream.
Writing idempotent charms is more obvious.
(Bonus) Clearer separation between Juju and ops.

If you find this intriguing, please join the conversation here or there.

sed-i · 7 February 2024 06:46

I imagine @wallyworld’s “event groups” concept is functionally equivalent.

ppasotti · 7 February 2024 08:04

I agree with you on the downsides of the observer pattern, but I think there are benefits to having a charm object. I can see that the logic you get if you write charmless charms like you propose here is closer to the ‘truth’ of what’s happening, but I still think of the charm as a useful abstraction.

It’s a guiding metaphor, if you like: if we get rid of the Charm type, whose databags are we working with?

So I’d rather see something like:

# in ops.charm
class CharmBase:
    def on_install(self, _): pass
    def on_collect_unit_status(self, _): pass
    def on_start(self, _): pass
    def on_relation_broken(self, _): pass
    [...] # all possible events

# user code: charm.py
from ops import CharmBase
class MyCharm(CharmBase):
    def __init__(self, state):
        self.foo = state.config.get('foo')

    def on_relation_broken(self, e):
        if e.relation.name == "database":
            ...

I’d be in favor, as you suggest, of getting rid of custom events and replacing them with a callback model. I think the complexity we have because of custom events isn’t worth the API we get for it.
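One way the method-per-event model could be driven is a small dispatcher that maps a hook name onto an `on_<event>` method. This is only a sketch under assumptions: the `CharmBase` below is a toy stand-in rather than the real `ops.CharmBase`, `dispatch` is a hypothetical helper, and relation-specific hook names are left out for brevity.

```python
class CharmBase:
    """Toy stand-in for the proposed base class; not the real ops.CharmBase."""

    def on_install(self, event):
        pass

    def on_start(self, event):
        pass

    def on_relation_broken(self, event):
        pass


def dispatch(charm, hook_name, event=None):
    """Call the on_<hook> method for a hook name such as 'relation-broken'.

    Returns True if a matching method existed, False otherwise.
    """
    method = getattr(charm, "on_" + hook_name.replace("-", "_"), None)
    if method is None:
        return False
    method(event)
    return True


class MyCharm(CharmBase):
    def __init__(self):
        self.log = []

    def on_start(self, event):
        self.log.append("started")


charm = MyCharm()
handled = dispatch(charm, "start")
ignored = dispatch(charm, "unknown-hook")
```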

Also, as you noticed, I snuck in there a state construct, which imho would be great to have as a first-class citizen in ops. It makes reasoning about charming much easier.

carlcsaposs · 7 February 2024 08:22

Here’s a related prototype proof-of-concept of what ops (or something similar) could look like without framework.observe: https://github.com/carlcsaposs-canonical/charm

And the mysql-router charm, slightly simplified, using it: https://github.com/carlcsaposs-canonical/ops-api-demo-mysql-router/blob/charm-api/src/main.py (compare to main branch for ops usage)

dylanstathis · 7 February 2024 10:26

I think this could be done without really needing any change to ops, leaving it as just an option. For example:

class MyCharm(CharmBase):
  def __init__(self, *args, **kwargs):
    ...
    # somelib.get_event is hypothetical; this needs `import os` at the top of the file
    event_type = somelib.get_event(os.environ)
    if event_type == "config_changed":
      self.write_config()

  def write_config(self):
    enable_tls = self.config.get("enable_tls", False)
    ...

The get_event function could be added to ops as well, and that would really be the only change needed, without getting rid of observers.
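A minimal sketch of what such a `get_event` helper might look like. It assumes, to my understanding of how Juju dispatches hooks, that the `JUJU_DISPATCH_PATH` environment variable holds a value like `hooks/config-changed`. The function itself is hypothetical and not part of ops, and it ignores actions and other special cases.

```python
import os


def get_event(environ=None):
    """Return the current hook name with underscores, e.g. 'config_changed'.

    Hypothetical helper: reads JUJU_DISPATCH_PATH (assumed to look like
    'hooks/<hook-name>') and returns '' when the variable is not set.
    """
    environ = os.environ if environ is None else environ
    dispatch_path = environ.get("JUJU_DISPATCH_PATH", "")
    hook_name = dispatch_path.rsplit("/", 1)[-1]
    return hook_name.replace("-", "_")


# Passing a plain dict keeps the example runnable outside of a Juju hook.
event = get_event({"JUJU_DISPATCH_PATH": "hooks/config-changed"})
```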

ppasotti · 7 February 2024 10:48

That would do for a quick and dirty workaround or experiment, but what we’re discussing here is a chance to rewrite all of our charms with a new/different/better framework. It’s not that we can’t do it right now with the tools at our disposal; we’re wondering whether we need different tools that promote this pattern instead of a different one.

dylanstathis · 7 February 2024 11:00

Do we want a new framework? In my experience, everything from ops that isn’t self.framework.observe is great. And a new framework would make all new charms incompatible with existing libraries.

ppasotti · 7 February 2024 11:03

I didn’t mean scratching ops entirely, but changing its API (which we can and should do in a backwards-compatible manner et cetera…)

jameinel · 7 February 2024 16:39

I’m very open to exploring the space, and finding patterns that scale well for people. I find that a bespoke set of if/else constructs scales poorly and is hard to manage. Avoiding that is certainly the goal of things like registries, dispatching, and even the Python 3.10 syntax for structural pattern matching.
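As a point of comparison with hand-written if/else chains, a registry can be as small as a dict plus a decorator. The sketch below is illustrative only: `HANDLERS`, `handles`, `run`, and the hook names are invented and do not correspond to any ops API.

```python
HANDLERS = {}  # hook name -> list of handler functions, in registration order


def handles(*hook_names):
    """Decorator that registers a function for one or more hook names."""
    def register(func):
        for name in hook_names:
            HANDLERS.setdefault(name, []).append(func)
        return func
    return register


@handles("workload-pebble-ready", "tls-certificates-relation-changed")
def update_cert(hook):
    return f"update_cert({hook})"


def run(hook):
    """Invoke every handler registered for the hook and collect the results."""
    return [handler(hook) for handler in HANDLERS.get(hook, [])]


results = run("workload-pebble-ready")
```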

Kubernetes has the concept of the “reconciler” pattern, where it just invokes your operator on an appropriate cadence, and that goes and figures out what it should do. My understanding from others is that the pattern is quite poor, and leads to large if/else blocks that are hard to manage and maintain.

There is also very much a possibility that the “charm dies between every invocation” model will change in the future. Certainly that was something that we wanted to explore when designing the library. There is no need for the charm to die; there is just not a good structure for it to stay resident at the moment. We certainly would like to support at least a lookahead / get-the-next-event model, so that you could keep the process resident for now, especially to handle the “I have 50 units that are joining the relation” case without having to spawn a new process each time.

jameinel · 7 February 2024 16:48

It is entirely plausible that custom events don’t play nicely, as those had not been explored as completely. (eg, should they trigger immediately, or should they be triggered only once you get back to main, closer to a deferred event, so that you only have 1 event at any given time that is being processed.)

Fundamentally, custom events today are just callbacks: named and registered multi-way callbacks rather than a single func (and explicitly with no return values from the callback).

ca-scribner · 7 February 2024 18:03

I don’t know if a simple inheritance model quite gets us there. If we are subclassing CharmBase and overriding methods, then there’s always just one of everything. That would break patterns like how the KubernetesServicePatch library adds additional subscriptions to the events it needs (although maybe this pattern could be replaced with something else).

ca-scribner · 7 February 2024 18:07

I’m also not sure if it’s the system of how events enter a charm that is the issue, but rather the breadth of events and how each invocation is independent. Because we have pretty fine-grained events, there’s a temptation to do fine-grained work (when on pebble-ready-A do things for containerA, on pebble-ready-B do things for containerB, …). This can work, but it is hard and error-prone because there are so many paths through the program (and, sometimes, the ops abstractions mean there are paths you don’t expect). Often what I really want is something more like if pebble-ready-A AND pebble-ready-B: do_everything, but the framework doesn’t help me with that. My guess is, in the k8s world at least, we could merge most of those core lifecycle events together and things would feel easier.

Regarding event independence, to me it is a problem that the framework doesn’t protect me against this trap of doing atomic work during events:

handle pebble-ready-A: we see that a config option is set incorrectly, so we can’t do what we want with containerA; we set Blocked("config1 is invalid, please change it") to alert the user and exit
handle relation-changed: everything is good with this relation’s data, so we set Active

It is a huge trap that status is a single value, so event 2 can clobber event 1’s status. There are ways to code around this (especially with recently improved statuses), but it is too easy to make this mistake. imo, the framework should either stop me or really strongly discourage me from mistakes like the one above. And this is made worse by our fine-grained events, because we have that many more chances for this mistake to bite us.

I’m not sure which examples @jameinel is thinking of wrt kubernetes operator patterns, but the ones I know tend to reconcile everything on each wakeup rather than do small bits of work and iiuc it is to address these issues. Reconciling everything is often computationally inefficient, but it is simpler to implement, nicely addresses the current issue of how an unrelated part of a charm might be blocked, and is often good enough.
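A rough sketch of the reconcile-everything idea applied to the status example above. Every wakeup recomputes the status from the whole observed state, so a later event cannot overwrite an earlier Blocked with Active. The `reconcile` function, the state keys, and the tuple return value are all invented for illustration; none of this is ops API.

```python
def reconcile(state):
    """Derive one status from the full state, regardless of which hook fired.

    `state` is a plain dict standing in for whatever the charm can observe.
    Returns a (status_name, message) tuple.
    """
    problems = []
    if not state.get("config_valid", False):
        problems.append("config1 is invalid, please change it")
    if not state.get("relation_ready", False):
        problems.append("waiting for relation data")
    if problems:
        # Report the first problem; a real charm might rank them instead.
        return ("blocked", problems[0])
    return ("active", "")


world = {"config_valid": False, "relation_ready": True}
status_after_pebble_ready = reconcile(world)
status_after_relation_changed = reconcile(world)  # same answer, no clobbering
```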

sed-i · 7 February 2024 18:13

Agreed, @ca-scribner. But I imagine in the near future we will have:

Juju-level means to reject a config option (lp/1969521).
“Advanced” pebble notices (wake up a charm via whatever).

Together with a standalone Context object it would be a new world to discover.

As for statuses, have you seen the updated summary at the bottom of this post?

ghibourg · 7 February 2024 18:35

I like seeing the possibility of charms not dying between each event, as in a lot of cases, we first need to read the state of the world. This can be relatively expensive in some complex charms, and encourages the common handler pattern. Keeping the state in memory could have some issues however, and developers would need to ensure they listen to all the right events to be kept up to date. We would also need a way to rebuild that state from scratch anyway, particularly on K8s where the pod could be rescheduled on another node.

I do not dislike the idea of the context, and would probably want to align with State from scenario. For most charms (with some small exceptions for machine charms that need to target <22.04), we would be able to use structural pattern matching on that to check a lot of preconditions at once.

I also think that framework.observe is problematic, particularly because it is a leaky abstraction that makes it harder for beginners to grok the flow.

sed-i · 7 February 2024 21:38

The following seems to me very similar in nature and volume to the sequence of observe calls we currently have in __init__:

    context = context_from_environ()
    if context.is_in([PebbleReadyContext, RelationChangedContext]):
        update_cert(context)
    elif context.is_in([Another, YetAnother]):
        something_else(context)
    elif ...

To make it even more similar to what we know, the hypothetical API could be:

def main():
    context = context_from_environ()
    context.if_in_then([PebbleReadyContext, RelationChangedContext], update_cert)
    context.if_in_then([Another, YetAnother], something_else)
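A possible shape for such a context object, as a sketch only. The `Context` class below is hypothetical, and it uses plain hook-name strings instead of the `PebbleReadyContext`-style classes above to keep the example short.

```python
class Context:
    """Hypothetical hook context; only carries the hook name in this sketch."""

    def __init__(self, hook):
        self.hook = hook

    def is_in(self, hooks):
        return self.hook in hooks

    def if_in_then(self, hooks, handler):
        """Run handler(context) if the current hook is listed; report whether it ran."""
        if self.is_in(hooks):
            handler(self)
            return True
        return False


ran = []


def update_cert(context):
    ran.append(("update_cert", context.hook))


def something_else(context):
    ran.append(("something_else", context.hook))


context = Context("workload-pebble-ready")
# Calls are evaluated top to bottom, so the execution order stays explicit.
context.if_in_then(["workload-pebble-ready", "tls-certificates-relation-changed"], update_cert)
context.if_in_then(["install", "start"], something_else)
```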

benhoyt · 7 February 2024 23:46

I like the enthusiasm behind this kind of experimentation, and in fact, there’s nothing stopping charmers from doing a proof-of-concept in exactly this style today. It’s possible to write charms in bash, Rust, Go, Python with Ops, or Python without Ops.

But – to quote from another movie – “your charmers were so preoccupied with whether or not they could, they didn’t stop to think if they should” (Jurassic Park).

As a team and company it’s beneficial to use common structures and patterns, and that’s where Ops comes in (and Reactive before that). There’s still a lot of flexibility in how you structure Ops-based charms, of course, and that’s where additional structure such as IS DevOps Managing Charm Complexity can be helpful (there are other reasonable approaches too). We don’t want to be asking people to rewrite their charms in New Framework X when some teams are just catching up to rewriting their charms in Ops. As a team we want some consistency between charms, so we need to settle somewhere.

Again, I’m happy to see proofs-of-concept and experimentation, proving these ideas out with semi-realistic charms – but there’s going to be a very high bar for overhauling the Ops API and rewriting all of our charms with a new framework (even if in the abstract it’s better).

I think it’s probably more useful to spend time on incremental improvements to Ops, trying to address the pain points in backwards-compatible ways. That’s far less fun! But also a lot more productive for our existing charm teams. As one simple/silly example, we might consider extending Framework.observe to allow observing a tuple of events (similar to how isinstance allows a tuple of classes):

framework.observe((self.on.foo_relation_changed, self.on.bar_pebble_ready),
                  self._update_cert)

That said, I’m not sure that suggestion is a major improvement over just two calls to Framework.observe. Just throwing one idea out there. We’re already working on other ideas, like various improvements to defer, and addressing charm initialisation issues. We’re happy to work with people on other incremental improvements, such as the proposed centralisation of loading Juju environment variables.

A side note about custom Pebble Notices, mentioned in this thread: they have been implemented, and are available now in Juju 3.4-rc2 (the stable version should be coming out soon).

benhoyt · 8 February 2024 22:16

Just a brief follow-up after Leon clarified something to me: his intention is not about “overthrowing Ops” but to make it easier to experiment with new patterns. That seems very reasonable to me, and once again, we’re open to working with folks on incremental refactoring and improvements to Ops that make such experimentation easier.

ppasotti · 9 February 2024 07:35