Simulating events in live units

[excerpt of a community workshop @ 16 sept 2022]

When the juju agent decides it’s time to execute the charm, the dispatch script is run with a certain environment. ops translates bits and pieces of that environment into an Event object, which the charm can use to reason about why it is being executed; e.g.

  • this remote unit touched its relation data
  • my config changed
  • the user ran an action
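To make the translation a bit more concrete, here is a minimal sketch of how an event name could be derived from the environment. It assumes the agent sets JUJU_DISPATCH_PATH to something like hooks/config-changed or actions/backup; the function name event_name_from_env is made up for illustration, and this is a simplification rather than the actual ops implementation.

```python
def event_name_from_env(env: dict) -> str:
    """Derive an ops-style event name from the dispatch environment.

    Hypothetical helper: a simplified illustration, not the real ops code.
    """
    # JUJU_DISPATCH_PATH is assumed to look like "hooks/config-changed"
    # or "actions/backup".
    kind, _, name = env["JUJU_DISPATCH_PATH"].partition("/")
    name = name.replace("-", "_")
    # Actions are assumed to surface in the charm as "<name>_action" events.
    return f"{name}_action" if kind == "actions" else name
```

With this sketch, an environment containing JUJU_DISPATCH_PATH=hooks/config-changed maps to the config_changed event, which is the name a charm would observe.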

When developing or debugging a charm, it is often useful to make subtle changes to the source code and see how that affects the runtime behaviour. However, it takes time to re-pack, re-deploy, re-run the (potentially long) sequence of actions and events that led to that specific, possibly broken state of affairs.

Ideally, we would like to spin up our development environment, and either:

  • force a specific event to occur, so that we can observe the runtime code paths
  • wait for an event to occur (or for an event to break something!)

In both those cases, we’d like to be able to quickly iterate between making changes to the source and re-firing the same event over and over until the behaviour converges.

In this post I’ll explain some tooling that makes this process possible, with some limitations. We’ll dive straight into it, hands-on.

What you need: jhack revision > 82 (available on the edge channel at the time of writing).

Access to a (micro)k8s cloud, a model, and a good cuppa coffee.

Setting the stage

We’re going to work with two charms, traefik and prometheus. To set the model up:

j model-config logging-config="<root>=WARNING;unit=DEBUG"
j deploy traefik-k8s --channel edge trfk --config external_hostname=foo.bar
j deploy prometheus-k8s --channel edge prom
j relate trfk prom:ingress

Wait for the model to settle. Meanwhile we can fire up jhack tail -rl 10 to see the events come up.

Injecting a recorder script

We can visualize the standard charm runtime for a k8s charm as:

This graph is about k8s charms; the picture for machine charms is a bit different, but the same runtime flow applies.

What we want to do is insert a listener between dispatch and the charm that, on each incoming event, serializes the environment and drops it to a database, where we can access it and use it to repeat the charm execution “with the same environment”. There are plenty of caveats to this, but we’ll get there later.
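As a rough idea of what such a listener does, the sketch below snapshots an environment into a JSON-lines file and reads it back by index. The function names, the file format and the database layout are all assumptions made for illustration; the recorder that jhack installs may store things quite differently.

```python
import json
from pathlib import Path


def list_events(db: Path) -> list:
    """Return all recorded snapshots, oldest first."""
    if not db.exists():
        return []
    return [json.loads(line) for line in db.read_text().splitlines() if line]


def record_event(env: dict, db: Path) -> int:
    """Append a snapshot of the environment to the database; return its index."""
    index = len(list_events(db))
    # Store the dispatch path separately so a listing can show the event name.
    snapshot = {"event": env.get("JUJU_DISPATCH_PATH", "<unknown>"), "env": dict(env)}
    with db.open("a") as f:
        f.write(json.dumps(snapshot) + "\n")
    return index


def load_env(db: Path, index: int) -> dict:
    """Return the environment recorded for the event at `index`."""
    return list_events(db)[index]["env"]
```

In a real recorder, record_event would be called with dict(os.environ) just before handing control over to the charm, and load_env would supply the environment for a later re-run.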

If you execute jhack replay install trfk/0, that is exactly what will happen. The picture now looks like:

At this point jhack replay gives you access to three more commands:

  • list: to enumerate the events that the database has recorded so far.
  • dump: to get the raw database contents.
  • emit: to re-emit a previously recorded event, by its enumeration index.

Populating the database

If you try to jhack replay list trfk/0 at this point, you’ll likely get a message telling you that the database is empty. The recorder has just been installed, so we need to wait for something to happen.

Of course, we can speed up time (jhack ffwd) and get an update-status, but wouldn’t it be nice to just get an event right now?

Type jhack fire update-status trfk/0 and the charm is going to execute an update-status hook. How is this possible? Jhack fire takes a different approach than jhack replay: it synthesizes an environment from scratch, instead of copying it from some “real” recorded event.

To make it more interesting, try jhack fire ingress-per-unit-relation-changed. Behind the scenes, jhack fire is using juju exec, which runs a command in a unit “as if” the juju agent were running it in a live event context; in this case, jhack is using it to call dispatch.

Question: In this case, the event context requires several context variables to be set (JUJU_RELATION_ID for example), for ops to be able to determine which ingress relation has changed. How does jhack get the relation id?
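For a feel of what “synthesizing an environment” involves, here is a hedged sketch that builds a minimal relation-changed environment and renders a juju exec call around dispatch. The helper names are hypothetical, the relation id is simply passed in as a parameter (so the question above stays open), and the exact variables and command line jhack uses may differ.

```python
import shlex


def relation_changed_env(endpoint: str, relation_id: int, remote_unit: str) -> dict:
    """Build a minimal environment for a <endpoint>-relation-changed hook.

    Hypothetical helper; a real hook context carries many more variables.
    """
    return {
        "JUJU_DISPATCH_PATH": f"hooks/{endpoint}-relation-changed",
        "JUJU_RELATION": endpoint,
        # The id is assumed to be formatted as "<endpoint>:<number>".
        "JUJU_RELATION_ID": f"{endpoint}:{relation_id}",
        "JUJU_REMOTE_UNIT": remote_unit,
        "JUJU_REMOTE_APP": remote_unit.split("/")[0],
    }


def exec_command(unit: str, env: dict) -> str:
    """Render a `juju exec` invocation that calls dispatch with `env` set."""
    assignments = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"juju exec --unit {unit} -- {assignments} ./dispatch"
```

For example, exec_command("trfk/0", {"JUJU_DISPATCH_PATH": "hooks/update-status"}) renders a one-liner that sets the dispatch path and runs ./dispatch inside the unit.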

At this point, running jhack replay list trfk/0 will show the event you just fired (and maybe others that fired as you were reading this).

The context generated by jhack fire has been enriched by a number of other context vars injected by the juju agent, including for example the model name and UUID, without which charm code would misbehave.

If you run jhack replay dump trfk/0 0 you should be able to inspect that environment.

Simulating events

If you now run jhack replay emit trfk/0 0, the charm will re-run that event :tada:

The warnings you see are due to an unsolved issue involving how to escape the whitespace in the wrapped juju exec command. If you know how to fix it, by all means: https://github.com/PietroPasotti/jhack/issues/18

You can mix and match fire and replay to get where you want, but there are a couple of serious caveats to keep in mind.

State we wrap, state we don’t wrap: false positives

The execution context of a charm is a dynamic thing. Only a part of it, the env, the metadata, the config, is static (within the context of a charm execution, that is). But if your charm, say,

  • makes HTTP calls to a remote server to check if it’s up
  • checks container.can_connect()
  • reads/writes relation data
  • manipulates stored state
  • checks leadership throughout a hook which takes longer than 30 seconds to return
  • makes pebble calls to check the live status of workload resources
  • makes substrate api calls to retrieve the live status of substrate resources

… or basically anything else that is not part of the environment variables the charm is run with, then the runtime behaviour can’t be guaranteed to be the same every time you fire/replay an event.

For example, suppose that the first time you fire an event, the charm gives an error because the relation data provided by the remote end is invalid. If you re-fire this event, it might well be that the remote unit fixed the relation data in the meantime. Or that the unit lost/gained leadership. And so on…

The next step to make this tool more useful, and recorded charm runs truly reproducible, is to cache every single piece of data that the charm treats as state, including the list above. Only then can we be sure that we can exactly replicate a code path remotely – or, at that point, even locally.
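One way to picture that caching step is a thin layer that records every state read during the original run and serves the recorded value during a replay. The sketch below is purely illustrative: StateCache and its read method are invented names, and nothing like this is claimed to exist in jhack today.

```python
class StateCache:
    """Record the results of state reads once, then serve them on replay.

    Hypothetical sketch of the idea, not an existing jhack or ops feature.
    """

    def __init__(self):
        self._recorded = {}
        self.replaying = False

    def read(self, key: str, live_getter):
        if self.replaying:
            # Replay: never touch the live system, return what was seen before.
            return self._recorded[key]
        # Original run: ask the live system and remember the answer.
        value = live_getter()
        self._recorded[key] = value
        return value


# Usage: leadership changes between the original run and the replay.
live = {"is_leader": True}
cache = StateCache()
original = cache.read("is_leader", lambda: live["is_leader"])
live["is_leader"] = False
cache.replaying = True
replayed = cache.read("is_leader", lambda: live["is_leader"])
```

Here the replayed read still sees the value from the original run, even though the live state has moved on, which is the property a reproducible replay would need for every item in the list above.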

Idempotency and false negatives

If your charm code is truly idempotent, you should be able to re-run every event any number of times in a sequence without things breaking. However, charm code rarely is truly idempotent because we all make (some justified, some less) assumptions about what events will only ever run once, or will never run before/after some other event, etc…

So in practice it is often justified for a remove hook not to be idempotent, and to skip checking whether the resources are already released before attempting to release them again (raising an exception if they are). This means that by using fire/replay indiscriminately we may reveal some false negatives: bugs which will never occur in production because juju guarantees, for example, that a remove hook will never run twice on a unit.
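The difference can be shown with a toy example. Resources below is a stand-in for whatever the workload holds, and both handlers are hypothetical plain functions rather than real ops event handlers.

```python
class Resources:
    """Stand-in for whatever the workload holds (volumes, ports, ...)."""

    def __init__(self):
        self.released = False

    def release(self):
        if self.released:
            raise RuntimeError("resources already released")
        self.released = True


def on_remove(resources: Resources) -> None:
    """Non-idempotent handler: fine as long as it runs exactly once."""
    resources.release()


def on_remove_idempotent(resources: Resources) -> None:
    """Idempotent handler: safe to fire or replay any number of times."""
    if not resources.released:
        resources.release()
```

Replaying a remove event against on_remove raises on the second run, even though that sequence would not happen in production; on_remove_idempotent tolerates it at the cost of an extra check.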

Conclusions

We have seen how we can use jhack fire and jhack replay to trigger charm execution given a certain context (specifically: an event). We have seen that this approach is currently severely limited by the amount of charm ‘state’ that we can collect, serialize, and finally “mock” or force-feed to the resurrected charm instance to obtain a perfect replica of a given execution.

Happy hacking!

Thank you for the share @ppasotti . I missed this one.

We might do an americas-timezone replica soon, will keep you posted :slight_smile:

@mr-parish and any other interested out there: