Synchronizing on-demand tasks across charm units

lucabello · 21 March 2025 11:24

You’re charming a workload on Kubernetes, and want to write an action that allows a user to perform a task.

DiscourseSequenceDiagrams01.drawio

Simple enough! Then you think: “this task is pretty heavy, it’d be great if I could scale my application and split the load among the units”.

This is slightly more complex, because Juju actions run on units, not on apps. But hey, not a problem! You can:

make the action callable only on the leader unit;
put some sort of app status in a peer relation, together with the action data (fruit=coconut^[1]);
on relation-changed, run your task either with pebble.exec or by setting a Pebble layer.
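
The three steps above can be sketched in plain Python, with dictionaries standing in for the peer relation databags. The key names (`status`, `fruit`), the `eat` command and the function names are all made up for illustration; in a real charm this logic would sit in the action and relation-changed handlers, and the databags would come from the peer relation.

```python
from typing import Optional

# Minimal sketch: plain dicts stand in for the peer relation databags.
# All key, command and function names here are illustrative assumptions.


def handle_action(is_leader: bool, app_data: dict, params: dict) -> str:
    """Action handler: only the leader may publish the task to peer data."""
    if not is_leader:
        return "failed: run this action on the leader unit"
    app_data["status"] = "busy"
    # Relation data only holds strings, so stringify the action parameters.
    app_data.update({key: str(value) for key, value in params.items()})
    return "ok"


def handle_relation_changed(app_data: dict) -> Optional[list]:
    """Every unit: return the command to run if the app has a task pending."""
    if app_data.get("status") != "busy":
        return None
    # In a charm, this command would go to pebble.exec or into a Pebble layer.
    return ["eat", app_data["fruit"]]


app_data = {}
print(handle_action(False, app_data, {"fruit": "coconut"}))  # rejected on a non-leader
print(handle_action(True, app_data, {"fruit": "coconut"}))
print(handle_relation_changed(app_data))
```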

DiscourseSequenceDiagrams02.drawio

Not too bad, right? The issue with this mechanism is that there is no guarantee about when each unit starts its task relative to the others. What if this is important? What if instead of a silly example of an action eating coconuts you had load tests that must start at the same time?

This is where it becomes a nightmare.

Some context on the real problem: `k6` and load tests

I encountered this challenging design when writing a charm for k6, a tool to easily write and run load tests. Charming it up has several advantages, such as:

you can easily load test other charms;
if you’re out of resources (load generation is quite resource-heavy), you can simply scale up the application to split the load.

However, if you’re splitting load generation for a load test, it’s critical that all the “smaller tests” start at the same time. But how do we ensure that, if we have no guarantees on event timings?

We need to start all the tests from the leader unit. Luckily, k6 can start a test in --paused mode; doing so also runs a REST API server that allows you to resume a test via an HTTP request. All we need to do is:

start all tests in --paused mode;
have the leader /resume all the tests at the same time.
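
As a rough sketch of the leader's side: the code below assumes that k6's REST API listens on port 6565 on each unit, and that a paused test is resumed with a `PATCH` to `/v1/status` (the step this post calls /resume). The host addresses and function names are hypothetical, and the exact payload is worth double-checking against the k6 REST API documentation.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

K6_API_PORT = 6565  # assumption: k6's default REST API port


def build_resume_request(host: str) -> urllib.request.Request:
    """Build the request that un-pauses the k6 test running on `host`."""
    body = {"data": {"type": "status", "id": "default", "attributes": {"paused": False}}}
    return urllib.request.Request(
        url=f"http://{host}:{K6_API_PORT}/v1/status",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def resume_all(hosts: list) -> list:
    """Send all resume requests in parallel so the tests start close together."""

    def send(host: str) -> int:
        with urllib.request.urlopen(build_resume_request(host), timeout=5) as response:
            return response.status

    with ThreadPoolExecutor(max_workers=max(len(hosts), 1)) as pool:
        return list(pool.map(send, hosts))
```

Sending the requests from a thread pool rather than in a loop keeps the gap between the first and last resume small, although it still cannot make the starts perfectly simultaneous.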

Sounds easy enough, right? Wrong.

Solving the problem: RPC in peer data

For the leader unit to correctly understand when to start a test, we need to keep some sort of state in the peer relation. Not only do we need to keep track of which units are “idle”, but also of whether a test has been launched but is still --paused (i.e., the unit is “busy”), has already started, or has just finished running.

If you, the audience, have any questions, this would be a great time to ask. Oh, you do? Awesome!^[2]

How does the leader unit tell all the units it’s time to start a test?
The leader puts the test information in peer data (app), together with a “busy” status for the app AND a “busy” status for the unit, triggering a relation-changed event that wakes up all units.

How does a unit decide what’s happening when woken up by relation-changed?
Units need to check if the status in peer data (app) is “busy”. If so, they can set and start a Pebble layer for k6 in --paused mode.
Then, units can set their status to “busy” in peer data (unit), triggering another relation-changed event (per unit).
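
A minimal sketch of that unit-side handler, again with plain dicts in place of the databags. The `status` and `script` keys, the service name and the function names are assumptions for illustration; the returned dict follows the Pebble layer format, and in a charm it would be added to the workload container before starting the service.

```python
def paused_k6_layer(script_path: str) -> dict:
    """Pebble layer (as a plain dict) that runs k6 in paused mode."""
    return {
        "summary": "k6 layer",
        "services": {
            "k6": {
                "override": "replace",
                "summary": "k6 load test, waiting to be resumed",
                "command": f"k6 run --paused {script_path}",
                "startup": "disabled",
            }
        },
    }


def on_relation_changed(app_data: dict, unit_data: dict):
    """Return the layer to apply if a test was requested, else None."""
    if app_data.get("status") != "busy" or unit_data.get("status") == "busy":
        return None  # nothing requested, or this unit is already prepared
    layer = paused_k6_layer(app_data["script"])
    # Writing to the unit databag is what wakes the leader up again.
    unit_data["status"] = "busy"
    return layer
```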

How does the leader know when to start (/resume) the tests?
As units are setting “busy” in peer data (unit), the leader is woken up every time. The app status is also “busy”, so the leader knows it’s waiting for all other units to be ready. On each relation-changed, the leader checks peer data (unit) for all units, until eventually they are all “busy”.
When that happens, the leader sends the HTTP request to all the units to start (/resume) the test.
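
The leader's check could look something like the sketch below. The unit names and the `resume` callback are placeholders: the callback stands in for whatever sends the HTTP request to a unit.

```python
def all_units_busy(unit_statuses: dict) -> bool:
    """True when at least one unit exists and every unit reported "busy"."""
    return bool(unit_statuses) and all(
        status == "busy" for status in unit_statuses.values()
    )


def leader_on_relation_changed(app_status: str, unit_statuses: dict, resume) -> bool:
    """Run on every relation-changed; resume the tests once everyone is ready."""
    if app_status != "busy" or not all_units_busy(unit_statuses):
        return False  # keep waiting for the next relation-changed
    for unit_name in sorted(unit_statuses):
        resume(unit_name)
    return True


resumed = []
leader_on_relation_changed("busy", {"k6/0": "busy", "k6/1": "idle"}, resumed.append)
leader_on_relation_changed("busy", {"k6/0": "busy", "k6/1": "busy"}, resumed.append)
print(resumed)
```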

How does a unit change its state back to “idle” after the test is done?
Instead of setting the Pebble layer to execute k6 run ..., we set it to k6 run ... && pebble notify k6.com/done. Pebble notices only wake up the unit they’re running on: this gives us a chance to set “idle” in peer data (unit), triggering relation-changed events from each unit.
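
Here is a small sketch of both halves: building the chained command and reacting to the notice. It assumes the command has to be wrapped in a shell for `&&` to work (to my understanding, Pebble runs service commands directly rather than through a shell), and that `/bin/sh` and the `pebble` binary are available in the workload container. The function names are made up.

```python
import shlex

NOTICE_KEY = "k6.com/done"  # custom notice key used in this post


def command_with_notice(k6_command: str) -> str:
    """Wrap the k6 command in a shell so `&&` can chain the notice."""
    chained = f"{k6_command} && pebble notify {NOTICE_KEY}"
    return f"/bin/sh -c {shlex.quote(chained)}"


def on_custom_notice(notice_key: str, unit_data: dict) -> bool:
    """Mark this unit "idle" when the k6 completion notice arrives."""
    if notice_key != NOTICE_KEY:
        return False  # some other notice; not ours to handle
    unit_data["status"] = "idle"
    return True


print(command_with_notice("k6 run --paused /tests/load.js"))
```

With `&&`, the notice is only sent when k6 exits successfully; chaining with `;` instead would send it regardless of the exit code.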

How does this end?
The leader is woken up as units are finishing their tests; eventually, when all of them are “idle”, the leader sets the app status back to “idle” and removes the test information from peer data (app).
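
The final step, sketched under the same assumptions as before (plain dicts for databags, hypothetical key and function names). In this sketch the check relies on the leader having marked its own unit “busy” when it published the test; otherwise every unit would already look “idle” right after the action runs.

```python
def leader_finish_if_done(app_data: dict, unit_statuses: dict) -> bool:
    """Reset the app to "idle" once every unit has gone back to "idle"."""
    if app_data.get("status") != "busy":
        return False  # no test in progress
    if any(status != "idle" for status in unit_statuses.values()):
        return False  # at least one unit is still running its test
    # Drop the test information, keeping only the status key.
    app_data.clear()
    app_data["status"] = "idle"
    return True
```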

The whole mechanism can be hard to parse from text. Here’s a diagram showing how the task synchronization works!

DiscourseSequenceDiagrams03.drawio

Conclusions

The code handling this pattern currently lives in the k6-k8s-operator charm. This is a pattern I’ve not encountered before and, much like the coordinated workers pattern we use for our HA solutions, it would be easy to generalize into a library if it turns out to be useful in multiple places.

Hope this was helpful!

[1] Coconuts are fruit; however, if you’re a fan of loose definitions, coconuts can also be considered nuts and seeds: https://www.loc.gov/everyday-mysteries/agriculture/item/is-a-coconut-a-fruit-nut-or-seed/
[2] I didn’t know how to introduce the next paragraphs, but I’m the writer here so I can just make up an audience, okay? Give me a break.

dimaqq · 24 March 2025 11:38

Off-topic: would you consider re-uploading the figures with background colour? The way it is now, it’s hard to understand the diagrams in dark mode.
