Context
The OpenStack team reports that they often have hundreds of units of the same app.
These units need to coordinate among themselves to present a consistent view to some other, “external” application. For example, to select the primary node for the workload, or build a set of machine addresses within the application.
The problem
Each non-leader unit can only set its own unit databag in the peer relation.
All units, leader and non-leader alike, wake up with a peer relation event, because Juju doesn't know which databag content a given unit is interested in: the app databag or a unit databag.
In practice, non-leader units exit this hook quickly, and only the leader unit performs some computation.
However, by that time the damage is done: N units write their bits of information, and each write produces a notification that must be delivered to the other N-1 units.
Juju cannot coalesce events from disparate units, because the remote unit name is passed as an event argument in the hook invocation environment.
Linear, quadratic or cubic?
Juju has to process O(N^2) operations. However, there are also N units of hardware to run them on.
- The impact on unit agents is O(N).
- The impact on Juju databag storage is O(N), however…
- The impact on the Juju database in total could be O(N^2) to record the fact that unit X’s change was seen by unit Y.
- The impact on the Juju controller is most likely O(N^2) in terms of API requests.
Additionally:
- The mere fact that units join and depart the relation already triggers events.
- The leader progressively sees 1…N-1 remote units, leading to O(N^2) load on the leader unit, unless a local cache is implemented.
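This back-of-the-envelope accounting can be sketched as a toy model. The function name and the one-write-per-unit assumption are mine, not Juju's actual bookkeeping:

```python
def peer_writes_and_wakeups(n: int) -> tuple[int, int]:
    """Toy model: n peer units each write their unit databag once.

    Returns (writes, wakeups). Every write is a change that the other
    n - 1 units must be woken up for, hence the quadratic growth.
    """
    writes = n
    wakeups = n * (n - 1)  # each of the n writes notifies n - 1 peers
    return writes, wakeups


# 3 units stay cheap, 300 units do not:
print(peer_writes_and_wakeups(3))    # (3, 6)
print(peer_writes_and_wakeups(300))  # (300, 89700)
```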
However, things could be worse, depending on how the charm is written, especially in the case of charms following the reconciler pattern. Consider these two snippets of code:
```python
# A delta charm
if not self.unit.is_leader():
    return
...
values = {event.relation.data[unit].get("foo") for unit in event.relation.units}
```
vs.
```python
# A hypothetical reconciler charm
values = {the_relation.data[unit].get("foo") for unit in the_relation.units}
...
if self.unit.is_leader():
    do_something()
```
Note that in the latter case, each awoken unit reads the databags of its N-1 peers, imposing a total of O(N^3) operations on the system.
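The gap between the two styles can be made concrete with a rough count of databag reads. This is a hypothetical model of mine (one write per unit, every write waking every other unit), not measured Juju behaviour:

```python
def total_databag_reads(n: int, reconciler: bool) -> int:
    """Toy model: total remote-databag reads after each of n peer units
    writes its unit databag once.

    Each write wakes the other n - 1 units. A delta charm reads only on
    the leader; a reconciler charm reads on every awoken unit; either
    way, a reading unit scans all n - 1 peer databags.
    """
    writes = n
    readers_per_write = (n - 1) if reconciler else 1  # all units vs. leader only
    reads_per_reader = n - 1
    return writes * readers_per_write * reads_per_reader


print(total_databag_reads(10, reconciler=False))  # 90   -> O(N^2)
print(total_databag_reads(10, reconciler=True))   # 810  -> O(N^3)
```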
A solution
One pattern that other charming teams use is the coordinator/worker pattern, where:
- only the coordinator is related to other, “external” applications
- the coordinator typically runs at scale 1, or at a small N in the HA case
- workers are related to the coordinator by a regular relation
- workers are scaled to a large M as needed
Because a regular relation is used rather than a peer relation, the N^2 blow-up is avoided.
The cost is having to maintain two separate charms (typically).
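As a rough sketch (the function name and single-write assumption are mine), the regular relation shows why the blow-up disappears: a worker's write only wakes units on the other side of the relation, never its M-1 sibling workers.

```python
def regular_relation_wakeups(workers: int, coordinators: int = 1) -> int:
    """Toy model: each of the M workers writes its unit databag once on
    a regular relation to the coordinator app.

    A write only notifies units of the application on the other side of
    the relation, so sibling workers are never woken up.
    """
    return workers * coordinators


print(regular_relation_wakeups(300))                  # 300 wakeups, vs. 89700 on a peer relation
print(regular_relation_wakeups(300, coordinators=3))  # 900 in a small HA setup
```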
Other ideas
Another possibility we briefly discussed was for the leader unit to listen on some port, set the address in the app databag, and have other units send their updates out of band.
This could work, with several caveats: a peer relation is still needed, with its joined and departed events; the path during a Juju leadership change is unclear; and some port must be kept open, whether or not the workload provides for out-of-band communication.
Additionally, for applications that handle many remote units, whether on a regular or a peer relation, a local cache of the remote data is most likely a good idea.
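A minimal sketch of such a cache, in plain Python. The class name is hypothetical, and `fetch` stands in for however the charm actually reads a remote unit's databag:

```python
class PeerDataCache:
    """Hypothetical local cache of remote unit databags (sketch).

    On a relation-changed event, only the databag of the remote unit
    named in the event is re-read; aggregations over all peers are then
    served from the cache, costing O(1) reads per event instead of O(N).
    """

    def __init__(self) -> None:
        self._data: dict[str, dict[str, str]] = {}
        self.reads = 0  # instrumentation: how many databags were fetched

    def on_changed(self, unit: str, fetch) -> None:
        """Refresh the entry for `unit`; `fetch(unit)` stands in for
        reading that unit's databag from the relation."""
        self.reads += 1
        self._data[unit] = dict(fetch(unit))

    def values_for(self, key: str) -> set:
        """Aggregate `key` across all cached peers without re-reading."""
        return {bag[key] for bag in self._data.values() if key in bag}
```

In a real charm the cache would also need to survive hook invocations (for example via stored state) and evict entries on relation-departed; both are left out of this sketch.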
I would like to express my thanks to @chrome0, who alerted our team to the quadratic behaviour, and to @ppasotti, who showed me the coordinator/worker pattern and its use in the Observability charms.