Performance limitations of peer relations and a solution

Context

The OpenStack team reports that they often have hundreds of units of the same app.

These units need to coordinate among themselves to present a consistent view to some other, “external” application: for example, to select the primary node for the workload, or to build the set of machine addresses within the application.

The problem

Each non-leader unit can only write to its own unit databag in the peer relation.

All units, leader and non-leader alike, wake up on every peer relation event, because Juju doesn’t know which databag content a given unit is interested in: the app databag or a unit databag.

In reality, the non-leader units exit this hook quickly, and only the leader unit performs any real computation.

However, by that time the damage is done: N units each write their bit of information, and every one of those writes must be delivered as a notification to each of the other N-1 units.

Juju cannot coalesce events from disparate units, because the remote unit name is passed as an event argument in the hook invocation environment.
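For reference, this is visible in the hook environment; a minimal sketch (the unit name is illustrative):

import os

# In a relation-changed hook, Juju identifies the one remote unit whose
# data changed, e.g. JUJU_REMOTE_UNIT=myapp/42 (illustrative value).
# Because every delivery is pinned to a single remote unit, changes from
# two different units cannot be folded into one event.
remote_unit = os.environ.get("JUJU_REMOTE_UNIT")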

Linear, quadratic or cubic?

Juju has to process O(N^2) operations in total. However, there are N units of hardware to run them on; see the back-of-the-envelope sketch after the lists below.

  • The impact on unit agents is O(N).
  • The impact on Juju databag storage is O(N), however…
  • The impact on the Juju database in total could be O(N^2) to record the fact that unit X’s change was seen by unit Y.
  • The impact on the Juju controller is most likely O(N^2) in terms of API requests.

Additionally:

  • The mere fact that units join and depart the relation already triggers events.
  • The leader progressively sees 1…N-1 remote units, leading to O(N^2) load on the leader unit, unless a local cache is implemented.
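As a rough back-of-the-envelope sketch of these growth rates (assuming, for illustration only, that each unit writes its databag once and that every write wakes each of the other N-1 units):

# Rough fan-out arithmetic for a peer relation with N units.
for n in (10, 100, 300):
    writes = n                        # each unit writes its own databag once
    notifications = n * (n - 1)       # every write wakes each of the other units
    per_unit_events = n - 1           # events any single unit sees
    leader_reads = (n - 1) * (n - 1)  # leader re-reads N-1 databags per event, absent a cache
    print(n, writes, notifications, per_unit_events, leader_reads)

At N=300 that is already roughly 90,000 notifications, which is where the quadratic term starts to hurt.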

However, things could be worse, depending on how the charm is written, especially in the case of charms following the reconciler pattern. Consider these two snippets of code:

# A delta charm: non-leader units return before touching peer data
if not self.unit.is_leader():
    return
...
values = {event.relation.data[unit].get("foo") for unit in event.relation.units}

vs.

# A hypothetical reconciler charm: every awoken unit reads every peer databag
values = {the_relation.data[unit].get("foo") for unit in the_relation.units}
...
if self.unit.is_leader():
    do_something()

Note that in the latter case, each awoken unit reads the databags of its N-1 peers; with roughly N^2 event deliveries in play, that imposes a total of O(N^3) operations on the system.
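One way to keep the reconciler structure without the cubic blow-up is to move the peer-databag scan behind the leadership check. A minimal sketch, assuming the ops framework; the “peers” endpoint name and the _publish and _reconcile_local_state helpers are hypothetical:

def _reconcile(self, event):
    # Only the leader pays the O(N) cost of scanning the peer databags;
    # non-leaders reconcile just their own local state.
    if self.unit.is_leader():
        relation = self.model.get_relation("peers")
        values = {relation.data[unit].get("foo") for unit in relation.units}
        self._publish(values)  # hypothetical: write the aggregate to the app databag
    self._reconcile_local_state()  # hypothetical: per-unit work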

A solution

One pattern that other charming teams use is the coordinator/worker pattern, where:

  • only the coordinator is related to other, “external” applications
  • the coordinator is typically run at scale 1, or at a small N in the HA case
  • workers are related to the coordinator by a regular relation
  • workers are scaled out to a large M as needed

Because a regular relation is used rather than a peer relation, the N^2 blow-up is avoided.

The cost is having to maintain two separate charms (typically).
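A minimal sketch of the coordinator side, assuming the ops framework and hypothetical “workers” and “external” endpoints; the point is that a change in one worker wakes only the coordinator, never the other M-1 workers:

import ops

class CoordinatorCharm(ops.CharmBase):
    def __init__(self, framework):
        super().__init__(framework)
        # Workers relate to the coordinator over a regular relation, so a
        # change in one worker unit only notifies the coordinator.
        framework.observe(self.on["workers"].relation_changed, self._on_workers_changed)

    def _on_workers_changed(self, event):
        if not self.unit.is_leader():
            return
        # Collect one value per worker unit from the regular relation.
        relation = event.relation
        addresses = sorted(
            addr
            for unit in relation.units
            if (addr := relation.data[unit].get("address"))
        )
        # Present one consistent view to the "external" application(s).
        for external in self.model.relations["external"]:
            external.data[self.app]["addresses"] = ",".join(addresses)

if __name__ == "__main__":
    ops.main(CoordinatorCharm)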

Other ideas

Another possibility we briefly discussed was for the leader unit to listen on some port, set the address in the app databag, and have other units send their updates out of band.

This could work, with some caveats: a peer relation is still needed (with its joined and departed events), the behaviour during a Juju leadership change is unclear, and some port has to be open, whether or not the workload provides for out-of-band communication.
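The advertising half of that idea is straightforward; a minimal sketch, assuming the ops framework, a hypothetical “peers” endpoint and a made-up port number:

# The leader advertises an out-of-band endpoint via the app databag.
if self.unit.is_leader():
    peers = self.model.get_relation("peers")
    binding = self.model.get_binding(peers)
    peers.data[self.app]["leader-endpoint"] = f"{binding.network.bind_address}:7777"

The harder parts, as noted above, are the leadership handover and whatever transport the other units use to reach that endpoint.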

Additionally, for applications that handle many remote units, whether on a regular or a peer relation, a local cache of the remote data is most likely a good idea.
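A minimal sketch of such a cache, assuming the ops framework, its StoredState, and a hypothetical “peers” endpoint; on each relation-changed event it refreshes only the databag of the remote unit named in the event, rather than re-reading all N-1 peers:

import ops

class CachedPeersCharm(ops.CharmBase):
    _stored = ops.StoredState()  # per-unit local state

    def __init__(self, framework):
        super().__init__(framework)
        self._stored.set_default(peer_foo={})  # remote unit name -> last seen "foo"
        framework.observe(self.on["peers"].relation_changed, self._on_peers_changed)
        framework.observe(self.on["peers"].relation_departed, self._on_peers_departed)

    def _on_peers_changed(self, event):
        if event.unit is None:
            return  # the app databag changed; nothing per-unit to cache
        # Read only the databag of the unit that triggered this event.
        self._stored.peer_foo[event.unit.name] = event.relation.data[event.unit].get("foo")

    def _on_peers_departed(self, event):
        self._stored.peer_foo.pop(event.departing_unit.name, None)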


I would like to express my thanks to @chrome0, who alerted our team to the quadratic behaviour, and to @ppasotti, who showed me the coordinator/worker pattern and its use in the Observability charms.


Thank you for taking a look at this.

One additional wrinkle I’d like to point out, which can hugely affect runtime, is that in larger deployments we’re typically dealing with baremetal, and on those nodes there is usually not just one charm unit but several (e.g. observability, HA, timekeeping and network management applications), all colocated on a hardware node. Now, the Juju uniter will serialize hook runs across these, which means the runtime per node depends on not just one but several charms’ behaviour, and all those runtimes naturally add up.


Even more ideas

In a general sense, an application that needs to coordinate hundreds of units could use a database.

Thanks to @jnsgruk for pointing this out.

Thanks for this post. When we rewrote relations for Juju 4.0, I wondered about the purpose of peer relations, and actually thought that they might be used as a “cheap persistence layer for configuration” (replacing etcd, ZooKeeper or Redis in most distributed systems I used to work with).

And I wondered whether they scale. Apparently not :smiley:

What I wonder now is: “Is this another use case for peer relations?”
