Best practice for coordinating operations across peer units?

cjwatson · 1 April 2023 21:02

In the Launchpad team, we’re currently in the process of converting our legacy bare-metal deployments to use Juju charms. For various reasons, we’re starting out with machine charms that use a config key to specify the workload revision, so units will be upgraded in place except where we’re switching to a new Ubuntu series (it’s possible we’ll change this approach in future to use some kind of switch deployment strategy or Kubernetes or whatever, but for now we’re just trying to dig ourselves out from under a bunch of technical debt and this is the easiest first step).

One thing I’m concerned about is figuring out how to preserve our current zero-downtime deployments. That is, our current deployments involve restarting each of our four appservers one at a time, so the web application on https://launchpad.net/ is always up; secondarily, if one of the appserver processes fails to restart, then that aborts the deployment, so we don’t end up taking down the whole application if we somehow missed something fatal in pre-deployment testing.
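To make the desired behaviour concrete, here is a minimal sketch of "restart one unit at a time, abort on the first failure". It is not Juju-specific: `rolling_restart`, `DeploymentAborted`, and the `restart` callable are hypothetical names introduced for illustration, and `restart` stands in for whatever actually restarts an appserver and reports whether it came back up.

```python
from typing import Callable, Iterable, List


class DeploymentAborted(Exception):
    """Raised when a unit fails to restart; later units are left untouched."""


def rolling_restart(units: Iterable[str], restart: Callable[[str], bool]) -> List[str]:
    """Restart units strictly one at a time, stopping at the first failure.

    `restart` is a hypothetical callable that restarts one unit and returns
    True only if the unit came back up healthy.
    """
    restarted: List[str] = []
    for unit in units:
        if not restart(unit):
            # Abort here so the units not yet restarted keep serving traffic.
            raise DeploymentAborted(
                f"{unit} failed to restart; already restarted: {restarted}"
            )
        restarted.append(unit)
    return restarted
```

With four appservers, a failure on the second one leaves the third and fourth running the old revision, so the site stays up.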

I’ve been searching around in vain for best practices on doing this sort of thing with Juju. The documentation on peer integrations has the tantalizing remark that “Peer integration can be useful for coordinating operations against a clustered application without downtime – for example upgrading package versions or applying new configuration”, but without any specific guidance on the best way to implement that sort of thing. I’m also a little put off by what I understand to be quadratic scaling issues with peer relation hooks as the number of units in an application grows, and by the difficulty of upgrading to new versions of a charm that no longer implement a given peer relation if we later decide that the relation was a bad idea. However, the replacements for peer relations in some cases (such as application databags whose contents are set by the leader) don’t seem to really work here, since coordinating operations across units surely requires two-way communication between leader and non-leader units.
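As a rough illustration of why the communication has to go both ways, the sketch below simulates one possible protocol with plain dictionaries standing in for the application and unit databags. It does not use the real charm framework API and is not taken from any existing library; the keys `target`, `granted`, `aborted`, `revision`, and `failed`, and the functions `leader_step`, `unit_step`, and `simulate`, are all assumptions made for the example. The leader grants a restart to one unit at a time through the application databag, and each unit reports success or failure through its own databag.

```python
from typing import Callable, Dict, List, Tuple

Databag = Dict[str, str]


def leader_step(app_data: Databag, units: Dict[str, Databag]) -> None:
    """Leader logic, as it might run whenever peer relation data changes."""
    target = app_data["target"]
    if app_data.get("aborted") == target:
        return  # a unit already failed on this revision; grant nothing more
    holder = app_data.get("granted")
    if holder is not None:
        if units[holder].get("failed") == target:
            app_data["aborted"] = target
            app_data.pop("granted")
            return
        if units[holder].get("revision") != target:
            return  # the current holder has not reported back yet
    pending = [name for name in sorted(units) if units[name].get("revision") != target]
    if pending:
        app_data["granted"] = pending[0]  # exactly one unit may restart
    else:
        app_data.pop("granted", None)  # rollout finished


def unit_step(name: str, app_data: Databag, my_data: Databag,
              restart: Callable[[str, str], bool]) -> None:
    """Unit logic: restart only while holding the grant, then report back."""
    target = app_data["target"]
    if app_data.get("granted") != name or my_data.get("revision") == target:
        return
    if restart(name, target):
        my_data["revision"] = target
    else:
        my_data["failed"] = target


def simulate(names: List[str], target: str,
             restart: Callable[[str, str], bool]) -> Tuple[Databag, Dict[str, Databag]]:
    """Drive leader and units in alternating rounds until the rollout settles."""
    app_data: Databag = {"target": target}
    units: Dict[str, Databag] = {n: {"revision": "old"} for n in names}
    for _ in range(len(units) + 1):
        leader_step(app_data, units)
        for n in names:
            unit_step(n, app_data, units[n], restart)
    return app_data, units
```

The point of the sketch is that a leader-set application databag alone only covers the `granted` half; the leader still needs each unit's `revision` or `failed` report before it can safely move on or abort.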

There is of course the option of using actions to coordinate the upgrade instead, so that something outside Juju can arrange to upgrade units one by one and give up if one of them fails. I’m not especially keen on that, though, since it would give up a lot of the simplicity of being able to specify both charm and workload revisions in the bundle and do nearly everything with just a juju deploy, so if there’s some good way for me to arrange that juju config launchpad-appserver build_label=... restarts the actual appserver processes one at a time across units, that would be distinctly preferable.

Does anyone have advice on this, or pointers to charms that do a good job of this sort of thing? (I don’t mind which charm technology they use - I’m happy to translate the basic ideas from whatever framework.)

mthaddon · 3 April 2023 14:46

The Rolling Ops Library and Example Charm on Charmhub may be of interest to you (the library contained within the charm rather than the charm itself). I’m not very familiar with it, though, so let me know if you have specific questions and I can find someone to answer them for you.