Agent operations must be resilient

timClicks · 3 March 2020 13:29

Part of https://discourse.jujucharms.com/t/read-before-contributing/47, an opinionated guide by William Reade

Agent operations must be resilient

So, there are hard constraints we must not violate, but the user-input
problem is well bounded – client, server, DB, done. Everything else
that juju does is happening in the real world, where networks go down
and processes die and disks fill up and senior admins drop hot coffee
down server vents… i.e. where

EVERYTHING FAILS

and we somehow have to make a million distinct things happen in the
right order to run, e.g., a huge openstack deployment on varying
hardware.

So, beyond the initial write of user intent to the DB, all the code
that’s responsible for massaging reality into compliance has to be
prepared to retry, again and again, forever if necessary. This sounds
hard, and it is really hard to write components that retry all
possible failures, and it’s a bloody nightmare to test them properly.
So: don’t do that. It’s a significant responsibility, and mixing it into
every type we write, as we happen to remember or not, is a prescription
for failure.

Thankfully, you actually don’t have to worry about this in the usual
course of development; the only things you have to do are:

(make sure your task is resumable/idempotent, which it always has to
be whatever else you do, because EVERYTHING FAILS)
encapsulate your task in a type that implements worker.Worker, and
make a point of failing out immediately at the slightest provocation
run it inside some agent’s dependency.Engine, which will restart the
task when it fails; and relax in the knowledge that the engine will
also be restarted if it fails, or at least that it’s not your
problem if it isn’t, and it will be addressed elsewhere

…and they’re completely separable tasks, so you can address them one
by one and move forward with confidence in each part. Remember, though,
if you eschew any above step… you will have trouble. This is
because, in case you forgot,

EVERYTHING FAILS

and sooner or later your one-shot thing will be the one to fail, and you
will find that, whoops, having things happen reliably and in a sensible
order isn’t quite as simple as it sounds; and if you have any sense you
will get your one-shots under proper management and devote your effort
to making your particular task idempotent or at least resumable, rather
than re-re-reinventing the reliability wheel and coming up with a
charming new variation on the pentagon, which is a waste of everyone’s
time.

Agent operations must be *resilient*

Agent operations must be resilient

Agent operations must be resilient