I think most cases are clear, but there are some cases where it’s not possible to tell whether manual intervention is needed. Say that you can’t proceed because you need a relation - is that blocked (human needs to run “integrate”) or waiting (Juju needs to set up the integration that has already been requested)?
This topic came up when @benhoyt and I were discussing early analysis into existing usage of defer(). I had suggested that going into error state (perhaps in a more graceful way than ops currently provides for, e.g. no traceback in debug-log) would alleviate some of the need for “juju retry” (he convinced me otherwise, mostly with the same arguments that are in this thread).
Although either I misunderstood him then or his stance has softened over the last couple of weeks, since:
I do feel this is the most compelling case for me (one that @sed-i also mentions). With “purely” Juju events, as far as I understand, you should be able to rely on the snapshot of the model being available. That’s not really the case with Pebble - there are plenty of reasons why your Pebble call might temporarily fail, so you need code like this everywhere (not just pebble-ready, I think):
    try:
        container.replan()  # or exec, or add_layer, etc
    except ops.pebble.ConnectionError:  # probably also ChangeError, etc
        logger.warning(...)
        event.defer()
Or if you’re willing to have the hook take longer:
    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(ops.pebble.ConnectionError),
        wait=tenacity.wait_fixed(2) + tenacity.wait_random(0, 5),
        stop=tenacity.stop_after_delay(10),
        reraise=True,
    )
    def pebble_thing():
        ...

    try:
        pebble_thing()
    except ops.pebble.ConnectionError:
        # WaitingStatus? Hard to know if this fixes itself.
        self.unit.status = ops.BlockedStatus("Can't talk to Pebble")
        return
(Even then that might need to defer(), since there’s no (jhack-less) way to say “run that event again, Pebble’s ok now”. The status is also awkward to set: it’s a “pull” status in that you can call can_connect(), but that introduces a race where your main handler failed and your collect-status handler succeeded, so really you have to get that failure to your collect-status handler some other way.)
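To make "get that failure to your collect status handler some other way" a bit more concrete, here's a rough, framework-free sketch. Everything in it is a hypothetical stand-in so it runs without ops: `PebbleConnectionError` plays the role of ops.pebble.ConnectionError, `FlakyContainer` the role of the container, and `CharmSketch` the role of the charm class. It assumes the status handler runs on the same charm instance in the same hook as the main handler, so a plain attribute is enough; surviving across hooks would need something persistent instead.

```python
class PebbleConnectionError(Exception):
    """Stand-in for ops.pebble.ConnectionError (assumption, so this runs without ops)."""


class FlakyContainer:
    """Hypothetical container whose replan() fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures

    def replan(self):
        if self.failures > 0:
            self.failures -= 1
            raise PebbleConnectionError("socket not available")


class CharmSketch:
    """Outline only; in a real charm these methods would be event observers."""

    def __init__(self, container):
        self.container = container
        self.pebble_error = None  # written by the main handler, read by the status handler

    def on_config_changed(self):
        try:
            self.container.replan()
        except PebbleConnectionError as e:
            self.pebble_error = str(e)
            return False  # a real handler would probably call event.defer() here
        self.pebble_error = None
        return True

    def on_collect_status(self):
        # Report what the main handler actually saw rather than probing Pebble
        # again, since a fresh probe could succeed after the handler failed.
        if self.pebble_error is not None:
            return ("waiting", f"Pebble unavailable: {self.pebble_error}")
        return ("active", "")
```

The point is only that the status comes from the recorded failure, not from a second can_connect()-style check, which is what avoids the race.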
It does still seem cleaner to me in this case to let the hook exit non-zero and have Juju retry the event and show the state as in error (it is in error - something is broken with Pebble!).
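A rough sketch of what "retry a little, then let the hook fail" could look like, using only the standard library rather than tenacity. `PebbleConnectionError` and `call_with_retry` are hypothetical names for illustration (the former stands in for ops.pebble.ConnectionError), and the timing defaults just mirror the numbers in the tenacity example above; this is not a drop-in equivalent of tenacity's behaviour.

```python
import random
import time


class PebbleConnectionError(Exception):
    """Stand-in for ops.pebble.ConnectionError (assumption, so this runs without ops)."""


def call_with_retry(func, *, max_seconds=10.0, base_wait=2.0, jitter=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call func(), retrying on PebbleConnectionError until max_seconds has elapsed.

    Once the deadline passes, the last exception is re-raised. If nothing
    catches it inside a hook, the hook exits non-zero.
    """
    deadline = clock() + max_seconds
    while True:
        try:
            return func()
        except PebbleConnectionError:
            if clock() >= deadline:
                raise
            # Fixed wait plus random jitter before the next attempt.
            sleep(base_wait + random.uniform(0, jitter))
```

With this shape the handler has no except clause and no status juggling at all: either the call eventually works, or the exception propagates and the unit shows up as in error.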