Would a WorkloadReadyEvent be useful?

tony-meyer · 2 October 2024 22:13

An issue I’ve seen a few times (here’s a post from @sed-i for example) is a charm wanting to be woken up to do work when the application (workload) is ready, rather than when Pebble or Juju is ready.

If you configure Pebble checks, you do get a PebbleCheckRecovered event in Juju 3.6. However, whether you get it when the application first becomes ready is a race: if the application is slow enough to start that the checks begin failing (often enough to hit the threshold), then you will; if it’s fast enough that they don’t, then you won’t, because it’s a “recovered” event, not a “passed” event.
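To make the race concrete, here is a minimal pure-Python sketch of a check with a failure threshold. It is a simplified model rather than Pebble's implementation: the function name `simulate_check`, the event strings, and the sample result sequences are all made up for illustration.

```python
from typing import Iterable, List


def simulate_check(results: Iterable[bool], threshold: int = 3) -> List[str]:
    """Return the events a simplified, Pebble-like check would emit.

    Assumption: the check starts out "up" (before it has ever run), goes
    "down" after `threshold` consecutive failures, and comes back "up" on
    the next success.
    """
    events: List[str] = []
    status = "up"
    failures = 0
    for ok in results:
        if ok:
            failures = 0
            if status == "down":
                status = "up"
                events.append("check-recovered")
        else:
            failures += 1
            if status == "up" and failures >= threshold:
                status = "down"
                events.append("check-failed")
    return events


# Slow start: enough failures to hit the threshold before the first success.
slow_start = [False, False, False, False, True]
# Fast start: the workload is ready before the threshold is reached.
fast_start = [False, True, True]

print(simulate_check(slow_start))  # ['check-failed', 'check-recovered']
print(simulate_check(fast_start))  # []
```

In the fast-start case the workload is just as ready, but the charm is never told.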

@ppasotti has built a “tempo ready” event using Pebble custom notices (I think you could do this with a check as well). The key bits are roughly this handler:

    def _on_workload_pebble_custom_notice(self, event: PebbleNoticeEvent):
        if event.notice.key == self.workload_ready_notice_key:
            self.workload_container.stop("workload-ready")
            # more things to do here based on the workload being ready

and roughly this layer, which hits a /ready endpoint to determine if the workload is ready:

return Layer(
    {
        "services": {
            "workload-ready": {
                "override": "replace",
                "summary": "Notify charm when workload is ready",
                "command": f"""watch -n 5 '[ $(wget -q -O- http://localhost:{self.workload_http_server_port}/ready) = "ready" ] &&
                           ( /charm/bin/pebble notify {self.workload_ready_notice_key} ) ||
                           ( echo "workload not ready" )'""",
                "startup": "disabled",
            }
        },
    }
)

If the application has the ability to configure a hook when ready (like an nginx init script) then you can obviously do this more simply by just adding a pebble notify there. However, there are a lot of applications that don’t offer this.

Question 1: does your charm already have some sort of “application ready” event, and if so, how are you implementing that? Links to examples would be really great; even if the answer is “we can already do this” then a how-to guide that includes links to a bunch of examples would be useful.

Question 2: if you don’t have an “application ready” event, is this something you wished you had? (K8s only? Machine only? Machine and K8s?)

Question 3: do your applications have a way to distinguish between “running” and “ready”? It would be great to have some examples of what this is - are you looking for a specific log message, does answering on a port mean “ready”, is there a /ready endpoint, etc. Pebble/Juju/ops isn’t going to be able to solve this if you don’t, but it would be helpful to know what sort of thing you’d trigger on.

Somewhat related to this, Pebble checks are currently either “up” or “down”, even if they have not run for the first time yet. If they have a level, then for K8s we need to transform whatever the status is into a binary answer.

Question 4: would you find checks more useful if there was a different initial state, like “pending” or “unknown”?

ppasotti · 4 October 2024 13:56

I was thinking to extend this mechanism with a two-way one:

service 0: runs the server

service 1: checks that the endpoint is up; when it is, notify

service 2: checks that the endpoint is down; when it is, notify

on restart:

start service 0
start service 1

on notice 1:

stop service 1
start service 2

on notice 2:

stop service 2
start service 1

this would effectively give you a notice every time the status of the server changes from up to down and from down to up, similar to the check-recovered and check-failed mechanism in modern pebble, but implemented with notices (and without thresholds).
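A rough pure-Python model of that toggling, with no Pebble involved. The class name `TwoWayNotifier` and the "up"/"down" notice labels are hypothetical, and the charm's handling of each notice (stop one watcher, start the other) is collapsed into `poll` for brevity:

```python
from typing import List


class TwoWayNotifier:
    """Model of two alternating watcher services around one endpoint."""

    UP_WATCHER = "service 1"    # notifies when the endpoint comes up
    DOWN_WATCHER = "service 2"  # notifies when the endpoint goes down

    def __init__(self) -> None:
        # "on restart": only the up-watcher runs alongside the server.
        self.active = self.UP_WATCHER
        self.notices: List[str] = []

    def poll(self, endpoint_up: bool) -> None:
        """One polling round of whichever watcher is currently active."""
        if self.active == self.UP_WATCHER and endpoint_up:
            self.notices.append("up")        # notice 1
            self.active = self.DOWN_WATCHER  # stop service 1, start service 2
        elif self.active == self.DOWN_WATCHER and not endpoint_up:
            self.notices.append("down")      # notice 2
            self.active = self.UP_WATCHER    # stop service 2, start service 1


notifier = TwoWayNotifier()
for observed in [False, False, True, True, False, True]:
    notifier.poll(observed)
print(notifier.notices)  # ['up', 'down', 'up']
```

Because only one watcher is active at a time, repeated observations of the same state produce no extra notices, and the first "up" is reported even though nothing ever failed.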

sed-i · 7 October 2024 17:21

Thanks for resurfacing this @tony-meyer!

A while ago I was thinking of using urlwatch as an additional service that would create pebble notices.
I wonder if there’s anything we could change about the spec of PebbleCheckRecovered. In my mind it is useful to use “rising edge” and “falling edge” analogies. PebbleCheckRecovered sounds like a rising edge trigger, and it seems that it should be issued even if the check never failed before.
Existing use case: WAL replay during startup may take a long time.
That being said, I feel like we need to have concrete examples of what we would do with such an event. For example, it doesn’t make sense to publish over relation data the URL to prometheus only after it is ready, because if prometheus crashes in the future, the URL is still in relation data, and I’m not sure we should treat the first startup sequence differently.

dimaqq · 10 December 2024 01:56

Re edge interrupts, yes check failed and recovered are edge-triggered events, and today they are triggered on 1 → 0 and 0 → 1 transitions.

The question is how to handle the initial state, undefined.

One option, as @sed-i rightfully suggests, would be to redefine initial state as 0 and trigger check recovered event as soon as check succeeds N times, same as recovery after failure.

Another option, what @tony-meyer is asking about in the OP, is to have a separate event on the undefined → 1 transition. I think that implies that check failed is issued on the undefined → 0 transition.
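A small sketch comparing today's behaviour with these two options. It ignores thresholds and works on already-debounced states; the function `edge_events` and the event name "workload-ready" are hypothetical, chosen only for illustration:

```python
from typing import Iterable, List, Optional


def edge_events(states: Iterable[bool], initial: Optional[bool]) -> List[str]:
    """Emit an event for every transition in a sequence of check states.

    `initial` is the state assumed before the check has run:
    True (1), False (0), or None for "undefined".
    """
    events: List[str] = []
    previous = initial
    for current in states:
        if previous is None:
            # undefined -> 1 gets its own event; undefined -> 0 is a failure.
            events.append("workload-ready" if current else "check-failed")
        elif current != previous:
            events.append("check-recovered" if current else "check-failed")
        previous = current
    return events


states = [True, False, True]  # comes up, crashes, comes back

print(edge_events(states, initial=True))   # today: initial state is "up"
# ['check-failed', 'check-recovered']
print(edge_events(states, initial=False))  # option 1: initial state is 0
# ['check-recovered', 'check-failed', 'check-recovered']
print(edge_events(states, initial=None))   # option 2: undefined initial state
# ['workload-ready', 'check-failed', 'check-recovered']
```

Only the first event differs between the three variants; everything after the first observation is the same 1 → 0 and 0 → 1 edge detection.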

One bit that’s unclear to me is if there would be a single event per workload, or a single event per check. In other words, what should happen if multiple checks are set up?

Taking the prometheus example, I was thinking:

pebble ready: a gate to (configure and) replan to kick off prometheus workload
workload ready: a gate to declare prometheus healthy, may now publish its address on a relation

The long startup (wal replay) case, perhaps like this:

have two separate checks
- service is running (may still be busy with wal)
- service is ready (wal replay done)
the former may be used for something internally
the latter is a gate to publish address on the relation
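A sketch of what those two checks could look like, written as a Python dict in the shape of a Pebble layer. The check names, periods, thresholds, and the choice of a TCP check on Prometheus' default port 9090 plus an HTTP check on its /-/ready endpoint are assumptions for illustration, as is the helper `may_publish_address`:

```python
from typing import Dict

# Assumed layer fragment: one check for "running", one for "ready".
CHECKS_LAYER = {
    "checks": {
        "running": {
            "override": "replace",
            "level": "alive",
            "period": "10s",
            "threshold": 3,
            # Port is open: the process is up, but may still be replaying WAL.
            "tcp": {"port": 9090},
        },
        "ready": {
            "override": "replace",
            "level": "ready",
            "period": "10s",
            "threshold": 3,
            # Readiness endpoint answers OK: WAL replay is done.
            "http": {"url": "http://localhost:9090/-/ready"},
        },
    }
}


def may_publish_address(check_statuses: Dict[str, str]) -> bool:
    """Gate publishing on the "ready" check only, not on "running"."""
    return check_statuses.get("ready") == "up"


print(may_publish_address({"running": "up", "ready": "down"}))  # False
print(may_publish_address({"running": "up", "ready": "up"}))    # True
```

In a charm the statuses would presumably come from something like the container's `get_checks()` call rather than a hand-built dict. Note that because checks currently report “up” before they have run for the first time, this gate on its own could open too early.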