Handling relation-level errors in a charm

The Problem:

When charming complex applications such as Istio, the charm often needs to support several different types of relations. Each relation type may be used by many (dozens or more) applications in a deployment. If something goes wrong with one of the related applications, how can that error be propagated up to an administrator without blocking the entire functionality of the charm, and without the charm author losing their sanity?

As a concrete example, this represents a possible istio-pilot deployment, with one failed relation highlighted in purple:

[Image: relation-statuses]

Existing Solutions

Right now, we handle this in serialized-data-interface in a very blunt manner:

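from ops.charm import CharmBase
from ops.model import ActiveStatus, BlockedStatus, WaitingStatus
from serialized_data_interface import NoCompatibleVersions, NoVersionsListed, get_interfaces


class Operator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)

        # Any single relation with bad data aborts setup for the whole charm.
        try:
            self.interfaces = get_interfaces(self)
        except NoVersionsListed as err:
            self.model.unit.status = WaitingStatus(str(err))
            return
        except NoCompatibleVersions as err:
            self.model.unit.status = BlockedStatus(str(err))
            return
        else:
            self.model.unit.status = ActiveStatus()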

This is non-ideal for one very big reason: if one app sends bad data, that blocks the entire charm from working until the issue is resolved by an administrator. We could get clever in charm code, and write something like this:

class Operator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # SerializedDataInterface is a hypothetical per-endpoint wrapper that
        # exposes valid_apps and invalid_apps.
        self.ingress = SerializedDataInterface(self, "ingress")
        self.ingress_auth = SerializedDataInterface(self, "ingress-auth")

        # Handlers are named differently from the attributes above so that
        # they do not shadow each other.
        self.framework.observe(self.on.ingress_relation_changed, self._on_ingress_changed)
        self.framework.observe(self.on.ingress_auth_relation_changed, self._on_ingress_auth_changed)

    def _on_ingress_changed(self, event):
        for app in self.ingress.valid_apps:
            pass  # do something
        if self.ingress.invalid_apps:
            self.unit.status = BlockedStatus("Invalid relations: " + ", ".join(app.name for app in self.ingress.invalid_apps))

    def _on_ingress_auth_changed(self, event):
        for app in self.ingress_auth.valid_apps:
            pass  # do something
        if self.ingress_auth.invalid_apps:
            self.unit.status = BlockedStatus("Invalid relations: " + ", ".join(app.name for app in self.ingress_auth.invalid_apps))

But what happens if both of those need to set BlockedStatus? What code sets ActiveStatus after the issue is fixed? These issues could be addressed by forcing charm authors to write a lot of code at the expense of their sanity, all of which would be obsoleted by the following proposal.
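To make the bookkeeping concrete, here is a minimal sketch of the kind of status-combining code a charm author would have to write today. It is plain Python rather than real ops code: RelationView and combined_status are hypothetical names used only for illustration, and the status is reduced to a (name, message) pair.

```python
from dataclasses import dataclass, field


@dataclass
class RelationView:
    """Hypothetical stand-in for a per-endpoint wrapper with invalid_apps."""

    name: str
    invalid_apps: list = field(default_factory=list)


def combined_status(views):
    """Return one (status_name, message) pair covering every endpoint.

    The status is recomputed from scratch each time, so one handler never
    overwrites another handler's error, and "active" comes back by itself
    once no endpoint reports invalid apps.
    """
    problems = [
        f"{view.name}: {', '.join(sorted(view.invalid_apps))}"
        for view in views
        if view.invalid_apps
    ]
    if problems:
        return ("blocked", "Invalid relations: " + "; ".join(problems))
    return ("active", "")


ingress = RelationView("ingress", ["app2"])
ingress_auth = RelationView("ingress-auth")
print(combined_status([ingress, ingress_auth]))

ingress.invalid_apps.clear()  # the administrator fixed app2
print(combined_status([ingress, ingress_auth]))
```

Every handler would need to call something like this after doing its own work, which is exactly the cross-cutting concern the proposal tries to remove.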

Proposed Solution

Juju should allow setting relation-level statuses that play nicely with tools such as juju wait. The high-level picture is something like this:

$ juju status --relations
Model  Controller  Cloud/Region        Version  SLA          Timestamp
demo   uk8s        microk8s/localhost  2.9.16   unsupported  12:34:56-00:00

App                 Version                Status  Scale  Charm               Store       Channel  Rev  OS          Address         Message
istio-pilot         res:oci-image@4707912  active      1  istio-pilot         charmstore  stable    55  kubernetes  10.1.1.1
app1                res:oci-image@7654321  active      1  mlflow-server       charmstore  stable     9  kubernetes
app2                res:oci-image@1234567  active      1  mlflow-server       charmstore  stable     2  kubernetes

Unit                   Workload  Agent  Address      Ports     Message
istio-pilot/0*         active    idle   10.1.1.1     8080/TCP
app1/0*                active    idle   10.1.1.2     5000/TCP
app2/0*                active    idle   10.1.1.3     5000/TCP

Relation provider     Requirer       Interface       Type     Message
istio-pilot:ingress   app1:ingress   ingress         regular  
istio-pilot:ingress   app2:ingress   ingress         regular  BlockedStatus("app2 sent invalid data")

Tooling such as juju wait could propagate the error state and display an error, as it does with regular charm statuses today.

From a charm code point of view, this could be as simple as something like:

self.model.relations['ingress']['app2'].status = BlockedStatus('app2 sent invalid data')

That way, the charm code can still set self.model.app.status = ActiveStatus() to represent that the istio-pilot workload is running fine, and the ingress and ingress-auth bits of code don’t have to know or care about that status.
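As a rough illustration of that separation, the sketch below models the proposed shape in plain Python. None of it is existing ops or Juju API: Status, RelationState, ModelSketch and blocked_relations are hypothetical names, and the nested relations[endpoint][app] layout simply mirrors the one-liner above.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    name: str = "active"
    message: str = ""


@dataclass
class RelationState:
    status: Status = field(default_factory=Status)


@dataclass
class ModelSketch:
    """Hypothetical model: one app status plus one status per relation."""

    app_status: Status = field(default_factory=Status)
    # relations[endpoint][remote_app] -> RelationState
    relations: dict = field(default_factory=dict)

    def blocked_relations(self):
        """List (endpoint, app, message) for every blocked relation."""
        return [
            (endpoint, app, state.status.message)
            for endpoint, apps in self.relations.items()
            for app, state in apps.items()
            if state.status.name == "blocked"
        ]


model = ModelSketch(
    relations={"ingress": {"app1": RelationState(), "app2": RelationState()}}
)
# Only the relation with app2 is marked; the workload status is untouched.
model.relations["ingress"]["app2"].status = Status("blocked", "app2 sent invalid data")

print(model.app_status.name)
print(model.blocked_relations())
```

A wait-style tool could then inspect something like blocked_relations() alongside the usual app and unit statuses.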

FAQ

  • How would this work with rich statuses in Juju?
    • These features would be complementary. We wouldn’t want the user to have to interpret a free-form JSON blob, for example, but allowing a charm to set a relation status with rich statuses would work well. However, the interface for a charm author would probably end up being mediated via the ops library, which would probably expose something like the above self.model.relations['ingress']['app2'].status anyway.

We already have status-set --application (for apps) and status-set (for units). We could add status-set --rel, which would have the same semantics as other relation hook commands: if run in a relation hook, the relation ID is implicit; otherwise it needs to be specified.
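The implicit-versus-explicit rule could be sketched as follows. This is only an illustration of the semantics, not Juju's implementation: resolve_relation_id is a hypothetical helper, and it assumes the relation hook context is detected through the JUJU_RELATION_ID environment variable that Juju sets for relation hooks.

```python
import os


def resolve_relation_id(explicit=None, environ=None):
    """Pick the relation a status update applies to.

    An explicitly passed ID always wins. Otherwise fall back to the
    relation hook context, and fail if there is none.
    """
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    implicit = environ.get("JUJU_RELATION_ID")
    if implicit:
        return implicit
    raise ValueError("not in a relation hook: a relation ID must be specified")


# Inside an ingress relation hook, the ID is implicit.
print(resolve_relation_id(environ={"JUJU_RELATION_ID": "ingress:3"}))
# Outside a relation hook (e.g. update-status), it has to be given.
print(resolve_relation_id(explicit="ingress:4", environ={}))
```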

We’d need to be careful around the fact that Juju currently uses the relation message to display the joining/joined/suspended etc. status. Adding another purely “status” header might be OK, with “message” used for text from the charm. This mirrors unit status, where there are separate spots for charm and Juju agent content.


I think this statement is a bit of a stretch. It isn’t hard to have one bit of code that notices that there are no invalid X items. And, in fact, you’d want to combine that anyway, since an invalid-app status should not get overwritten when evaluating an invalid auth change. And the design of evaluating all relations on every hook (even if they aren’t involved in this conversation) does seem a bit overbroad. (I don’t know why ingress-relation-changed needs to care about ingress-auth-relation-changed for a different application/relation, but it would certainly be recommended to have charm event handlers minimize what they have to think about.)

That said, we certainly do want to provide relation status; it is something that we have discussed and do want to incorporate into the model. It does get tricky because each side of a relation has its own view of how things are going. One side might be perfectly happy while the other is unhappy. Or one unit of an HA application may be happy with the relation, but that doesn’t mean the application itself is in error.