How I found out about the perfect event storm

I was hacking around on the tempo-coordinator charm and wanted to trigger the codepath that handles the ingress-relation-changed event. So I ran:

jhack fire tempo ingress-relation-changed

And then I noticed that the juju status looked busier than I’d expect:

image

I wanted to understand what all those events were about, so I ran:

jhack tail

and I realized that all those units were reacting to changes in the ‘tracing’ relation. I was simulating an ingress-changed event, and the tempo charm, even though the ingress data had not in fact changed, was somehow triggering a cascade of tracing-relation databag changes that in turn woke up all those other units.

In a large deployment this could be an issue, and if other charms were to have the same behaviour, this could result in a fantastic event storm.

So I fired up

jhack show-relation tempo:tracing loki -w

to see what changes tempo was making to the databag to awaken loki, and this showed up:

image

after simulating ingress-changed:

image

It’s subtle, but it’s there. The issue is simple: tempo is JSON-dumping a list to the databag in an apparently random order. Every time it does so, there’s a chance the elements will come out in a different order, which changes the databag content and triggers a cascade of relation-changed events.

In some cases order does matter, but in this case it doesn’t.
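For illustration, here is a minimal sketch of an order-independent dump. It is not the actual tempo code: the `stable_dump` helper and the receiver entries are hypothetical, and it assumes the list holds plain JSON-serializable dicts whose order carries no meaning.

```python
import json


def stable_dump(receivers):
    """Serialize a list of dicts so that equal content always gives the same string.

    Hypothetical helper, not taken from the tempo charm.
    """
    # Sort the entries by their own canonical JSON form, so the result
    # does not depend on the order in which the entries were collected.
    ordered = sorted(receivers, key=lambda r: json.dumps(r, sort_keys=True))
    # sort_keys=True also pins the key order inside each entry.
    return json.dumps(ordered, sort_keys=True)


# Made-up entries, only to show the idea.
first = [{"protocol": "zipkin", "port": 9411}, {"protocol": "otlp_http", "port": 4318}]
second = [{"port": 4318, "protocol": "otlp_http"}, {"port": 9411, "protocol": "zipkin"}]

# Same content in a different order gives the same databag value.
print(stable_dump(first) == stable_dump(second))
```

If the string written to the databag is identical to what is already there, nothing has changed from the remote units' point of view, so they have no reason to wake up.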

So here’s our bug and our fix.

Lessons learned

  1. jhack is awesome
  2. always test your databag dumping logic to ensure the outcome is stable
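
As a rough idea of what such a test could look like, here is a sketch that feeds every ordering of the same items to a dump function and checks that only one distinct string comes out. `dump_receivers` is a hypothetical stand-in for whatever your charm uses to write the databag, and the protocol names are made up.

```python
import itertools
import json


def dump_receivers(receivers):
    # Hypothetical stand-in for the charm's databag serializer.
    return json.dumps(sorted(receivers))


def is_order_stable(dump, items):
    """Return True if dump() gives the same string for every ordering of items."""
    outputs = {dump(list(p)) for p in itertools.permutations(items)}
    return len(outputs) == 1


protocols = ["zipkin", "otlp_http", "jaeger_grpc"]

# The sorted dump is stable; a plain json.dumps of the list is not.
print(is_order_stable(dump_receivers, protocols))
print(is_order_stable(json.dumps, protocols))
```

Trying every permutation only makes sense for small lists, since the number of orderings grows factorially. For longer ones, a handful of random shuffles is probably enough.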

Nice find!