I was hacking around on the tempo-coordinator charm and I wanted to trigger a codepath involved in handling the ingress-relation-changed event.
So I ran
jhack fire tempo ingress-relation-changed
And then I noticed that the juju status looked busier than I’d expect:
I wanted to understand what all those events were about, so I ran:
jhack tail
and I realized that all those units were reacting to changes in the ‘tracing’ relation.
I was simulating an ingress-changed event, and the tempo charm, even though the ingress data had not in fact changed, was somehow triggering a cascade of tracing-relation databag changes that in turn woke up all those other units.
In a large deployment this could be an issue, and if other charms were to have the same behaviour, this could result in a fantastic event storm.
So I fired up
jhack show-relation tempo:tracing loki -w
to see what changes tempo was making to the databag to awaken loki, and this showed up:
after simulating ingress-changed:
It’s subtle, but it’s there. The issue is simple: tempo is JSON-dumping a list to the databag in an apparently random order. Every time it does so, there’s a chance the elements will come out in a different order. Juju then sees different databag content, even though the data is semantically the same, and triggers a cascade of relation-changed events.
In some cases order does matter, but in this case it doesn’t.
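To make the failure mode concrete, here is a minimal sketch in Python. It is not the actual tempo code: the function names and the shape of the receiver entries are assumptions made up for illustration. It only shows that the same elements serialised in a different order give a different string, and that sorting before dumping removes the difference.

```python
import json


def dump_unstable(receivers):
    # Hypothetical helper: serialises the entries in whatever order they arrive.
    return json.dumps(list(receivers))


def dump_stable(receivers):
    # Hypothetical helper: sorts the entries by their canonical JSON form first,
    # so the same set of entries always yields the same string.
    canonical = sorted(receivers, key=lambda r: json.dumps(r, sort_keys=True))
    return json.dumps(canonical, sort_keys=True)


# Same entries, different order (made-up example data).
first = [
    {"protocol": "otlp_http", "url": "http://tempo:4318"},
    {"protocol": "zipkin", "url": "http://tempo:9411"},
]
second = list(reversed(first))

# The unordered dump differs, which is enough to look like a databag change.
print(dump_unstable(first) == dump_unstable(second))
# The sorted dump is identical regardless of input order.
print(dump_stable(first) == dump_stable(second))
```

Sorting is only appropriate when order carries no meaning, as in this case. Where order does matter, the producer has to keep it deterministic some other way.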
So here’s our bug and our fix.
Lessons learned
- jhack is awesome
- always test your databag dumping logic to ensure the outcome is stable
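One way to test for stability, sketched in Python under stated assumptions: `dump_to_databag` is a hypothetical stand-in for whatever function your charm uses to serialise data into the databag, and `assert_stable_dump` is an illustrative helper, not part of any library. The idea is to feed the serialiser the same items in many shuffled orders and check that the output never changes.

```python
import json
import random


def dump_to_databag(endpoints):
    # Hypothetical serialiser standing in for the charm's real dumping logic.
    return json.dumps(sorted(endpoints), sort_keys=True)


def assert_stable_dump(dump, items, rounds=20, seed=0):
    """Check that `dump` returns the same string for any ordering of `items`."""
    rng = random.Random(seed)  # seeded so the test itself is reproducible
    reference = dump(list(items))
    for _ in range(rounds):
        shuffled = list(items)
        rng.shuffle(shuffled)
        assert dump(shuffled) == reference, f"unstable dump for order {shuffled}"
    return reference


# Made-up example data.
result = assert_stable_dump(dump_to_databag, ["zipkin", "otlp_http", "otlp_grpc"])
print(result)
```

A check like this fits naturally in a unit test next to the code that writes to the relation.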