Questions about charm resources

jamesbeedy · 13 August 2021 14:09

Juju resources have needed attention for quite some time. I first started hitting issues with simultaneous charm resource downloads when deploying spark/hadoop and sending the spark.tar.gz and the hadoop.tar.gz as charm resources (each ~200MB). I initially filed a bug here, although that is different from what we see now. Concerning that bug^ (I think it can be closed, as we don’t get download failures anymore but instead hit a different issue): we would get download failures when multiple agents tried to grab resources from the controllers simultaneously, and we had to implement some sort of retry in the charm code. We currently experience a different issue, where multiple simultaneous agent resource requests take down the controllers entirely. We first started downing the environment via juju resources when sending the slurm snap as a resource (also ~200MB): we would upgrade the snap resource and it would lock everything up with no sign of coming back. The resources primitive in juju is entirely unusable in environments with more than a few units, especially if you don’t have massive juju controller infrastructure with fast disk and a big pipe. Bringing this problem to light has been high on my squeak list for some time now.
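
For anyone wondering what "some sort of retry in the charm code" can look like, here is a minimal sketch, not our actual charm code. It wraps any fetch callable in exponential backoff with random jitter, so that units which fail together do not all retry at the same instant. The function name, parameters, and defaults are assumptions made up for illustration.

```python
import random
import time


def fetch_with_retry(fetch, attempts=5, base_delay=1.0, max_delay=60.0,
                     sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    Hypothetical helper: `fetch` is any zero-argument callable that
    raises on failure. Delays use "full jitter": a random wait between
    0 and an exponentially growing cap.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

In an ops-framework charm the callable could be something like `lambda: self.model.resources.fetch("slurm")`, where the resource name is just a placeholder. Note that this only papers over failed downloads; it does nothing for the case where the controllers themselves go down.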

Edit: some previous squeaking about this issue

And the bug for the issue Use a key-based mutex lock to download resources/agent bins once. by hpidcock · Pull Request #13215 · juju/juju · GitHub

The bug says that it is resolved via Use a key-based mutex lock to download resources/agent bins once. by hpidcock · Pull Request #13215 · juju/juju · GitHub, but that PR addresses resources being pulled from the store multiple times, not the “thundering herd” issue of simultaneous agent resource downloads. Possibly that bug can be reopened, if I’m reading this correctly?
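
To make the distinction concrete, here is a rough Python sketch of what I understand a key-based lock to buy you. This is not the Go code from the PR, and every name in it is hypothetical. Concurrent callers asking for the same key trigger only one upstream fetch, but every caller still receives the full payload afterwards, which is the part that hurts when many agents ask at once.

```python
import threading


class KeyedOnce:
    """Hypothetical helper: fetch each key from upstream at most once."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the per-key lock table
        self._locks = {}
        self._cache = {}

    def get(self, key, fetch):
        # Look up (or create) the lock that belongs to this key.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        # Only one caller per key performs the upstream fetch; the rest
        # wait here and then read the cached result.
        with lock:
            if key not in self._cache:
                self._cache[key] = fetch(key)
            return self._cache[key]
```

With this in place, the store is hit once per resource, yet N agents calling `get` at the same moment still means N copies of a ~200MB blob being served by the controller.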