What is the recommended action to take when a charm does not find a resource that is critical for it to work? For example, a resource that contains the binaries for the service provided by the charm. Should I set the status to Blocked and wait for a human to supply it? Or should the charm try to download it from a known URL?
Is there a way to verify the integrity of that resource? When we install packages on Linux distros, the package manager usually verifies the PGP signatures of the packages against a known list of keys. This way we can verify the integrity and have some level of guarantee that the package was not modified by a third party. How can we have something similar with charm resources?
2b) Same question, but for charms: can we PGP-sign a charm and upload it to the store?
What is the overhead on the controllers and the network? If I run juju add-unit foo -n 137, will the Juju controllers supply the resource to 137 new units simultaneously? If this floods the NICs of the controllers, what happens to the other models, units, and debug-logs? We are seeing some weird instabilities in our controllers, so we don’t feel comfortable adding more load to them.
Where exactly is the charm resource stored on the controllers?
Are the charm resources part of the periodic controller backup?
Blocked is indeed used to indicate that human intervention is needed to correct some deployment problem. Charms and their resources are published to the store as a single source of information. This works well in situations where a firewall rule or proxy is in use, and it also ensures the published tuple of charm/resource revisions is guaranteed to work together. Having the charm go outside that and hit some other URL could easily break the air-gapped deployment scenario.
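For illustration, here is a minimal sketch of the "go Blocked and wait for a human" pattern, assuming an ops-framework charm with a resource named service-binaries declared in metadata.yaml (the resource name and install logic are placeholders, not anything from this thread):

```python
from ops.charm import CharmBase
from ops.main import main
from ops.model import ActiveStatus, BlockedStatus, ModelError


class MyCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.install, self._on_install)

    def _on_install(self, event):
        try:
            # fetch() calls resource-get under the hood and returns the path
            # of the blob cached on the unit; it raises ModelError when no
            # usable resource is attached.
            path = self.model.resources.fetch("service-binaries")
        except ModelError:
            # Surface the problem to the operator instead of fetching from an
            # arbitrary URL, which would break air-gapped deployments.
            self.unit.status = BlockedStatus(
                "missing resource: run juju attach-resource <app> service-binaries=<file>"
            )
            return
        self._install_from(path)
        self.unit.status = ActiveStatus()

    def _install_from(self, path):
        ...  # placeholder: unpack and install the binaries


if __name__ == "__main__":
    main(MyCharm)
```

Once the operator attaches the missing resource, the charm gets another chance on a subsequent hook to fetch it and clear the Blocked status.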
Resources are published with metadata containing a checksum, and Juju verifies this internally when the resource is cached on the controller and again when it is sent to a workload machine. I am not sure offhand whether charm upload has the same feature.
In the next release of Juju 2.9 (2.9.11), the Juju controller will serialise attempts to download resources from the store. Resources are cached on the controller, but previously simultaneous requests were serviced independently, so the controller could fetch the same blob from the store several times. However, there’s currently no throttling on serving the resources from the controller to any units which request them.
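To make the "serialise and cache" behaviour concrete, here is a rough Python sketch of the de-duplication idea (illustrative only; the controller implements this in Go, and all names here are invented):

```python
import threading


class ResourceCache:
    """Ensure concurrent requests for the same resource trigger one store download."""

    def __init__(self, fetch_from_store):
        self._fetch = fetch_from_store   # callable: resource_id -> blob
        self._cache = {}                 # resource_id -> blob already downloaded
        self._inflight = {}              # resource_id -> Event for an ongoing download
        self._lock = threading.Lock()

    def get(self, resource_id):
        while True:
            with self._lock:
                if resource_id in self._cache:
                    return self._cache[resource_id]
                waiter = self._inflight.get(resource_id)
                if waiter is None:
                    # First requester: mark the download as in flight.
                    waiter = self._inflight[resource_id] = threading.Event()
                    downloading = True
                else:
                    downloading = False
            if downloading:
                try:
                    blob = self._fetch(resource_id)
                    with self._lock:
                        self._cache[resource_id] = blob
                    return blob
                finally:
                    with self._lock:
                        del self._inflight[resource_id]
                        waiter.set()
            else:
                # Someone else is downloading; wait, then re-check the cache.
                waiter.wait()
```

Note that this only de-duplicates controller-to-store traffic; as mentioned above, nothing yet throttles controller-to-unit serving.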
The resources are stored in a MongoDB GridFS database called “blobstore”. Resource metadata is stored in the “resources” collection in the “juju” database.
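If you want to poke at this yourself, something along these lines works from a controller machine, assuming you have already pulled the MongoDB address and credentials out of the agent configuration (the connection string below is a placeholder, and you should treat the databases as read-only):

```python
from pymongo import MongoClient

# Placeholder connection string; the real host, port, credentials and TLS
# options come from the controller's agent configuration.
client = MongoClient("mongodb://user:password@127.0.0.1:37017/?authSource=admin")

# Resource blobs live in the GridFS-backed "blobstore" database.
print(client["blobstore"].list_collection_names())

# Resource metadata lives in the "resources" collection of the "juju" database.
for doc in client["juju"]["resources"].find().limit(5):
    print(doc)
```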
About 2: Can I manually verify the checksums of the charms and resources? How can I check if what I have running after juju deploy is in fact what the charm author created?
About 3: Is there a way to limit the number of units that request a resource from the controllers? To throttle the number of simultaneous units requesting a resource? Is something like this planned? We really do not want to DoS our own infrastructure.
You can see the sha384 hash of the resources for an application using the juju resources command. When a charm fetches a resource using resource-get, the blob is saved to a resources directory under the charm directory, so you can compute the sha384 of that file and compare.
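For example, a quick way to do that comparison by hand is a small hashing script like the one below; the example path in the comment is illustrative and depends on the unit name and where the charm saved the blob:

```python
import hashlib
import sys


def sha384_of(path, chunk_size=1024 * 1024):
    """Stream the file through SHA-384 so large blobs don't need to fit in memory."""
    digest = hashlib.sha384()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # e.g. python3 sha384.py /var/lib/juju/agents/unit-foo-0/resources/<name>/<file>
    print(sha384_of(sys.argv[1]))
```

Compare the printed digest with the hash reported by the juju resources command for that application.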
There aren’t currently any concrete plans to throttle units fetching resources from the controller, but if it becomes an issue we can certainly look at it.
Juju resources have needed attention for quite some time. The first time I started hitting issues with simultaneous charm resource downloads was when deploying spark/hadoop and shipping spark.tar.gz and hadoop.tar.gz as charm resources (each ~200MB). I initially filed a bug here, although that is different from what we see now. Concerning that bug (I think it can be closed, as we no longer get download failures but instead hit a different issue), we would get download failures when multiple agents tried to grab resources from the controllers simultaneously, and we had to implement some sort of retry in the charm code (something like the sketch below).

We currently experience a different issue, where multiple simultaneous agent resource requests just entirely take down the controllers. It was when sending the slurm snap as a resource (also ~200MB) that we first started downing the environment via juju resources. We would upgrade the snap resource and it would lock everything up with no sign of coming back. The resources primitive in Juju is entirely unusable in environments with more than a few units, especially if you don’t have massive Juju controller infrastructure with fast disk and a big pipe. Bringing this problem to light has been high on my squeak list for some time now.
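The retry I mean is roughly along these lines (a sketch only; the function name and tunables are illustrative, and it simply wraps the resource-get hook tool with jittered exponential backoff):

```python
import random
import subprocess
import time


def fetch_resource_with_retry(name, attempts=5, base_delay=5.0):
    """Call resource-get, retrying with jittered exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            # resource-get prints the local path of the downloaded resource.
            return subprocess.check_output(["resource-get", name], text=True).strip()
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            # Back off (with jitter) so a herd of units doesn't keep hammering
            # the controllers at the same moment.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay))
```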
The bug says that it is resolved via https://github.com/juju/juju/pull/13215, but that PR addresses resources being pulled from the store multiple times, not the “thundering herd” of simultaneous agent resource downloads. Possibly that bug can be re-opened, if I’m reading this correctly?
Yeah, we have done a fix for the issue where the controller hits the store multiple times simultaneously, but have not yet done work to throttle the thundering herd problem of many agents asking for resources. The closed bug (IIANM) was raised due to the controller->store issue, not the agent->controller one. We’ll have to try and get that latter issue addressed.