Juju deploy optimization

Hi,

I have some questions about ways to optimize the deploy process with Juju

I mean, Juju works great, but I have a huge deployment that I split into parts using overlays, and when adding new apps/units it can take a long time before a new LXD container is started, even though there is no load pressure on the hosts.

Sometimes I deploy maybe 3 apps at once with 3 units each (so 9 LXD containers on 3 different hosts that are already up, configured and running; no new bare metal machine is spawned here), but it gives me the impression that Juju deploys the apps one after the other, waiting for the first to come up before spinning up the next, which wastes a lot of time for apps that are not always related.

Are there any options that can be tuned to make Juju a little faster?

Best regards.

In my experience, Juju has a few serial steps, but most operations are parallelized as much as possible.

When deploying new LXDs on existing metals, the creation of each of those LXDs is serialized, because the jujud-machine-X unit handles the creation and configuration of the LXDs. Once defined on your metals, those LXDs should all install Ubuntu and patch themselves in parallel, and any actions happening within those LXDs for a single application should be parallelized across all units of the application.
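If you want to watch where that serialization happens, one option (a sketch, assuming the metal hosting the containers is machine 0 in your model) is to follow the machine agent's log while the containers are being provisioned:

    juju debug-log --include machine-0
    # or directly on the metal:
    juju ssh 0 -- sudo tail -f /var/log/juju/machine-0.log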

Where it gets a little trickier is when you also have subordinate applications attached to those deployed in the LXDs (such as nrpe for monitoring), because hooks of units within the same machine context (per-LXD or per-metal) are serialized by the juju_machine_lock. You can check whether any hooks/application units are holding or waiting for the lock by logging into the machine and running “juju_machine_lock”. This is unavoidable in current Juju 2.8 and earlier models; however, I believe work is being done for future versions of Juju to let charm developers specify when a machine-wide lock is actually necessary, rather than it being a default lock taken during every hook execution.
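To inspect the lock, log into the machine or container in question and run the tool there; for example (machine 3 is just a placeholder):

    juju ssh 3
    # then, once on the machine:
    juju_machine_lock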

I often see slowness in the deployment of new LXDs in my development environments, and I have found a couple of tweaks that make them somewhat faster, or at least reduce the spin-up overhead: change the model’s container-image-stream from “released” to “daily”, and set enable-os-upgrade to “false” in the model-config (and perhaps in model-defaults if you want it to apply to new models). This is not the best choice for a production environment, but it’s useful for quick setup and teardown of test environments.
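Concretely, that looks something like this (a sketch run against the current model; model-defaults only affects models created afterwards):

    juju model-config container-image-stream=daily enable-os-upgrade=false
    juju model-defaults enable-os-upgrade=false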

You may also try pre-seeding the LXD images during your metal deployments with the model-config cloudinit-userdata option, like:

postruncmd:
  # Download the LXD metadata tarball and rootfs squashfs, then import them under the alias Juju looks for.
  - if hostname | grep -qv lxd; then wget --tries=15 --retry-connrefused --timeout=15 --random-wait -O /home/ubuntu/ubuntu-18.04-server-cloudimg-amd64-lxd.tar.xz https://cloud-images.ubuntu.com/releases/bionic/release/ubuntu-18.04-server-cloudimg-amd64-lxd.tar.xz; fi
  - if hostname | grep -qv lxd; then wget --tries=15 --retry-connrefused --timeout=15 --random-wait -O /home/ubuntu/ubuntu-18.04-server-cloudimg-amd64.squashfs https://cloud-images.ubuntu.com/releases/bionic/release/ubuntu-18.04-server-cloudimg-amd64.squashfs; fi
  - sleep 30
  - if hostname | grep -qv lxd; then lxc image import /home/ubuntu/ubuntu-18.04-server-cloudimg-amd64-lxd.tar.xz /home/ubuntu/ubuntu-18.04-server-cloudimg-amd64.squashfs --alias juju/bionic/amd64; fi
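One way to apply that (a sketch; cloudinit-userdata.yaml is just a placeholder file name holding the postruncmd block above) is:

    juju model-config cloudinit-userdata="$(cat cloudinit-userdata.yaml)"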

That preseeding of LXD images is most useful for offline deployments where you may not be able to reach the cloud-images server directly from the machines and need to use a local source for your LXD images. If your environment has slow connectivity to the Ubuntu mirrors, setting enable-os-upgrade=false and hosting a local copy of the LXD image may help improve your deployment times.

Also, care should be taken to consider the performance of the storage device where /var/lib/lxd is mounted, to ensure that the containerized applications will have enough I/O resources to run co-resident on the metal with other containerized applications.
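As a quick sanity check (a sketch; assumes the default directory-backed LXD storage under /var/lib/lxd rather than a dedicated ZFS/btrfs pool), you can confirm which device backs that path and what storage pools LXD is actually using:

    df -h /var/lib/lxd
    lxc storage list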

You may also find some benefit in using a local caching proxy for the apt-http-proxy and apt-https-proxy variables in the model-config, since each new LXD will be pulling roughly the same packages and updates (namely the security patches released between the latest image release and the day you’re performing your deployment). Using the local cache could reduce your internet-edge traffic from 9 downloads of the same packages to 1 copy downloaded remotely, with the remaining 8 queries fetching the packages from the proxy cache. If your metals are MAAS managed, I would suggest trying the squid proxy on the MAAS host as the apt-http(s)-proxy in your model.
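For example (a sketch; replace 10.0.0.2 with the address of your MAAS region controller, whose squid proxy listens on port 8000 by default):

    juju model-config apt-http-proxy=http://10.0.0.2:8000 apt-https-proxy=http://10.0.0.2:8000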


Thanks for the tips!
I’ll try some of them, but I don’t think this is related to network bandwidth, because I already use a proxy for apt/snap/…