Can't bootstrap with lxd on Arch Linux (EndeavourOS)

ianmjones · 4 October 2021 20:22

Has anyone been able to bootstrap a localhost (LXD) controller on Arch Linux?

I get the following error regardless of whether I’m using the lxd snap (which I tried first) or the community package of lxd (configured for unprivileged containers). Both versions of lxd are able to launch a focal container with no problems.

I’ve tried on both a hardware install and a VM install (VirtualBox); no dice with either.

Installing Juju machine agent
2021-10-04 19:41:18 INFO juju.cmd supercommand.go:56 running jujud [2.9.15 0 6a0461b47391cdb3464418f3eb58928d65a26773 gc go1.14.15]
2021-10-04 19:41:18 INFO juju.agent identity.go:22 writing system identity file
2021-10-04 19:41:18 ERROR juju.mongo mongo.go:649 could not set the value of "/proc/sys/net/core/somaxconn" to "16384" because of: "/proc/sys/net/core/somaxconn" does not exist, will not set "16384"
2021-10-04 19:41:18 ERROR juju.mongo mongo.go:649 could not set the value of "/proc/sys/net/core/netdev_max_backlog" to "1000" because of: "/proc/sys/net/core/netdev_max_backlog" does not exist, will not set "1000"
2021-10-04 19:41:18 ERROR juju.mongo mongo.go:649 could not set the value of "/sys/kernel/mm/transparent_hugepage/enabled" to "never" because of: open /sys/kernel/mm/transparent_hugepage/enabled: permission denied
2021-10-04 19:41:18 ERROR juju.mongo mongo.go:649 could not set the value of "/sys/kernel/mm/transparent_hugepage/defrag" to "never" because of: open /sys/kernel/mm/transparent_hugepage/defrag: permission denied
2021-10-04 19:41:18 WARNING juju.mongo mongo.go:479 overwriting args.dataDir (set to /var/lib/juju) to /var/snap/juju-db/common
2021-10-04 19:41:18 INFO juju.mongo mongo.go:484 Ensuring mongo server is running; data directory /var/snap/juju-db/common; port 37017
2021-10-04 19:41:18 WARNING juju.mongo service.go:326 configuring mongod  with --noauth flag enabled
2021-10-04 19:41:18 INFO juju.packaging manager.go:103 installing "juju-db" via "snap"
2021-10-04 19:41:18 INFO juju.packaging.manager run.go:88 Running: snap install  --channel 4.0/stable juju-db
ERROR failed to start mongo: juju-db snap not installed correctly. Executable /snap/bin/juju-db.mongod not found
ERROR failed to bootstrap model: subprocess encountered error code 1

Above was using:

stable juju snap v2.9.15
stable lxd snap v4.18

manadart · 7 October 2021 07:38

Try launching a container directly and installing the juju-db snap in it.

If Mongo is not usable, we can interrogate the logs to get to the bottom of it.
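A minimal sketch of that manual check, for anyone following along. The container name and the `run`/`check_juju_db` helpers are hypothetical; the channel and the executable path are the ones that appear in the bootstrap log above. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: install juju-db by hand in a fresh container, without Juju.
# CONTAINER, run and check_juju_db are hypothetical names for illustration.
CONTAINER="${CONTAINER:-mongo-test}"

# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_juju_db() {
    run lxc launch ubuntu:20.04 "$CONTAINER" &&
    # Wait until snapd inside the container has finished seeding.
    run lxc exec "$CONTAINER" -- snap wait system seed.loaded &&
    # Same channel that Juju requests in the log above.
    run lxc exec "$CONTAINER" -- snap install --channel 4.0/stable juju-db &&
    # The executable Juju reported as missing.
    run lxc exec "$CONTAINER" -- /snap/bin/juju-db.mongod --version
}
```

Calling `DRY_RUN=1 check_juju_db` lists the four commands; calling `check_juju_db` runs them. A freshly launched container may need a few seconds before networking is up, so the snap install can need a second attempt.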

Alternatively, use --keep-broken with your bootstrap command to achieve the same thing. The container will hang about for you to poke at.

wallyworld · 7 October 2021 09:29

This is fixed in the Juju 2.9.16 candidate, just published. See bug https://bugs.launchpad.net/bugs/1945752

Basically, mongod --version printed a cgroups warning along with the version number, which confused Juju when it verified that the snap was installed.
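To illustrate the kind of parsing problem described here (this is not Juju's actual code, which is written in Go), the hypothetical helper below pulls the version out of the command output while ignoring any other line, such as a warning printed before it. It assumes the usual "db version vX.Y.Z" line that mongod prints.

```shell
# Hypothetical helper, not Juju's implementation: read `mongod --version`
# output on stdin and print only the version number. Lines that do not
# start with "db version v" (for example a warning) are ignored.
extract_db_version() {
    sed -n 's/^db version v\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Usage would be something like `/snap/bin/juju-db.mongod --version 2>&1 | extract_db_version`. A check that assumes the version is on the first line of output breaks as soon as a warning is printed ahead of it; matching on the line's prefix avoids that.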

ianmjones · 7 October 2021 19:29

Thanks @wallyworld, it almost fixed the issue …

… but I found I needed to use this --keep-broken option along with the 2.9.16 candidate to get things working.

After another failure with the same issues after refreshing the juju snap to the candidate version, I rebooted the machine just to make sure all was squeaky clean, and then ran a bootstrap of a “dev” controller with --keep-broken so that I could have a poke around when it failed.

However, it didn’t fail, and I was then able to deploy juju-hello and check it worked etc.

I then re-tested by bootstrapping another “test” controller without the --keep-broken option, twice; both times it failed.

I then ran a final bootstrap for the “test” controller with --keep-broken again, and that worked.

I at least now have a workaround and can use Juju on EndeavourOS, thanks for your help @wallyworld and @manadart.

If you need any more info or want me to try anything for you, fire away!

wallyworld · 7 October 2021 22:28

The fact that this is intermittent, together with the error message “cannot dial mongo to initiate replicaset: no reachable servers”, could indicate that mongo is taking “too long” to start up. When Juju calls the upstream mongo API to make the connection, it passes in a timeout of 5 minutes; if that timeout passes, I think it returns the “no reachable servers” error. Unfortunately the timeout is not configurable. Is the controller VM sufficiently powerful, or is it perhaps CPU or I/O constrained, so that mongo takes too long to start up?
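As a rough illustration of that behaviour (a sketch, not the code in Juju or the mongo driver), a dial-with-deadline loop looks something like the following. The `retry_until` name and its arguments are hypothetical.

```shell
# Sketch of a retry loop with an overall deadline.
# Usage: retry_until DEADLINE_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
retry_until() {
    deadline="$1"; interval="$2"; shift 2
    start="$(date +%s)"
    while :; do
        # Success as soon as the probe command succeeds once.
        if "$@"; then
            return 0
        fi
        now="$(date +%s)"
        # Give up once the overall deadline has elapsed.
        if [ $((now - start)) -ge "$deadline" ]; then
            echo "no reachable servers" >&2
            return 1
        fi
        sleep "$interval"
    done
}
```

For example, `retry_until 300 5 nc -z 127.0.0.1 37017` would probe the mongo port from the log for up to 5 minutes, assuming a netcat that supports -z is installed. A server that comes up just after the deadline produces the same error as one that never starts, which would fit the intermittent failures.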

ianmjones · 8 October 2021 07:59

I don’t think so? ¯\_(ツ)_/¯

I’ve been hitting this issue on a laptop with 4 cores, 16 GB of RAM, and a couple of SSDs. Before running EndeavourOS on this laptop I generally ran derivatives of Ubuntu 18.04 or 20.04 and had no problems bootstrapping multiple controllers and running many units containing clusters of CockroachDB, Caddy, and a Go-based app I wrote, along with other “play” units.

The VirtualBox VM I tested last night (and showed a screenshot of) had 4 “cpus” and 16 GB of RAM, and runs on a machine with all SSDs, 12 cores (24 threads), and 128 GB of RAM.

As seen in that screenshot above, it’s definitely a timeout after 5 minutes that is visible on error; it feels like an age waiting for it to fail! When the bootstrap succeeds it is far quicker.

After posting the message about --keep-broken “fixing” things last night, I also successfully bootstrapped a couple of controllers on the laptop.

However, when bootstrapping a “stage” controller with --keep-broken on the vm this morning, it failed.

I ran it as time juju bootstrap localhost stage --keep-broken, so I could see that it took a very long 7m45s.

During the process I kept an eye on memory usage etc. and didn’t see anything getting constrained. In a screenshot taken just after it failed, the CPUs never got near 100% and the memory was basically flat the entire time.

It feels to me like a cgroups v2 issue more than anything, but that’s just a hunch. ¯\_(ツ)_/¯
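For anyone wanting to confirm which cgroup layout a host is using, a small sketch follows. It relies on GNU stat reporting the filesystem type of /sys/fs/cgroup, which is cgroup2fs on a unified (v2) host and tmpfs on a v1 or hybrid one. The function names are hypothetical.

```shell
# Map the filesystem type of the cgroup mount point to a cgroup layout.
classify_cgroup() {
    case "$1" in
        cgroup2fs) echo "v2 (unified)" ;;
        tmpfs)     echo "v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Inspect the mount point (default /sys/fs/cgroup) with GNU stat.
cgroup_mode() {
    classify_cgroup "$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null)"
}
```

Running `cgroup_mode` on the host and again inside the container (via lxc exec) shows whether both see the unified hierarchy.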

I’ve still got the failed juju-85d839-0 container hanging around in the VM if there are any logs etc. you’d like me to grab?

wallyworld · 10 October 2021 23:28

You could add --debug to the bootstrap command to get a little more verbose logging. But it just seems like mongo is having trouble starting. It would be interesting to see the results of starting a mongo shell and running rs.config(). Assuming mongodb did start, you can connect using something like

/snap/bin/juju-db.mongo 127.0.0.1:37017/juju --sslAllowInvalidCertificates --authenticationDatabase admin --ssl --username machine-0 --password <password>

where password is found by getting the “statepassword” value from the /var/lib/juju/agents/machine-0/agent.conf file.
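A small sketch of pulling that value out, assuming agent.conf holds it on a single top-level "statepassword: value" line. The `read_statepassword` helper is hypothetical, and if the value is quoted in the file the quotes are kept as-is.

```shell
# Hypothetical helper: print the statepassword value from an agent.conf.
# Defaults to the machine-0 path mentioned above.
read_statepassword() {
    sed -n 's/^statepassword:[[:space:]]*//p' \
        "${1:-/var/lib/juju/agents/machine-0/agent.conf}" | head -n 1
}
```

Inside the controller container this could be combined with the command above, passing `--password "$(read_statepassword)"` in place of the placeholder.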

“no reachable servers” comes from the underlying mongo api library so something with mongo’s startup is failing. Maybe the mongodb logs have something useful.

manadart · 13 October 2021 07:52

--keep-broken won’t do anything to fix this issue; all it does is keep the container around if bootstrap fails, instead of tearing it down.

If it fails, you can then lxc exec <container> bash and interrogate logs.

I would look at:

/var/log/cloud-init.log
/var/log/cloud-init-output.log
/var/log/syslog (I believe Mongo writes logs there).
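A sketch for grabbing the tail of each of those files from the host in one go. The `show_bootstrap_logs` function and the LXC and LINES_TO_SHOW variables are hypothetical; LXC exists only so the lxc binary can be swapped out for a dry run.

```shell
# Print the last lines of each log from inside a kept container.
# Usage: show_bootstrap_logs CONTAINER
show_bootstrap_logs() {
    container="$1"
    for log in /var/log/cloud-init.log \
               /var/log/cloud-init-output.log \
               /var/log/syslog; do
        echo "== $log =="
        # LXC defaults to the real lxc client; set LXC=echo for a dry run.
        ${LXC:-lxc} exec "$container" -- tail -n "${LINES_TO_SHOW:-50}" "$log"
    done
}
```

For example, `show_bootstrap_logs juju-85d839-0` for the container mentioned earlier in the thread. Piping the syslog portion through `grep -i mongo` narrows it down further.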

jameinel · 13 October 2021 13:39

Given the point at which it is failing (juju is failing to install mongodb via snap inside the container), I’m guessing the issue is that snaps-in-lxd-on-Arch is not supported. I know that has been a pain point elsewhere (they also don’t work back on Trusty), and there are times when the chained AppArmor rules don’t work well.

I would probably try without Juju in the mix: just spin up an LXD container and see whether you can install a snap inside it (juju-db is the obvious one, but is there something about that snap in particular, or does any snap fail?).

heitor · 13 October 2021 15:45

Is it possible to install Juju without snaps at all? I’d love to have native builds for my OS, but I was only able to install it on Ubuntu…

ianmjones · 14 October 2021 08:09

Thanks @wallyworld.

I’ve run a bunch of bootstraps on the VM and my main machine (now also running EndeavourOS) with that --debug option.

Long story short, the VM sometimes worked, sometimes not, but on my 24cpu/128Gb main machine it always worked.

This very much supports your theory that it may simply be down to either CPU or I/O constraints causing the initial MongoDB startup to be just slow enough that it times out.

Ran a few tests, snaps work fine, but do complain about cgroups v2…

root@focal:~# snap install hello-world
hello-world 6.4 from Canonical✓ installed
root@focal:~# hello-world
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Hello World!
root@focal:~#

I’ll keep this in mind if I manage to get any more breakage, thanks.