Can I make my Juju controller highly available (HA) across multiple LXD hosts?

I have three LXD hosts which are not clustered, but are presented to Juju as three different clouds.

I have a Juju controller running on one of the LXD hosts.

Now, my question is whether it would be possible to enable HA mode for the controller, such that the additional controller machines run in instances on the other LXD hosts.

What would be the correct way to do this?

I’m thinking along these lines:

juju switch controller              #running on lxdhostA
juju add-machine juju-controller-1  #running on lxdhostB
juju add-machine juju-controller-2  #running on lxdhostC
juju enable-ha

But… I’m not sure this is the right way to do it?

@hmlanigan @tlm

Hey @erik-lonroth

To my knowledge, no, this is not possible at the moment. Enabling HA will use the underlying controller model and thus the same cloud as that model.

Interesting idea though. How would you see this working, and what underlying benefits do you see from it? There are some underlying assumptions made about running Juju in HA, such as network reachability and latency/performance, that need to be satisfied for HA operations.

Ta tlm

I called up @hallback yesterday and he thinks it might be possible…

The underlying benefit is that I could get an HA environment even if I didn’t have an existing LXD cluster going. Perhaps I only had 2 hosts?

Not everyone can afford 7 hosts to run a proper LXD cloud.

I’m trying to architect a “small LXD” cluster which would still be resilient and still be able to run Juju etc. I’ve engaged @hallback in this and will base the whole thing on the approach @stgraber has posted a video about. But it needs to be affordable for a small company.

But until we have this in place, I need to be able to run the Juju controller on a second host.

Well, my idea was to take the three separate LXD hosts you have (currently three different clouds), form a LXD cluster from them, and then enable controller HA. HA worked, but everything else didn’t, so it was not a good idea after all. This is what happened:

+-------+------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME  |          URL           |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+-------+------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| lxd01 | https://lxd01.lxd:9443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|       |                        | database        |              |                |             |        |                   |
+-------+------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| lxd02 | https://lxd02.lxd:9443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+-------+------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| lxd03 | https://lxd03.lxd:9443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+-------+------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
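
For reference, the table above is the output of lxc cluster list. Forming the cluster itself went roughly like this, using LXD’s token-based joining (just a sketch from memory; the exact lxd init prompts depend on your LXD version, and joining wipes local state on the new member, which matters later):

# on the first host (lxd01): re-run init and answer yes to clustering,
# keeping it as a new cluster
lxd init

# still on lxd01: generate a join token for each additional host
lxc cluster add lxd02
lxc cluster add lxd03

# on lxd02 and lxd03: run init and paste the token when asked to join
# an existing cluster
lxd init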

With the initial controller running on lxd01, it would be possible to enable controller HA like this:

$ juju status
Model       Controller        Cloud/Region   Version  SLA          Timestamp
controller  lxd01-controller  lxd01/default  2.9.29   unsupported  14:13:10Z

Machine  State    DNS          Inst id        Series  AZ  Message
0        started  10.70.41.18  juju-b135a3-0  focal       Running

$ juju enable-ha -n 3 --to lxd02,lxd03
maintaining machines: 0
adding machines: 1, 2

Verify the result:

$ juju controllers --refresh
Controller         Model       User   Access     Cloud/Region   Models  Nodes  HA  Version
lxd01-controller*  controller  admin  superuser  lxd01/default       8     15   3  2.9.29

$ juju status
Model       Controller        Cloud/Region   Version  SLA          Timestamp
controller  lxd01-controller  lxd01/default  2.9.29   unsupported  14:33:38Z

Machine  State    DNS           Inst id        Series  AZ  Message
0        started  10.70.41.18   juju-b135a3-0  focal       Running
1        started  10.70.41.60   juju-b135a3-1  focal       Running
2        started  10.70.41.167  juju-b135a3-2  focal       Running

The controllers would end up on the correct hosts; see the juju-b135a3-N containers:

ubuntu@lxd03:~$ lxc list
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
|     NAME      |  STATE  |        IPV4         |                      IPV6                      |   TYPE    | SNAPSHOTS | LOCATION |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-5f4089-0 | RUNNING | 10.70.41.217 (eth0) | fd42:eb1c:1006:9cab:216:3eff:fecc:fbaf (eth0)  | CONTAINER | 0         | lxd01    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-6a6602-0 | RUNNING | 10.70.41.59 (eth0)  | fd42:eb1c:1006:9cab:216:3eff:fe73:3c8b (eth0)  | CONTAINER | 0         | lxd01    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-6a6602-1 | STOPPED |                     |                                                | CONTAINER | 0         | lxd02    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-6ecb41-0 | RUNNING | 10.70.41.82 (eth0)  | fd42:eb1c:1006:9cab:7c7f:a2ff:fe20:b7f2 (eth0) | CONTAINER | 0         | lxd02    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-188846-0 | RUNNING | 10.70.41.109 (eth0) | fd42:eb1c:1006:9cab:216:3eff:fedb:d963 (eth0)  | CONTAINER | 0         | lxd03    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-b135a3-0 | RUNNING | 10.70.41.18 (eth0)  | fd42:eb1c:1006:9cab:216:3eff:fe48:ec7 (eth0)   | CONTAINER | 0         | lxd01    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-b135a3-1 | RUNNING | 10.70.41.60 (eth0)  | fd42:eb1c:1006:9cab:216:3eff:fefa:df37 (eth0)  | CONTAINER | 0         | lxd02    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+
| juju-b135a3-2 | RUNNING | 10.70.41.167 (eth0) | fd42:eb1c:1006:9cab:216:3eff:fe19:6709 (eth0)  | CONTAINER | 0         | lxd03    |
+---------------+---------+---------------------+------------------------------------------------+-----------+-----------+----------+

This was my idea, but when I tested it, I figured out it was bad for two reasons, and I wouldn’t do this on a production system:

  • When joining a LXD cluster, the new member “forgets” its running containers, so they are not added to the cluster and are not visible anywhere. They continue to run and can be located on disk, but with bad luck you might destroy your storage pool. I don’t know if manually “importing” the containers back is possible. All units will show cannot upgrade machine’s lxd profile: 0: Instance not found
  • For some reason, models already in use (except the controller model) stop working, and I haven’t tried very hard to fix that. When adding units, the error will be something like retrieving environ: creating environ for model “lxd03mod” (64e1ec58-b1bc-446a-8856-47d065cfbeb8): Get “https://10.70.41.228:8443/1.0”: x509: certificate is valid for 127.0.0.1, ::1, not 10.70.41.228 (a rough, untested starting point for digging into this is sketched below)
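
If anyone wants to dig into that second problem, this is roughly where I would start. It is completely untested, the names lxd03 and lxd01-controller are just the ones from this thread, and I don’t know whether updating the cloud definition held by the controller is actually enough to repair the model:

# see what the controller has stored for the cloud backing the broken model
juju show-cloud lxd03 --controller lxd01-controller

# compare that with the certificate the clustered LXD now presents on the endpoint
openssl s_client -connect 10.70.41.228:8443 </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'

# if the stored endpoint/certificate no longer match reality, push an updated
# cloud definition (and possibly an updated credential) back to the controller
juju update-cloud lxd03 -f lxd03.yaml --controller lxd01-controller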

So now my old models are broken, the containers are lost in LXD, but hey, my controllers are in HA :slight_smile:


@tlm do you think that I could “empty” one LXD host (B) first by migrating its containers to the first one (A), then form a LXD cluster, and then repeat the same process for (C), so that I end up with a LXD cluster (A + B + C) without losing my running containers?
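
I’m imagining something along these lines for emptying host B first (untested; lxdA, lxdhostA and c1 are placeholder names):

# on lxdhostB: add host A as a remote, then migrate each container over
lxc remote add lxdA https://lxdhostA:8443
lxc stop c1
lxc move c1 lxdA:c1

# once B is empty, join it to the cluster on A, then repeat the same for C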