Using cloud-init to create a VRF

hey juju people

I have a question about Juju and VRFs. I currently have a cloud-init script that configures a VRF for the controller subnet. The VRF setup works as expected in my lab, but when it is applied through cloud-init I hit a somewhat expected failure. Let me first provide the context:

I have this setup with my VRF:

' 
'                     +-------------+                                                  +-----------------+
'                     |             |                                                  |                 |
'                     |             |                                                  |                 |
'                     |             |                                                  |                 |
'                     |             |                                                  |                 |
'                     |  controller |                                                  | node 0          |
'                     |             |                                                  |   +-----------  |
'                     |             |                                                  |   |ip vrf exec jujud
'                     |             |                                                  |   |          |  |
'                     |             |                                                  | +-+----------+  |
'                     |             |                                                  | |vrf: mgmt      |
'                     +-----+-------+                                                  +-+----+---------++
'                           |                                                            +--+-+         |
'                           |                                                               |           |
'                           |                                                               |           |
'                           |                                 vlan 32                       |           |
'                   +-------v--------------------------------------------------------------->-----------+-----------+
'                   +-----------------------------------------------------------------------------------+-----------+
'                                                                                                       |
'                                                             vlan 33                                   |
'                   +----------------------------------------------------------------------------------->-----------+
'                   +-----------------------------------------------------------------------------------------------+

I’ve tested this in my lab after juju add-machine and it works fine; it even manages to discover and map spaces correctly. The problem I have is doing this setup in a cloud-init script.

Currently my script will do these steps during machine add (juju add-machine ssh:ubuntu@${HOST}):

  1. Manipulate the netplan, adding a VRF
  2. Edit the jujud service file
  3. Edit the sshd service file (both to run on the VRF)
  4. netplan apply
  5. systemctl daemon-reload and restart the jujud/sshd units <- this is where I hit the problem
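
Steps 2 and 3 come down to prefixing each unit's ExecStart command with `ip vrf exec <vrf>`. A minimal sketch of that rewrite as a pure string transformation follows. The function name `wrap_exec_start` and the sample unit text are illustrative assumptions, not part of Juju or systemd, and the sketch does not handle ExecStart prefixes such as `-` or `@`.

```python
import re

# Same ip path that the script in this thread uses.
IP_BINARY = "/bin/ip"


def wrap_exec_start(unit_text, vrf_name):
    """Return unit_text with every ExecStart= command run inside the VRF.

    Hypothetical helper for illustration only.
    """
    prefix = f"{IP_BINARY} vrf exec {vrf_name} "

    def _wrap(match):
        command = match.group("cmd")
        # Leave already-wrapped commands alone so a second run is a no-op.
        if command.startswith(prefix):
            return match.group(0)
        return f"ExecStart={prefix}{command}"

    return re.sub(r"^ExecStart=(?P<cmd>.+)$", _wrap, unit_text, flags=re.MULTILINE)


# Illustrative unit text, not copied from a real machine.
sample_unit = "[Service]\nExecStart=/usr/sbin/sshd -D $SSHD_OPTS\nRestart=on-failure\n"
wrapped = wrap_exec_start(sample_unit, "mgmt")
print(wrapped)
```

Keeping the rewrite idempotent matters here because cloud-init may run the same script more than once (for example once under MAAS and again under Juju).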

As you might already suspect at this point, step 5 causes an error:

  remove with:
  ssh-keygen -f "/home/ubuntu/.ssh/known_hosts" -R "10.10.32.24"
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
UpdateHostkeys is disabled because the host key is not trusted.
client_loop: send disconnect: Connection reset by peer
ERROR provisioning failed, removing machine 8: subprocess encountered error code 255
ERROR error cleaning up machine: <nil>
ERROR subprocess encountered error code 255

Is there some way to prevent this with manual machines? This error is obviously expected, as the controller thinks the manual machine add failed.

Thanks for the help, Peter

Hi @pjds,

Could we please see the cloud-init script you are using?

Apologies @hpidcock, I got sidetracked with other things. I will have to dig that out.

Hey @hpidcock, here’s the cloud-init:

#cloud-config
write_files:
- path: /tmp/setup-vrfs.py
  permissions: '0744'
  owner: root:root
  content: |
    #!/usr/bin/env python3
    import argparse
    import yaml
    import json
    import pathlib
    import ipaddress
    import re
    import subprocess


    def openfile_and_read(path):
        with open(path, "r") as fh:
            return fh.read()


    def main(args):
        netplan = {}

        if args.debug:
            netplan_configdir = pathlib.Path(
                "/tmp/netplan_configdir"
            )
        else:
            netplan_configdir = pathlib.Path("/etc/netplan")
        configs = list(netplan_configdir.iterdir())

        target_netplan = None
        if len(configs) > 1:
            print("WARNING: More than one netplan. Picking the first one and merging.")
        target_netplan = configs[0]

        netplan = yaml.safe_load(openfile_and_read(target_netplan))
        target_mgmt_cidr = ipaddress.IPv4Network(args.target_cidr)
        target_nic = None
        target_gateway = None
        vrf_name = "mgmt"

        for interface, nic_def in netplan["network"]["ethernets"].items():
            if (
                "addresses" in nic_def
                and ipaddress.IPv4Address(nic_def["addresses"][0].split("/")[0])
                in target_mgmt_cidr
            ):
                target_nic = interface

        routes = json.loads(
            subprocess.check_output(["ip", "-j", "route", "show", "default"]).decode()
        )
        if len(routes) > 1:
            print("WARNING: More than one route available. Heuristic may fail.")
        target_gateway = routes[0]["gateway"]
        vrf = {
            "vrfs": {
                vrf_name: {
                    "table": 21,
                    "interfaces": [target_nic],
                    "routes": [
                        {
                            "to": "default",
                            "via": target_gateway,
                        }
                    ],
                    "routing-policy": [
                        {
                            "from": netplan["network"]["ethernets"][target_nic][
                                "addresses"
                            ][0],
                        }
                    ],
                }
            }
        }

        netplan['network'].update(vrf)

        jujud_svcfile = None
        if args.debug:
            systemd_filecollection = pathlib.Path(
                "/tmp/systemd_configd"
            )
        else:
            systemd_filecollection = pathlib.Path("/etc/systemd/system")

        # TODO: This whole section should be function calls.
        jujud_svcfile = list(
            filter(
                lambda path: re.match(r"jujud-machine-[0-9]{1,}\.service", path.name),
                systemd_filecollection.iterdir(),
            )
        )[0]

        sshd_svcfile = list(
            filter(
                lambda path: re.match(r"sshd\.service", path.name),
                systemd_filecollection.iterdir(),
            )
        )[0]
        sshd_svcfile_content = openfile_and_read(sshd_svcfile)
        jujud_svcfile_content = openfile_and_read(jujud_svcfile)

        m = re.search(
            r"ExecStart=(?P<binary>/usr/sbin/sshd {1}-D {1}\$SSHD_OPTS)",
            sshd_svcfile_content,
            re.MULTILINE,
        )
        modified_sshd_svcfile = (
            sshd_svcfile_content[: m.start()]
            + f"ExecStart=/bin/ip vrf exec {vrf_name} {m.group('binary')}"
            + sshd_svcfile_content[m.end() :]
        )

        m = re.search(
            r"ExecStart=(?P<script>/etc/systemd/system/jujud-machine-[0-9]{1,}-exec-start.sh)",
            jujud_svcfile_content,
            re.MULTILINE,
        )
        if m is None:
            print(
                "WARNING: Juju not found, the script is probably running during MAAS setup, not juju setup. Exiting gracefully"
            )
            exit(0)

        modified_jujud_svcfile = (
            jujud_svcfile_content[: m.start()]
            + f"ExecStart=/bin/ip vrf exec {vrf_name} {m.group('script')}"
            + jujud_svcfile_content[m.end() :]
        )

        # Path.write_text closes the file handle, unlike a bare open().write().
        if args.debug:
            pathlib.Path("./test.netplan.yaml").write_text(yaml.safe_dump(netplan))
        else:
            target_netplan.write_text(yaml.safe_dump(netplan))
            subprocess.check_call(["sudo", "netplan", "apply"])

        if args.debug:
            pathlib.Path("./test.svcfile.jujud.service").write_text(modified_jujud_svcfile)
            pathlib.Path("./test.svcfile.sshd.service").write_text(modified_sshd_svcfile)
        else:
            sshd_svcfile.write_text(modified_sshd_svcfile)
            jujud_svcfile.write_text(modified_jujud_svcfile)

        subprocess.check_call("sudo systemctl daemon-reload".split())
        for service in ["sshd", "jujud-*"]:
            subprocess.check_call(f"sudo systemctl restart {service}".split())


    parser = argparse.ArgumentParser("VRFerizer")

    parser.add_argument("--debug", action="store_true")
    parser.add_argument(
        "--cidr",
        dest="target_cidr",
        help="The CIDR to use when searching for NICs and routes.",
        required=True,
    )


    if __name__ == "__main__":
        args = parser.parse_args()
        main(args)


postruncmd:
- ['/tmp/setup-vrfs.py', '--cidr', '10.10.32.0/24']
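
To make the netplan change easier to see, here is a minimal sketch of the NIC selection and VRF stanza from the script above, written as a pure function so it can be tried without touching /etc/netplan. The function name `build_vrf_stanza`, the interface names, and the gateway address are assumptions for illustration; the table number and key names mirror the script. Unlike the script, it stops at the first matching NIC and raises if none matches.

```python
import ipaddress


def build_vrf_stanza(ethernets, cidr, gateway, vrf_name="mgmt", table=21):
    """Build the netplan `vrfs` mapping for the first NIC addressed inside cidr.

    Hypothetical helper for illustration only.
    """
    network = ipaddress.IPv4Network(cidr)
    for name, nic in ethernets.items():
        addresses = nic.get("addresses", [])
        if not addresses:
            continue
        # ip_interface accepts "10.10.32.24/24" and exposes the bare address.
        if ipaddress.ip_interface(addresses[0]).ip in network:
            return {
                "vrfs": {
                    vrf_name: {
                        "table": table,
                        "interfaces": [name],
                        "routes": [{"to": "default", "via": gateway}],
                        "routing-policy": [{"from": addresses[0]}],
                    }
                }
            }
    # Fail loudly instead of emitting a stanza with interfaces: [None].
    raise ValueError(f"no interface has an address inside {cidr}")


# Illustrative data; the interface names and gateway are made up.
sample_ethernets = {
    "eno1": {"addresses": ["10.10.33.24/24"]},
    "eno2": {"addresses": ["10.10.32.24/24"]},
    "eno3": {"dhcp4": True},
}
stanza = build_vrf_stanza(sample_ethernets, "10.10.32.0/24", "10.10.32.1")
print(stanza)
```

The returned mapping can be merged into the parsed netplan with `netplan["network"].update(stanza)`, the same way the script does.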

Actually, I found that when this is supplied only at bootstrap, Juju won’t write out the script at the right time. So I needed to supply the cloud-init to MAAS first (so the script is written out) and then pass the cloud-init again at bootstrap.

I raised some issues with regards to this post:

Thanks 🙂