How to debug bootstrap/machine failures

tmihoc · 8 August 2022 09:32

This guide will show you how to diagnose and fix issues with bootstrapping and starting new machines.

Juju’s bootstrapping process can be broken down into several steps:

1. Provision resources/a machine M from the relevant cloud
2. Install the Juju agent jujud on machine M
3. Poll the newly created instance for an IP address, and attempt to connect to M
4. Run the machine configuration script for M, which e.g. installs relevant packages

The output of juju bootstrap tells you which step you have reached. If the failure is at step 1, the issue is most likely with your cloud provider or your cloud configuration; see the guides here.

Otherwise, we need to connect to the machine and look at the logs to find out what’s gone wrong.
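Since the bootstrap output shows how far the process got, it can help to save it to a file and look for the last progress line. The following is a minimal sketch, not part of Juju: the helper name last_bootstrap_marker and the file name bootstrap.log are hypothetical, the patterns are simply the progress lines quoted in this guide, and other Juju versions may word them differently.

```shell
# Hypothetical helper: print the last known progress line found in a saved
# copy of the `juju bootstrap` output. The patterns are the progress lines
# quoted in this guide (assumption: your Juju version words them the same way).
last_bootstrap_marker() {
    grep -E 'Launching controller instance|Creating k8s resources|Attempting to connect to|Connected to|Running machine configuration script' "$1" | tail -n 1
}

# To capture the output in the first place (assumed invocation):
#   juju bootstrap [cloud] [controller-name] 2>&1 | tee bootstrap.log

# Demonstration on sample output (the IP address is made up).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
Launching controller instance(s) on localhost/localhost...
Attempting to connect to 10.0.0.5:22
Connected to 10.0.0.5
Running machine configuration script...
EOF
last_bootstrap_marker "$sample_log"
rm -f "$sample_log"
```

If the last line printed is the "Attempting to connect" line, the machine was never reached over the network; if it is the "Running machine configuration script" line, the machine is reachable and ssh should work.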

Connect to the machine

Via ssh

The easiest way to connect to the machine is via ssh. This works if Juju was able to connect to the controller machine, in which case you will see the line


Connected to [ip-address]

in your juju bootstrap output.

A common type of failure here is when the terminal hangs on the line


Running machine configuration script...

The machine configuration script should take less than 10 minutes; anything longer is a sign that something has gone wrong.

Luckily, the machine is already reachable at this step, so we can directly ssh into it to find out what’s happening. Copy the IP address that Juju connected to above, and run


ssh ubuntu@[ip-address] -i [juju-data-dir]/ssh/juju_id_rsa

Here, [juju-data-dir] defaults to ~/.local/share/juju. If you have set the JUJU_DATA environment variable, use its value instead.
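The choice between the default directory and JUJU_DATA can be left to the shell. This is a small sketch under the assumptions stated in the text above; the helper name juju_ssh_command and the address 10.0.0.5 are made up for illustration.

```shell
# Resolve the Juju data directory as the text describes: use JUJU_DATA if it
# is set and non-empty, otherwise fall back to ~/.local/share/juju.
juju_data_dir="${JUJU_DATA:-$HOME/.local/share/juju}"
key_file="$juju_data_dir/ssh/juju_id_rsa"

# Hypothetical helper: print the ssh command for a given IP address.
juju_ssh_command() {
    printf 'ssh ubuntu@%s -i %s\n' "$1" "$key_file"
}

# 10.0.0.5 is a placeholder; substitute the address from "Connected to ...".
juju_ssh_command 10.0.0.5
```

The helper only prints the command so that you can check the key path before running it.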

See here for a more in-depth guide on using SSH to connect to a machine.

Via the cloud provider

If Juju wasn’t able to connect to your machine’s IP address, then ssh probably won’t be able to either. With this type of failure, you’ll often see your terminal hang after the step


Attempting to connect to [ip-address]:[port]

In this case, we will need to go through the cloud provider to connect to the machine. The process here depends on what cloud you’re using.

LXC / LXD

In the juju bootstrap output, you should see a line like


Launching controller instance(s) on localhost/localhost...

which will be followed by the LXD container name (in the form juju-XXXXXX-0). We can use the lxc command line tool to get a shell inside the machine. Copy the container name, then run


lxc exec [container-name] bash

Now, we should have a shell inside the machine, and can use the steps below to search the logs.
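If the bootstrap output was saved to a file, the container name can be pulled out of it rather than copied by hand. This is only a sketch: the helper name controller_container is hypothetical, the sample output below is made up, and the pattern assumes the XXXXXX part of the name is alphanumeric.

```shell
# Hypothetical helper: print the first LXD container name of the form
# juju-XXXXXX-0 found in saved `juju bootstrap` output.
# Assumption: the XXXXXX part consists of six letters or digits.
controller_container() {
    grep -oE 'juju-[[:alnum:]]{6}-0' "$1" | head -n 1
}

# Demonstration on made-up sample output.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
Launching controller instance(s) on localhost/localhost...
 - juju-1a2b3c-0 (arch=amd64)
EOF
name=$(controller_container "$sample_log")
rm -f "$sample_log"

# Print the command to run, in the same form as used in this guide.
printf 'lxc exec %s bash\n' "$name"
```

If the output was not saved, lxc list also shows the names of the containers on the host.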

Kubernetes

In the juju bootstrap output, you should see a line like


Creating k8s resources for controller [namespace]

where [namespace] is something like controller-foobar. Inside this namespace, Juju will have created a pod called controller-0. We want to access the api-server container in this pod, which we can do with kubectl:


kubectl exec controller-0 -itc api-server -n [namespace] -- bash

(If using MicroK8s, run this command via microk8s kubectl.)
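As with LXD, the namespace can be extracted from saved bootstrap output. The sketch below is an illustration rather than a Juju feature: the helper name controller_namespace is hypothetical, and it assumes the namespace follows the words "for controller", with or without surrounding double quotes. It also prints two standard kubectl commands, get pods and logs, which can be useful when the pod is not healthy enough to open a shell in.

```shell
# Hypothetical helper: print the namespace from the line
# "Creating k8s resources for controller [namespace]" in saved bootstrap
# output. Assumption: the namespace may or may not be wrapped in double quotes.
controller_namespace() {
    sed -n 's/.*Creating k8s resources for controller "\{0,1\}\([^" ]*\).*/\1/p' "$1" | head -n 1
}

# Demonstration on made-up sample output.
sample_log=$(mktemp)
printf '%s\n' 'Creating k8s resources for controller "controller-foobar"' > "$sample_log"
ns=$(controller_namespace "$sample_log")
rm -f "$sample_log"

# Print commands to try (prefix with "microk8s" when using MicroK8s).
printf 'kubectl get pods -n %s\n' "$ns"
printf 'kubectl logs controller-0 -c api-server -n %s\n' "$ns"
printf 'kubectl exec controller-0 -itc api-server -n %s -- bash\n' "$ns"
```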

Examine the logs

Once we have a shell inside the machine, we can run


ls /var/log

which lists all the available logs. The right log to look at depends on the type of failure, but syslog, cloud-init.log and cloud-init-output.log are generally good places to start.
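For a quick first look, the ends of those three logs can be printed in one go. This is a minimal sketch; the helper name show_log_tails is hypothetical, and it assumes the logs live directly under /var/log and are readable by the current user (you may need sudo otherwise).

```shell
# Hypothetical helper: print the last lines of each log named in the text,
# skipping any that are missing or unreadable.
#   $1: log directory (default /var/log)
#   $2: number of lines to show per log (default 20)
show_log_tails() {
    log_dir="${1:-/var/log}"
    for name in syslog cloud-init.log cloud-init-output.log; do
        if [ -r "$log_dir/$name" ]; then
            printf '==> %s <==\n' "$log_dir/$name"
            tail -n "${2:-20}" "$log_dir/$name"
        fi
    done
}

show_log_tails /var/log 5
```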

Some good tools for examining logs are


less [log-file]

which will let you scroll through the log file, and


tail -f [log-file]

which will track updates to the log file.

Errors (especially fatal ones) will often be near the end of a log file. You may also have luck searching your logs for phrases such as “error” or “fail”.
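Both tips can be combined: search for those phrases, then keep only the matches closest to the end of the file. The helper name recent_problems is made up for this sketch, and the sample log lines are invented.

```shell
# Hypothetical helper: show the last few lines of a log that mention "error"
# or "fail" (case-insensitive), each prefixed with its line number.
#   $1: log file
#   $2: maximum number of matches to show (default 20)
recent_problems() {
    grep -inE 'error|fail' "$1" | tail -n "${2:-20}"
}

# Demonstration on an invented log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
starting setup
ERROR: made-up problem
continuing
package install failed
EOF
recent_problems "$sample_log" 10
rm -f "$sample_log"
```

On the machine itself you would call it as, for example, recent_problems /var/log/cloud-init-output.log.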
