Troubleshooting in the age of Dqlite

Here are some tips for diagnosing issues in Juju 4.0+ with Dqlite. These steps can be useful, for example, when bootstrap fails with a timeout while making the initial connection to the API.

  1. When bootstrapping, use the --keep-broken flag. If bootstrap fails, Juju keeps the broken controller instead of cleaning it up, so you can access it for debugging.

  2. To get a shell inside the broken controller (these steps assume the controller runs in an LXD container, as on the localhost cloud):

    • Run lxc list to find the name of the container.
    • Get a console with lxc exec <container name> bash.
  3. Inside the controller, run:

    # Install some needed tools
    sudo apt install -y go-dqlite
    sudo snap install yq
    # Create the .cert and .key files needed to connect to dqlite
    sudo cat /var/lib/juju/agents/machine-0/agent.conf | yq '.controllercert' | xargs -I% echo % > dqlite.cert
    sudo cat /var/lib/juju/agents/machine-0/agent.conf | yq '.controllerkey' | xargs -I% echo % > dqlite.key
    # Connect to dqlite
    sudo dqlite -s file:///var/lib/juju/dqlite/cluster.yaml -c ./dqlite.cert -k ./dqlite.key controller
    
  4. Root about in the controller DB as you please. For example, you can list the tables that have been created:

     dqlite> SELECT name FROM sqlite_schema WHERE type = 'table' AND name NOT LIKE 'sqlite_%';
    
  5. You can also use Juju's introspection tools to see the state of the worker tree. Inside the lxc exec session, run:

    source /etc/profile.d/juju-introspection.sh
    juju_engine_report | less
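If you go through steps 2 and 3 often, they can be wrapped in a few shell functions. This is a minimal sketch, not something shipped with Juju: the function names `juju_containers`, `controller_shell` and `dqlite_connect` are hypothetical, and it assumes an LXD (localhost) cloud, a single controller on machine 0, and that Juju's container names start with `juju-`. Source it on the host for the first two functions, and inside the controller (after installing go-dqlite and yq as in step 3) for the third.

```shell
# Hypothetical helpers wrapping steps 2 and 3 above. Source this file
# rather than executing it, so the functions land in your current shell.

# On the host: list LXD containers whose names start with "juju-".
# ASSUMPTION: Juju names its LXD machines "juju-..."; if nothing is
# printed, check the full `lxc list` output by hand.
juju_containers() {
    lxc list --format csv -c n | grep '^juju-'
}

# On the host: open a bash shell inside the named container.
controller_shell() {
    if [ "$#" -ne 1 ]; then
        echo "usage: controller_shell <container name>" >&2
        return 2
    fi
    lxc exec "$1" -- bash
}

# Inside the controller: write the cert/key files to the current directory
# (same pipeline as step 3) and open the dqlite shell on the "controller" DB.
# ASSUMPTIONS: single controller on machine 0; go-dqlite and yq installed.
dqlite_connect() {
    local conf=/var/lib/juju/agents/machine-0/agent.conf
    sudo cat "$conf" | yq '.controllercert' | xargs -I% echo % > dqlite.cert
    sudo cat "$conf" | yq '.controllerkey' | xargs -I% echo % > dqlite.key
    sudo dqlite -s file:///var/lib/juju/dqlite/cluster.yaml \
        -c ./dqlite.cert -k ./dqlite.key controller
}
```

For example, run `juju_containers` on the host, then `controller_shell` with the name it prints, then `dqlite_connect` inside the container. The argument check in `controller_shell` returns early instead of calling `lxc exec` with an empty name.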
    
All credit goes to @manadart for the original post content.

Some caveats:

  • This is for a single controller. In HA, Dqlite is bound to a local cloud address rather than the loopback address, and it uses TLS, so you have to supply certificates to connect.
  • Introspection requires jujud to be running, because the tools connect to its introspection socket. In the case described here, the API was down, but jujud itself was up.
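Before reaching for the introspection functions, it can help to confirm that jujud is actually running. A minimal sketch, assuming the agent process is named `jujud` and that `pgrep` (from procps, present on stock Ubuntu images) is available; the function name `jujud_running` is hypothetical:

```shell
# Hypothetical helper: succeeds (exit status 0) if a process named exactly
# "jujud" is running on this machine, and fails otherwise.
jujud_running() {
    pgrep -x jujud >/dev/null
}

# Example use inside the controller, before sourcing the introspection script:
if jujud_running; then
    echo "jujud is up; its introspection socket should be available"
else
    echo "jujud is not running; introspection will not work" >&2
fi
```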