This problem was solved last week, and my controllers have been working perfectly since then. The “SecretsManager” problem was easily solved as noted above, but the problem with charms not having a leader was something completely different.
DISCLAIMER: What I’m posting here is NOT a suggested solution for every similar situation; if you have this issue in a production system, get professional advice! I’m just a Juju user and I’m not from Canonical. But maybe it can be helpful to determine if you have the same problem. I don’t suggest this is how to fix it!
The problem could be identified like this:
$ while true ; do juju run --unit neutron-api/leader is-leader ; sleep 30 ; done
False
True
False
True
ERROR could not determine leader for "neutron-api"
False
True
False
$ while true ; do juju run --unit logrotated/leader is-leader ; sleep 30 ; done
False
False
False
False
I was instructed to check the debug log of the controller model. I have three machines there: machine-0, machine-1 and machine-2. The only machine making noise was machine-1 (increasing the log level from WARNING to INFO also helped later on):
$ juju status -m controller
Model       Controller      Cloud/Region         Version  SLA          Timestamp
controller  lxd-controller  localhost/localhost  2.9.32   unsupported  12:44:36Z

Machine  State    DNS          Inst id        Series  AZ  Message
0        started  172.23.1.34  juju-e64711-0  bionic       Running
1        started  172.23.2.56  juju-e64711-1  bionic       Running
2        started  172.23.3.46  juju-e64711-2  bionic       Running
$ juju model-config -m controller logging-config='<root>=INFO;unit=INFO'
$ juju debug-log -m controller
machine-1: 14:32:01 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: influxdb, holder: influxdb/0): invalid lease operation
machine-1: 14:32:01 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 7561fb, lease: glance-simplestreams-sync, holder: glance-simplestreams-sync/0): invalid lease operation
machine-1: 14:32:15 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 7561fb, lease: 7561fb4a-0b8a-4278-8ccf-41c47a2281bf, holder: machine-1): lease already held
machine-1: 14:32:17 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 020f8d, lease: 020f8d98-a6e3-4222-8e46-d35c27e64711, holder: machine-1): lease already held
machine-1: 14:32:32 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 03aeec, lease: 03aeece3-344c-4dda-819f-5351359d5eaf, holder: machine-1): lease already held
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: logrotated, holder: logrotated/72): lease already held
machine-1: 14:32:58 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: logrotated, holder: logrotated/70): lease already held
From here I logged in to every controller and ran the command juju_engine_report. This produces a lot of output, but the interesting parts were in the raft: section.
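If it helps, this is roughly how I got at the reports (assuming you can juju ssh to the controller machines; juju_engine_report is the same introspection helper mentioned above):
$ juju ssh -m controller 0      # repeat for machines 1 and 2
$ juju_engine_report | less     # look for the raft: section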
On machine-0, it is clear that 172.23.3.46:17070 (machine-2) is the leader and that machine-0 considers itself a follower:
raft:
  inputs:
  - clock
  - agent
  - raft-transport
  - state
  - upgrade-steps-flag
  - upgrade-check-flag
  report:
    cluster-config:
      servers:
        "0":
          address: 172.23.1.34:17070
          suffrage: Voter
        "1":
          address: 172.23.2.56:17070
          suffrage: Voter
        "2":
          address: 172.23.3.46:17070
          suffrage: Voter
    index:
      applied: 292534506
      last: 292534506
    last-contact: now
    leader: 172.23.3.46:17070
    state: Follower
  start-count: 1
  started: "2022-07-12 14:16:04"
  state: started
On machine-1, raft doesn’t seem to be able to start:
raft:
  inputs:
  - clock
  - agent
  - raft-transport
  - state
  - upgrade-steps-flag
  - upgrade-check-flag
  state: starting
On machine-2, it is clear that it considers itself the leader:
raft:
  inputs:
  - clock
  - agent
  - raft-transport
  - state
  - upgrade-steps-flag
  - upgrade-check-flag
  report:
    cluster-config:
      servers:
        "0":
          address: 172.23.1.34:17070
          suffrage: Voter
        "1":
          address: 172.23.2.56:17070
          suffrage: Voter
        "2":
          address: 172.23.3.46:17070
          suffrage: Voter
    index:
      applied: 292538871
      last: 292538871
    leader: 172.23.3.46:17070
    state: Leader
  start-count: 2
  started: "2022-07-12 14:13:19"
  state: started
I then figured out that I could use strings on the raft database to find out more (but this is not a scientifically proven method!):
On machine-0, the IP of machine-2 is shown, which seems reasonable given the previous output:
$ sudo strings /var/lib/juju/raft/logs | grep :17070
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
On machine-1, it shows its own address, but the others don’t consider this machine the leader!
$ sudo strings /var/lib/juju/raft/logs | grep :17070
SLastVoteCand172.23.2.56:17070LastVoteTerm
SLastVoteCand172.23.2.56:17070LastVoteTerm
SLastVoteCand172.23.2.56:17070LastVoteTerm
On machine-2, it shows its own address, which machine-0 and machine-2 agree on:
$ sudo strings /var/lib/juju/raft/logs | grep :17070
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
To get out of this mess, I stopped juju-db.service and jujud-machine-N.service using systemctl stop <service> on all three controllers. Make sure you stop the machine service too and that all juju and mongodb processes stay gone; the machine service will respawn mongodb if it isn’t stopped.
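For reference, the stop sequence on each controller looked roughly like this (N is the machine number, so 0, 1 or 2; the process check at the end is just my own sanity check):
$ sudo systemctl stop jujud-machine-N.service      # stop the machine agent first
$ sudo systemctl stop juju-db.service              # then MongoDB
$ ps -ef | grep -E 'jujud|mongod' | grep -v grep   # should print nothing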
Now that all controllers were down, I moved away the file /var/lib/juju/raft/logs on machine-0 and machine-1, but kept it as-is on machine-2. Since machine-2 was considered the leader, I restarted the Juju services on that controller first. After a short while, I restarted the other two controllers. After this, all three controllers considered machine-2 the leader and things worked again! juju debug-log -m controller shows no errors anymore.
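Roughly, those recovery steps were (the .bak name is my own choice; per the note above, the machine agent brings juju-db back up on its own):
$ sudo mv /var/lib/juju/raft/logs /var/lib/juju/raft/logs.bak   # on machine-0 and machine-1 only
$ sudo systemctl start jujud-machine-2.service                  # on machine-2 (the raft leader) first
$ sudo systemctl start jujud-machine-N.service                  # on machine-0 and machine-1 a short while later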
My only theory on why this happened in the first place is that the LXD storage hosting machine-1 ran low on (or out of) space some months ago.
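If you want to check for that yourself, something like this on the LXD host shows how much room the pool and the container have (the pool name default is an assumption; the container name is the Inst id from juju status above):
$ lxc storage info default                        # free space in the storage pool
$ lxc exec juju-e64711-1 -- df -h /var/lib/juju   # free space inside the machine-1 container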
By the way, you do your controller backups regularly, don’t you? https://juju.is/docs/olm/controller-backups
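A minimal sketch with Juju 2.9’s built-in command (run from a client that can reach the controller; see the linked docs for the full backup and restore procedure):
$ juju create-backup -m controller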
Many thanks to Canonical Support, @jameinel and @manadart for very professional help.