This problem was solved last week, and my controllers have been working perfectly since then. The “SecretsManager” problem was easily solved as noted above, but the problem with charms not having a leader was something completely different.
DISCLAIMER: What I’m posting here is NOT a suggested solution for every similar situation; if you have this issue in a production system, get professional advice! I’m just a Juju user and I’m not from Canonical. But maybe it can be helpful to determine if you have the same problem. I don’t suggest this is how to fix it!
The problem could be identified like this:
$ while true ; do juju run --unit neutron-api/leader is-leader ; sleep 30 ; done
False
True
False
True
ERROR could not determine leader for "neutron-api"
False
True
False
$ while true ; do juju run --unit logrotated/leader is-leader ; sleep 30 ; done
False
False
False
False
I was instructed to check the debug log of the controller model. I have three machines there: machine-0, machine-1 and machine-2. The only machine making noise was machine-1 (increasing the log level from WARNING to INFO also helped later on):
$ juju status -m controller
Model       Controller      Cloud/Region         Version  SLA          Timestamp
controller  lxd-controller  localhost/localhost  2.9.32   unsupported  12:44:36Z

Machine  State    DNS          Inst id        Series  AZ  Message
0        started  172.23.1.34  juju-e64711-0  bionic       Running
1        started  172.23.2.56  juju-e64711-1  bionic       Running
2        started  172.23.3.46  juju-e64711-2  bionic       Running
$ juju model-config -m controller logging-config='<root>=INFO;unit=INFO'
$ juju debug-log -m controller
machine-1: 14:32:01 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: influxdb, holder: influxdb/0): invalid lease operation
machine-1: 14:32:01 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 7561fb, lease: glance-simplestreams-sync, holder: glance-simplestreams-sync/0): invalid lease operation
machine-1: 14:32:15 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 7561fb, lease: 7561fb4a-0b8a-4278-8ccf-41c47a2281bf, holder: machine-1): lease already held
machine-1: 14:32:17 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 020f8d, lease: 020f8d98-a6e3-4222-8e46-d35c27e64711, holder: machine-1): lease already held
machine-1: 14:32:32 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 03aeec, lease: 03aeece3-344c-4dda-819f-5351359d5eaf, holder: machine-1): lease already held
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: logrotated, holder: logrotated/72): lease already held
machine-1: 14:32:58 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: logrotated, holder: logrotated/70): lease already held
From here I logged in to every controller and ran the command juju_engine_report. This produces a lot of output, but the interesting parts were in the raft: section.
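If it helps, this is roughly how I got at the reports (assuming you can juju ssh to the controller machines; juju_engine_report is the same introspection helper mentioned above):
$ juju ssh -m controller 0      # repeat for machines 1 and 2
$ juju_engine_report | less     # look for the raft: section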
On machine-0, it is clear that 172.23.3.46:17070 (machine-2) is the leader and that machine-0 considers itself a follower:
raft:
  inputs:
  - clock
  - agent
  - raft-transport
  - state
  - upgrade-steps-flag
  - upgrade-check-flag
  report:
    cluster-config:
      servers:
        "0":
          address: 172.23.1.34:17070
          suffrage: Voter
        "1":
          address: 172.23.2.56:17070
          suffrage: Voter
        "2":
          address: 172.23.3.46:17070
          suffrage: Voter
    index:
      applied: 292534506
      last: 292534506
    last-contact: now
    leader: 172.23.3.46:17070
    state: Follower
  start-count: 1
  started: "2022-07-12 14:16:04"
  state: started
On machine-1, raft doesn’t seem to be able to start:
raft:
  inputs:
  - clock
  - agent
  - raft-transport
  - state
  - upgrade-steps-flag
  - upgrade-check-flag
  state: starting
On machine-2, it is clear that it considers itself the leader:
raft:
  inputs:
  - clock
  - agent
  - raft-transport
  - state
  - upgrade-steps-flag
  - upgrade-check-flag
  report:
    cluster-config:
      servers:
        "0":
          address: 172.23.1.34:17070
          suffrage: Voter
        "1":
          address: 172.23.2.56:17070
          suffrage: Voter
        "2":
          address: 172.23.3.46:17070
          suffrage: Voter
    index:
      applied: 292538871
      last: 292538871
    leader: 172.23.3.46:17070
    state: Leader
  start-count: 2
  started: "2022-07-12 14:13:19"
  state: started
I then figured out that I could use strings on the raft database to find out more (but this is not a scientifically proven method!):
On machine-0, the IP of machine-2 is shown, which seems reasonable given the previous output:
$ sudo strings /var/lib/juju/raft/logs | grep :17070
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
On machine-1, it shows its own address, but the others don’t consider this machine the leader!
$ sudo strings /var/lib/juju/raft/logs | grep :17070
SLastVoteCand172.23.2.56:17070LastVoteTerm
SLastVoteCand172.23.2.56:17070LastVoteTerm
SLastVoteCand172.23.2.56:17070LastVoteTerm
On machine-2, it shows its own address, which machine-0 and machine-2 agree on:
$ sudo strings /var/lib/juju/raft/logs | grep :17070
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
To get out of this mess, I stopped juju-db.service and jujud-machine-N.service using systemctl stop <service> on all three controllers. Make sure you stop the machine service too and that all juju and mongodb processes stay gone; the machine service will respawn mongodb if it isn’t stopped.
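For reference, the stop sequence on each controller looked roughly like this (N is the machine number, so 0, 1 or 2; the process check at the end is just my own sanity check):
$ sudo systemctl stop jujud-machine-N.service      # stop the machine agent first
$ sudo systemctl stop juju-db.service              # then MongoDB
$ ps -ef | grep -E 'jujud|mongod' | grep -v grep   # should print nothing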
Now that all controllers were down, I moved away the file /var/lib/juju/raft/logs on machine-0 and machine-1, but kept it as-is on machine-2. Since machine-2 was considered the leader, I restarted the Juju services on that controller first. After a short while, I restarted the other two controllers. After this, all three controllers considered machine-2 the leader and things worked again! juju debug-log -m controller shows no errors anymore.
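Roughly, those recovery steps were (the .bak name is my own choice; per the note above, the machine agent brings juju-db back up on its own):
$ sudo mv /var/lib/juju/raft/logs /var/lib/juju/raft/logs.bak   # on machine-0 and machine-1 only
$ sudo systemctl start jujud-machine-2.service                  # on machine-2 (the raft leader) first
$ sudo systemctl start jujud-machine-N.service                  # on machine-0 and machine-1 a short while later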
My only theory on why this happened in the first place is that the LXD storage hosting machine-1 ran low on (or out of) space some months ago.
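If you want to check for that yourself, something like this on the LXD host shows how much room the pool and the container have (the pool name default is an assumption; the container name is the Inst id from juju status above):
$ lxc storage info default                        # free space in the storage pool
$ lxc exec juju-e64711-1 -- df -h /var/lib/juju   # free space inside the machine-1 container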
By the way, you do your controller backups regularly, don’t you? https://juju.is/docs/olm/controller-backups
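A minimal sketch with Juju 2.9’s built-in command (run from a client that can reach the controller; see the linked docs for the full backup and restore procedure):
$ juju create-backup -m controller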
Many thanks to Canonical Support, @jameinel and @manadart for very professional help.