Juju units die with "unexpected error: unknown object type "SecretsManager" (not implemented)"

Dear community,

Two days ago I upgraded our production controller from 2.9.28 to 2.9.32.

Since then, charms have leadership issues or won’t start. An example here is mysql/0, a single-unit application, directly after start:

unit-mysql-0: 09:27:21 INFO juju Starting unit workers for "mysql/0"
unit-mysql-0: 09:27:21 INFO juju.agent.setup setting logging config to "<root>=INFO;unit=INFO;"
unit-mysql-0: 09:27:21 INFO juju.worker.apicaller [9fe214] "unit-mysql-0" successfully connected to "172.23.1.34:17070"
unit-mysql-0: 09:27:21 INFO juju.worker.migrationminion migration phase is now: NONE
unit-mysql-0: 09:27:21 INFO juju.worker.logger logger worker started
unit-mysql-0: 09:27:21 INFO juju.worker.upgrader no waiter, upgrader is done
machine-1-lxd-4: 09:27:21 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-0
unit-mysql-0: 09:27:21 INFO juju.worker.uniter unit "mysql/0" shutting down: failed to initialize uniter for "unit-mysql-0": cannot create relation state tracker: "mysql/0" is not leader of "mysql"
unit-mysql-0: 09:27:21 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-mysql-0": cannot create relation state tracker: "mysql/0" is not leader of "mysql"
machine-1-lxd-4: 09:27:25 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-0
unit-mysql-0: 09:27:25 INFO juju.worker.uniter unit "mysql/0" started
unit-mysql-0: 09:27:25 INFO juju.worker.uniter hooks are retried false
unit-mysql-0: 09:27:25 INFO juju.worker.uniter unit "mysql/0" shutting down: catacomb 0xc0006c5200 is dying
unit-mysql-0: 09:27:25 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: unknown object type "SecretsManager" (not implemented)
machine-1-lxd-4: 09:27:29 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-0
unit-mysql-0: 09:27:30 INFO juju.worker.uniter unit "mysql/0" started
unit-mysql-0: 09:27:30 INFO juju.worker.uniter hooks are retried false
unit-mysql-0: 09:27:30 INFO juju.worker.uniter unit "mysql/0" shutting down: catacomb 0xc00172a480 is dying
unit-mysql-0: 09:27:30 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: unknown object type "SecretsManager" (not implemented)
machine-1-lxd-4: 09:27:35 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-0
unit-mysql-0: 09:27:35 INFO juju.worker.uniter unit "mysql/0" started
unit-mysql-0: 09:27:35 INFO juju.worker.uniter hooks are retried false
unit-mysql-0: 09:27:35 INFO juju.worker.uniter unit "mysql/0" shutting down: catacomb 0xc0001a0d80 is dying
unit-mysql-0: 09:27:35 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: unknown object type "SecretsManager" (not implemented)

That “catacomb is dying” followed by a restart then repeats forever, roughly every minute after that.

Another example, from another single-unit charm:

unit-glance-0: 10:01:34 INFO juju.worker.uniter unit "glance/0" started
unit-glance-0: 10:01:34 INFO juju.worker.uniter hooks are retried false
unit-glance-0: 10:01:34 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-glance-0: 10:01:34 INFO juju.worker.uniter found queued "leader-elected" hook
unit-glance-0: 10:01:34 INFO juju.worker.uniter.operation unit is no longer the leader; skipping "leader-elected" execution
unit-glance-0: 10:01:34 INFO juju.worker.uniter.operation skipped "leader-settings-changed" hook (missing)
unit-glance-0: 10:02:32 INFO juju.worker.uniter found queued "leader-elected" hook
unit-glance-0: 10:02:32 WARNING juju.worker.uniter.resolver executor lock acquisition cancelled
unit-glance-0: 10:02:32 INFO juju.worker.uniter unit "glance/0" shutting down: (re)starting watcher: catacomb 0xc000ce1200 is dying
unit-glance-0: 10:02:32 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: unknown object type "SecretsManager" (not implemented)

Some units (but far from all) seem to behave better after the agent has been shut down for more than 60 seconds (a trick I found in an older post that has helped me before), so basically doing this on the problematic unit:

# systemctl stop jujud-machine-1-lxd-4.service ; sleep 70 ; systemctl start jujud-machine-1-lxd-4.service

I have not yet upgraded our two OpenStack Juju models from 2.9.28 to 2.9.32 since the system is behaving so strangely.

What could be related: the controller upgrade didn’t go as smoothly as usual. On one of the machines the upgrade retried for 20 minutes or so, but in the end the model reports the new version, the controllers have restarted their processes, and the version numbers have been bumped. Then suddenly all was OK:

Model       Controller      Cloud/Region         Version  SLA          Timestamp
controller  lxd-controller  localhost/localhost  2.9.32   unsupported  07:36:59Z

Machine  State    DNS          Inst id        Series  AZ  Message
0        started  172.23.1.34  juju-e64711-0  bionic      Running
1        started  172.23.2.56  juju-e64711-1  bionic      Running
2        started  172.23.3.46  juju-e64711-2  bionic      Running

We have a proper Juju backup (from juju create-backup), but I’m hoping there’s a way to fix this without such drastic measures.

Thanks in advance!


@erik-lonroth has confirmed that he also gets this error message after upgrading to 2.9.32, but without the leadership problems that I have. So I guess the main problem is that the leadership functionality is broken somehow.

An example here is ceph-mon/0, which seems to have a problem being elected leader, then crashes, then restarts:

unit-ceph-mon-0: 09:17:26 INFO unit.ceph-mon/0.juju-log Updating status.
unit-ceph-mon-0: 09:17:28 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)
unit-ceph-mon-0: 09:33:21 INFO unit.ceph-mon/0.juju-log Updating status.
unit-ceph-mon-0: 09:33:22 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)
unit-ceph-mon-0: 09:36:38 INFO juju.worker.uniter found queued "leader-elected" hook
unit-ceph-mon-0: 09:36:38 WARNING juju.worker.uniter.resolver executor lock acquisition cancelled
unit-ceph-mon-0: 09:36:38 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: (re)starting watcher: catacomb 0xc001f56900 is dying
unit-ceph-mon-0: 09:36:38 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: unknown object type "SecretsManager" (not implemented)
unit-ceph-mon-0: 09:36:41 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:36:41 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:36:44 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:36:44 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:36:48 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:36:48 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:36:53 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:36:53 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:37:00 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:37:00 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:37:07 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:37:07 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-ceph-mon-0": cannot create relation state tracker: "ceph-mon/0" is not leader of "ceph-mon"
unit-ceph-mon-0: 09:37:17 INFO juju.worker.uniter unit "ceph-mon/0" started

If I find a solution, I’ll post it here.

The SecretsManager error seems like something that would be a model version vs. controller version incompatibility. (We shouldn’t have those, as they should stay compatible, but that is the sort of mistake that could be made.)

“Not the leader” is reasonable if there is another unit that could be leader, but if there is only one unit that does seem like a problem. And “cannot create relation state tracker” seems a bigger issue, as it looks like the uniter thinks it is the leader, but then the controller turns around and says that it isn’t.

For the former, it seems like juju upgrade-model should fix it. But it would be good to submit two bugs: one for the SecretsManager incompatibility and the other for “cannot create relation state tracker”.

I submitted:

and:

The former seems like something that we could investigate on our own; for the latter it feels like we need a bit more information from you. Are there other things failing (e.g. anything about leadership itself not initializing)? How many units are there, etc.?

Can you also clarify what you mean by “charms … won’t start”?

Hi @jameinel, thanks for your response!

You were right: it turned out that upgrading the models as well, from 2.9.28 to 2.9.32, got rid of that SecretsManager / catacomb-is-dying problem. The only reason I didn’t move on to upgrade the models right away was that the controller upgrade behaved strangely (it took a long time). With everything in a questionable state I was afraid model upgrades could make things even worse.
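For reference, the model upgrades themselves were roughly just the standard per-model upgrade, run once per model (the model names below are placeholders for our two OpenStack models):

$ juju upgrade-model -m openstack-1 --agent-version 2.9.32
$ juju upgrade-model -m openstack-2 --agent-version 2.9.32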

Now I can also query leadership on all units, although running juju run --unit charmname/leader is-leader will switch back and forth between True and False.

What I meant by “won’t start” was that many units never seemed to get anywhere and just died immediately, as with mysql/0 above (the first log excerpt), where this sequence repeated every five seconds:

unit-mysql-0: 09:27:30 INFO juju.worker.uniter unit "mysql/0" started
unit-mysql-0: 09:27:30 INFO juju.worker.uniter hooks are retried false
unit-mysql-0: 09:27:30 INFO juju.worker.uniter unit "mysql/0" shutting down: catacomb 0xc00172a480 is dying
unit-mysql-0: 09:27:30 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: unknown object type "SecretsManager" (not implemented)
machine-1-lxd-4: 09:27:35 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-0

… but this has been fixed by upgrading the model.

On units that still report leadership issues, some could be fixed by logging in to the unit and restarting the agent, with enough time between stop and start (I read in some bug report that this could help):

$ juju ssh nova-api/0
$ sudo systemctl stop jujud-machine-1-lxd-10.service ; sleep 90 ; sudo systemctl start jujud-machine-1-lxd-10.service

This has so far helped in some cases but not all. I have two models, and after a lot of agent restarts I’m down to only 2 units out of 656 in one model and 4 out of 278 in the other that still have leadership issues, so the situation is much better now. I have not seen any other issues when digging through the logs. I have also been in contact with Canonical support on this matter since this morning.

I find the units with leadership issues in the logs like this (here filtered for the current hour):

juju debug-log --replay | grep ": 18:" | grep 'juju.worker.uniter.operation we should run a leader-deposed hook here' | sort
unit-grafana-1: 18:00:50 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:01:52 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:02:54 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:03:53 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:04:56 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:05:56 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:06:57 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:07:58 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:08:59 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:10:00 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-grafana-1: 18:11:01 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:00:51 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:01:53 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:02:53 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:03:53 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:04:54 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:05:56 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:06:57 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:07:58 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:09:00 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:10:00 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:11:02 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet

I have only one unit of each of these two charms.

In this situation, querying the leadership every 30 seconds looks like this:

$ while true ; do juju run --unit neutron-api/0 is-leader ; sleep 30 ; done
False
True
False
True
False

… and so on.

In the logs, this is what happens during this time while the leadership is “flapping” (I also include the machine agent here):

$ juju debug-log --replay --include unit-neutron-api-0 --include machine-2-lxd-3
machine-2-lxd-3: 18:05:56 INFO juju.worker.leadership neutron-api leadership for neutron-api/0 denied
unit-neutron-api-0: 18:05:56 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
machine-2-lxd-3: 18:06:24 INFO juju.worker.leadership logrotated leadership for logrotated/30 denied
machine-2-lxd-3: 18:06:24 INFO juju.worker.leadership neutron-api/0 promoted to leadership of neutron-api
unit-neutron-api-0: 18:06:24 INFO juju.worker.uniter found queued "leader-elected" hook
unit-neutron-api-0: 18:06:24 INFO juju.worker.uniter.operation skipped "leader-elected" hook (missing)
machine-2-lxd-3: 18:06:57 INFO juju.worker.leadership neutron-api leadership for neutron-api/0 denied
unit-neutron-api-0: 18:06:57 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
machine-2-lxd-3: 18:07:25 INFO juju.worker.leadership logrotated leadership for logrotated/30 denied
machine-2-lxd-3: 18:07:25 INFO juju.worker.leadership neutron-api/0 promoted to leadership of neutron-api
unit-neutron-api-0: 18:07:25 INFO juju.worker.uniter found queued "leader-elected" hook
unit-neutron-api-0: 18:07:25 INFO juju.worker.uniter.operation skipped "leader-elected" hook (missing)
machine-2-lxd-3: 18:07:58 INFO juju.worker.leadership neutron-api leadership for neutron-api/0 denied
unit-neutron-api-0: 18:07:58 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet

During startup of the unit, I get the errors about the relation state tracker that cannot be created:

machine-2-lxd-3: 18:23:41 INFO juju.cmd running jujud [2.9.32 917a8f1033561ce28a73ff81d71da75aec6e0785 gc go1.18.3]
machine-2-lxd-3: 18:23:41 DEBUG juju.cmd   args: []string{"/var/lib/juju/tools/machine-2-lxd-3/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "2/lxd/3", "--debug"}
machine-2-lxd-3: 18:23:41 DEBUG juju.utils setting GOMAXPROCS to 48
machine-2-lxd-3: 18:23:41 DEBUG juju.agent read agent config, format "2.0"
machine-2-lxd-3: 18:23:41 INFO juju.agent.setup setting logging config to "<root>=INFO;unit=INFO;"
machine-2-lxd-3: 18:23:41 INFO juju.worker.upgradesteps upgrade steps for 2.9.32 have already been run.
machine-2-lxd-3: 18:23:41 INFO juju.api connection established to "wss://172.23.1.34:17070/model/9fe21480-f7f9-4f35-8d65-17445e88a385/api"
machine-2-lxd-3: 18:23:41 INFO juju.worker.apicaller [9fe214] "machine-2-lxd-3" successfully connected to "172.23.1.34:17070"
machine-2-lxd-3: 18:23:41 INFO juju.worker.upgrader desired agent binary version: 2.9.32
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer new context: units "nrpe-container/27, telegraf/70, filebeat/22, logrotated/30, neutron-api/0", stopped ""
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer creating new agent config for "filebeat/22"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer starting workers for "filebeat/22"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer creating new agent config for "logrotated/30"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer starting workers for "logrotated/30"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer creating new agent config for "neutron-api/0"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer starting workers for "neutron-api/0"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer creating new agent config for "nrpe-container/27"
unit-neutron-api-0: 18:23:41 INFO juju Starting unit workers for "neutron-api/0"
unit-neutron-api-0: 18:23:41 INFO juju.agent.setup setting logging config to "<root>=INFO;unit=INFO;"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer starting workers for "nrpe-container/27"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer creating new agent config for "telegraf/70"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer starting workers for "telegraf/70"
machine-2-lxd-3: 18:23:41 INFO juju.api connection established to "wss://172.23.1.34:17070/model/9fe21480-f7f9-4f35-8d65-17445e88a385/api"
unit-neutron-api-0: 18:23:41 INFO juju.worker.apicaller [9fe214] "unit-neutron-api-0" successfully connected to "172.23.1.34:17070"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer checking unit "filebeat/22"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer checking unit "logrotated/30"
machine-2-lxd-3: 18:23:41 INFO juju.api connection established to "wss://172.23.3.46:17070/model/9fe21480-f7f9-4f35-8d65-17445e88a385/api"
machine-2-lxd-3: 18:23:41 INFO juju.worker.migrationminion migration phase is now: NONE
machine-2-lxd-3: 18:23:41 INFO juju.api connection established to "wss://172.23.1.34:17070/model/9fe21480-f7f9-4f35-8d65-17445e88a385/api"
machine-2-lxd-3: 18:23:41 INFO juju.worker.logger logger worker started
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer checking unit "neutron-api/0"
machine-2-lxd-3: 18:23:41 INFO juju.api connection established to "wss://172.23.2.56:17070/model/9fe21480-f7f9-4f35-8d65-17445e88a385/api"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer checking unit "nrpe-container/27"
machine-2-lxd-3: 18:23:41 INFO juju.api connection established to "wss://172.23.1.34:17070/model/9fe21480-f7f9-4f35-8d65-17445e88a385/api"
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer checking unit "telegraf/70"
machine-2-lxd-3: 18:23:41 WARNING juju.cmd.jujud determining kvm support: INFO: /dev/kvm does not exist
HINT:   sudo modprobe kvm_intel
modprobe: FATAL: Module msr not found in directory /lib/modules/5.4.0-84-generic
: exit status 1
no kvm containers possible
machine-2-lxd-3: 18:23:41 INFO juju.worker.authenticationworker "machine-2-lxd-3" key updater worker started
machine-2-lxd-3: 18:23:41 INFO juju.api connection established to "wss://172.23.2.56:17070/model/9fe21480-f7f9-4f35-8d65-17445e88a385/api"
unit-neutron-api-0: 18:23:41 INFO juju.worker.apicaller [9fe214] "unit-neutron-api-0" successfully connected to "172.23.2.56:17070"
machine-2-lxd-3: 18:23:41 INFO juju.worker.leadership nrpe-container leadership for nrpe-container/27 denied
machine-2-lxd-3: 18:23:41 INFO juju.worker.leadership telegraf leadership for telegraf/70 denied
machine-2-lxd-3: 18:23:41 INFO juju.worker.leadership logrotated leadership for logrotated/30 denied
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-nrpe-container-27
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
unit-neutron-api-0: 18:23:41 INFO juju.worker.migrationminion migration phase is now: NONE
unit-neutron-api-0: 18:23:41 INFO juju.worker.logger logger worker started
machine-2-lxd-3: 18:23:41 INFO juju.worker.upgradeseries no series upgrade lock present
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-logrotated-30
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
machine-2-lxd-3: 18:23:41 INFO juju.worker.deployer checking unit "neutron-api/0"
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-telegraf-70
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
machine-2-lxd-3: 18:23:41 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
unit-neutron-api-0: 18:23:41 INFO juju.worker.upgrader no waiter, upgrader is done
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-filebeat-22
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-neutron-api-0
machine-2-lxd-3: 18:23:41 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
machine-2-lxd-3: 18:23:42 INFO juju.worker.leadership neutron-api/0 promoted to leadership of neutron-api
machine-2-lxd-3: 18:23:42 INFO juju.worker.machiner setting addresses for "machine-2-lxd-3" to [local-machine:127.0.0.1 local-cloud:10.20.30.26 local-cloud:172.23.3.117 local-machine:::1]
machine-2-lxd-3: 18:23:42 INFO juju.cmd.jujud.runner start "2/lxd/3-container-watcher"
machine-2-lxd-3: 18:23:42 INFO juju.worker.deployer checking unit "filebeat/22"
machine-2-lxd-3: 18:23:42 INFO juju.worker.deployer checking unit "logrotated/30"
machine-2-lxd-3: 18:23:42 INFO juju.worker.deployer checking unit "nrpe-container/27"
machine-2-lxd-3: 18:23:42 INFO juju.worker.deployer checking unit "telegraf/70"
machine-2-lxd-3: 18:23:42 INFO juju.worker.machiner "machine-2-lxd-3" started
unit-neutron-api-0: 18:23:42 INFO juju.worker.uniter unit "neutron-api/0" shutting down: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
unit-neutron-api-0: 18:23:42 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
machine-2-lxd-3: 18:23:42 INFO juju.cmd.jujud.runner start "stateconverter"
machine-2-lxd-3: 18:23:43 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:23:45 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-neutron-api-0
machine-2-lxd-3: 18:23:45 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
unit-neutron-api-0: 18:23:46 INFO juju.worker.uniter unit "neutron-api/0" shutting down: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
unit-neutron-api-0: 18:23:46 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
machine-2-lxd-3: 18:23:47 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:23:50 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-neutron-api-0
machine-2-lxd-3: 18:23:50 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
unit-neutron-api-0: 18:23:50 INFO juju.worker.uniter unit "neutron-api/0" shutting down: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
unit-neutron-api-0: 18:23:50 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
machine-2-lxd-3: 18:23:51 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:23:55 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-neutron-api-0
machine-2-lxd-3: 18:23:55 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
unit-neutron-api-0: 18:23:55 INFO juju.worker.uniter unit "neutron-api/0" shutting down: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
unit-neutron-api-0: 18:23:55 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
machine-2-lxd-3: 18:23:56 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:24:02 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-neutron-api-0
machine-2-lxd-3: 18:24:02 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
unit-neutron-api-0: 18:24:02 INFO juju.worker.uniter unit "neutron-api/0" shutting down: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
unit-neutron-api-0: 18:24:02 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
machine-2-lxd-3: 18:24:03 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:24:09 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-neutron-api-0
machine-2-lxd-3: 18:24:09 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
unit-neutron-api-0: 18:24:09 INFO juju.worker.uniter unit "neutron-api/0" shutting down: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
unit-neutron-api-0: 18:24:09 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-neutron-api-0": cannot create relation state tracker: "neutron-api/0" is not leader of "neutron-api"
machine-2-lxd-3: 18:24:10 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:24:15 INFO juju.worker.leadership neutron-api leadership for neutron-api/0 denied
machine-2-lxd-3: 18:24:18 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-neutron-api-0
machine-2-lxd-3: 18:24:18 INFO juju.agent.tools was a symlink, now looking at /var/lib/juju/tools/2.9.32-ubuntu-amd64
unit-neutron-api-0: 18:24:18 INFO juju.worker.uniter unit "neutron-api/0" started
unit-neutron-api-0: 18:24:18 INFO juju.worker.uniter hooks are retried true
unit-neutron-api-0: 18:24:18 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-neutron-api-0: 18:24:18 INFO juju.worker.uniter.operation skipped "leader-settings-changed" hook (missing)
unit-neutron-api-0: 18:25:11 INFO juju.worker.uniter found queued "leader-elected" hook
unit-neutron-api-0: 18:25:11 INFO juju.worker.uniter.operation skipped "leader-elected" hook (missing)
machine-2-lxd-3: 18:25:11 INFO juju.worker.leadership neutron-api/0 promoted to leadership of neutron-api
machine-2-lxd-3: 18:25:11 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:25:44 INFO juju.worker.leadership neutron-api leadership for neutron-api/0 denied
unit-neutron-api-0: 18:25:44 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
machine-2-lxd-3: 18:26:12 INFO juju.worker.leadership filebeat leadership for filebeat/22 denied
machine-2-lxd-3: 18:26:12 INFO juju.worker.leadership neutron-api/0 promoted to leadership of neutron-api
unit-neutron-api-0: 18:26:12 INFO juju.worker.uniter found queued "leader-elected" hook
unit-neutron-api-0: 18:26:12 INFO juju.worker.uniter.operation skipped "leader-elected" hook (missing)
unit-neutron-api-0: 18:26:45 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet

So the flapping on leadership sounds like something we’ll want to debug live. Can you bring it up on Mattermost?

That should give us an opportunity to dig a little bit more interactively. I don’t know any reason why 2.9.32 wouldn’t be giving out stable leadership information. I would have expected log messages from the controller about the Lease engine failing…

It feels a lot like something where a bit of the uniter agent gets told it is the leader, and then by the time it tries to do something as the leader it has lost it. Leadership should persist for ~1 min from the time we originally told you (without anything else happening, like the automatic refreshes/heartbeat, etc.), so even the 30s check shouldn’t be failing in that fashion.


As of last week this problem has been solved, and my controllers have been working perfectly since then. The “SecretsManager” problem was easily solved as noted above, but the problem with charms not having a leader was something completely different.

DISCLAIMER: What I’m posting here is NOT a suggested solution for every similar situation; if you have this issue in a production system, get professional advice! I’m just a Juju user, not from Canonical. But maybe it can help you determine whether you have the same problem. I am not suggesting this is how to fix it!

The problem could be identified like this:

$ while true ; do juju run --unit neutron-api/leader is-leader ; sleep 30 ; done
False
True
False
True
ERROR could not determine leader for "neutron-api"
False
True
False

$ while true ; do juju run --unit logrotated/leader is-leader ; sleep 30 ; done
False
False
False
False

I was instructed to check the debug log of the controller model. I have three machines there: machine-0, machine-1 and machine-2. The only machine making noise was machine-1 (increasing the log level from WARNING to INFO also helped later on):

$ juju status -m controller
Model       Controller      Cloud/Region         Version  SLA          Timestamp
controller  lxd-controller  localhost/localhost  2.9.32   unsupported  12:44:36Z

Machine  State    DNS          Inst id        Series  AZ  Message
0        started  172.23.1.34  juju-e64711-0  bionic      Running
1        started  172.23.2.56  juju-e64711-1  bionic      Running
2        started  172.23.3.46  juju-e64711-2  bionic      Running

$ juju model-config -m controller logging-config='<root>=INFO;unit=INFO'

$ juju debug-log -m controller
machine-1: 14:32:01 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: influxdb, holder: influxdb/0): invalid lease operation
machine-1: 14:32:01 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 7561fb, lease: glance-simplestreams-sync, holder: glance-simplestreams-sync/0): invalid lease operation
machine-1: 14:32:15 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 7561fb, lease: 7561fb4a-0b8a-4278-8ccf-41c47a2281bf, holder: machine-1): lease already held
machine-1: 14:32:17 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 020f8d, lease: 020f8d98-a6e3-4222-8e46-d35c27e64711, holder: machine-1): lease already held
machine-1: 14:32:32 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: singular-controller, model: 03aeec, lease: 03aeece3-344c-4dda-819f-5351359d5eaf, holder: machine-1): lease already held
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 ERROR juju.worker.dependency "is-responsible-flag" manifold worker returned unexpected error: model responsibility unclear, please retry
machine-1: 14:32:58 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: logrotated, holder: logrotated/72): lease already held
machine-1: 14:32:58 WARNING juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: 9fe214, lease: logrotated, holder: logrotated/70): lease already held

From here I logged in to every controller machine and ran the command juju_engine_report. This produces a lot of output, but the interesting parts were in the raft: section:
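For anyone who hasn’t used it: juju_engine_report is one of the introspection helpers available on Juju machines, so I just ran it from an interactive shell on each controller machine, roughly like this (and the same on machines 1 and 2):

$ juju ssh -m controller 0
$ juju_engine_report | less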

On machine-0, it is clear that it sees 172.23.3.46:17070 (machine-2) as the leader and considers itself a follower:

  raft:
    inputs:
    - clock
    - agent
    - raft-transport
    - state
    - upgrade-steps-flag
    - upgrade-check-flag
    report:
      cluster-config:
        servers:
          "0":
            address: 172.23.1.34:17070
            suffrage: Voter
          "1":
            address: 172.23.2.56:17070
            suffrage: Voter
          "2":
            address: 172.23.3.46:17070
            suffrage: Voter
      index:
        applied: 292534506
        last: 292534506
      last-contact: now
      leader: 172.23.3.46:17070
      state: Follower
    start-count: 1
    started: "2022-07-12 14:16:04"
    state: started

On machine-1, raft doesn’t seem to be able to start:

  raft:
    inputs:
    - clock
    - agent
    - raft-transport
    - state
    - upgrade-steps-flag
    - upgrade-check-flag
    state: starting

On machine-2, it is clear that it considers itself the leader:

  raft:
    inputs:
    - clock
    - agent
    - raft-transport
    - state
    - upgrade-steps-flag
    - upgrade-check-flag
    report:
      cluster-config:
        servers:
          "0":
            address: 172.23.1.34:17070
            suffrage: Voter
          "1":
            address: 172.23.2.56:17070
            suffrage: Voter
          "2":
            address: 172.23.3.46:17070
            suffrage: Voter
      index:
        applied: 292538871
        last: 292538871
      leader: 172.23.3.46:17070
      state: Leader
    start-count: 2
    started: "2022-07-12 14:13:19"
    state: started

I then figured out that I could use strings on the raft database to find out more (but this is not a scientifically proven method!):

On machine-0, the IP of machine-2 is shown, which seems reasonable given the previous output:

$ sudo strings /var/lib/juju/raft/logs | grep :17070
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm

On machine-1 it shows its own address, but the others don’t consider this machine the leader!

$ sudo strings /var/lib/juju/raft/logs | grep :17070
SLastVoteCand172.23.2.56:17070LastVoteTerm
SLastVoteCand172.23.2.56:17070LastVoteTerm
SLastVoteCand172.23.2.56:17070LastVoteTerm

On machine-2, it shows its own address, which both machine-0 and machine-2 agree on:

LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm
LastVoteCand172.23.3.46:17070LastVoteTerm

To get out of this mess, I stopped juju-db.service and jujud-machine-N.service using systemctl stop <service> on all three controllers. Make sure you stop the machine service too, and check that all juju and mongodb processes are really gone - the machine service will respawn mongodb if it isn’t stopped.
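As a rough sketch of that step on each controller machine (the machine number in the service name differs per controller, so adjust accordingly):

$ sudo systemctl stop jujud-machine-0.service
$ sudo systemctl stop juju-db.service
$ ps aux | grep -E 'jujud|mongod' | grep -v grep

The last command should return nothing before you continue.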

With all controllers down, I moved away the file /var/lib/juju/raft/logs on machine-0 and machine-1, but kept it as-is on machine-2. Since machine-2 was considered the leader, I restarted the Juju services on that controller first. After a short while, I restarted the other two controllers. After this, all three controllers considered machine-2 the leader and things worked again! juju debug-log -m controller shows no errors anymore.
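Roughly, the commands involved looked like this (the backup location is just an arbitrary choice of mine). On machine-0 and machine-1:

$ sudo mv /var/lib/juju/raft/logs /root/raft-logs-backup

Then start the services again, on machine-2 first and on the other two a little later (adjust the machine number in the service name per machine):

$ sudo systemctl start juju-db.service
$ sudo systemctl start jujud-machine-2.service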

My only theory on why this happened in the first place is that the LXD storage hosting machine-1 ran low on (or out of) space some months ago.

By the way, you do your controller backups regularly, don’t you? https://juju.is/docs/olm/controller-backups
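For completeness, that is the same juju create-backup mentioned earlier in the thread, e.g.:

$ juju create-backup -m controller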

Many thanks to Canonical Support, @jameinel and @manadart for very professional help.


Thanks for writing this up for others!