Leadership Elections and Lease Operations

Hello all,

I am hopeful someone can help me figure out how to debug a problem I am having with lease operations and leadership election.

I am trying to follow the LivePatch On-Prem deployment tutorial to deploy a LivePatch server on my local k8s cluster.

I’m feeling my way through this as I’m new to Juju and Charms, but I got to the point where I have Juju set up, the LivePatch charm bundle deployed, and I can see the pods running in my cluster.

However, the pods never reach readiness, and each seems to be reporting the same error:

ERROR juju.worker.dependency engine.go:695 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease operation timed out

I’m thrashing on this issue because I cannot figure out what is causing the timeout. So far I haven’t been able to get any more information out of Juju, so I’m looking for suggestions on how to debug this further.

I’m also a little confused about why I’m having a problem with leadership even though I’m not running an HA setup (at least, that’s not my intention at the moment). If I only have a single unit for an application, why does it still go through a leadership election?

Here’s the status for my setup from Juju, to help illustrate:

juju status
Model      Controller  Cloud/Region  Version  SLA          Timestamp
livepatch  kubernetes  kubernetes    3.1.6    unsupported  08:20:51-05:00

App         Version  Status   Scale  Charm                           Channel        Rev  Address         Exposed  Message
ingress     25.3.0   active       1  nginx-ingress-integrator        latest/stable   81  10.111.165.16   no       
livepatch            waiting      1  canonical-livepatch-server-k8s  latest/stable   21  10.111.85.31    no       waiting for units to settle down
postgresql  14.9     waiting      1  postgresql-k8s                  14/candidate   158  10.109.182.135  no       waiting for units to settle down

Unit          Workload  Agent  Address     Ports  Message
ingress/0     active    idle   10.36.0.17         
livepatch/0   blocked   idle   10.47.0.5          waiting for pg relation.
postgresql/0  waiting   idle   10.36.0.15         awaiting for cluster to start

Thanks!

Hi, that’s interesting. It appears Juju is failing to determine who the leader is in your applications. It doesn’t matter whether your deployment is HA or not: Juju always elects a leader for each application, so in your case I’d expect each unit to be the leader of its own single-unit application. A healthy deployment should look like:

[screenshot: juju status output from a healthy model of mine]

(Ignore the error: that’s a charm in development I’m working on.) Note the asterisks next to each unit name: that means that unit is the leader. The only (legitimate) reason I know of that might cause Juju to fail to assign leadership is a unit being stuck in an unresponsive state for a long period of time, but I can’t think of what could cause ALL of your applications to do that simultaneously.
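
If you want to double-check leadership from the CLI, you should also be able to run the is-leader hook tool against each unit (Juju 3.x syntax; on 2.9 this was juju run --unit <unit> is-leader):

juju exec --unit ingress/0 is-leader
juju exec --unit livepatch/0 is-leader
juju exec --unit postgresql/0 is-leader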

What I’d try:

  • Can you ascertain that the k8s cluster is healthy? Can you ssh into the various units and check that things are alive as they should be? (is nginx running, etc…)
  • Can you try to scale up one or all of the applications and see if a leader is elected?
  • Does the debug-log show anything interesting? Try fiddling with model-config logging-config=<root>=DEBUG to see if you can get more output. (Rough examples of all of these are below.)
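
Roughly what those look like on the CLI (unit and application names taken from your status output; adjust as needed):

# shell into a unit to poke at the workload
juju ssh ingress/0

# scale a k8s application from Juju rather than from kubectl
juju scale-application livepatch 2

# raise the model's log level and follow the logs
juju model-config logging-config="<root>=DEBUG"
juju debug-log --tail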

@wallyworld may have more suggestions

Hey Pietro,

Thanks for the explanation and suggestions. They pushed me in a useful direction and I’ve found a bit more of a trace of what’s going on (but no resolution yet).

  • I did poke around my cluster for a bit, and the other workloads I have in there appear to be in a good state (working); the problem really seems to be isolated to the Juju charm containers.
  • I haven’t figured out how to scale up the applications from Juju; if I do it from k8s directly, the pods that get created don’t launch properly, which made sense to me since the configuration in Juju wouldn’t match what k8s actually has running.
  • I did poke around the log files and changed the log levels, which uncovered some new information I didn’t have before.

I’d kinda ignored the log files for the controller (since it shows as healthy), but while I was playing around with the logging configuration I did peek at those, and there are signs in there of things not working right. When I trace the logs for lease management, I see this:

controller-0: 09:13:16 WARNING juju.core.raftlease response timeout waiting for Command(ver: 1, op: claim, ns: application-leadership, model: 882f89, lease: ingress, holder: ingress/0) to be processed
controller-0: 09:13:16 TRACE juju.core.raftlease runOnLeader claim, elapsed from publish: 5.002s
controller-0: 09:13:16 TRACE juju.worker.lease.raft [049ade] timed out handling claim by ingress/0 for lease ingress, retrying...
controller-0: 09:13:16 TRACE juju.worker.lease.raft [049ade] ingress/0 asked for lease ingress, no lease found, claiming for 1m0s
controller-0: 09:13:21 WARNING juju.core.raftlease response timeout waiting for Command(ver: 1, op: claim, ns: application-leadership, model: 882f89, lease: ingress, holder: ingress/0) to be processed
controller-0: 09:13:21 TRACE juju.core.raftlease runOnLeader claim, elapsed from publish: 5.001s
controller-0: 09:13:21 TRACE juju.worker.lease.raft [049ade] timed out handling claim by ingress/0 for lease ingress, retrying...
controller-0: 09:13:21 TRACE juju.worker.lease.raft [049ade] ingress/0 asked for lease ingress, no lease found, claiming for 1m0s
controller-0: 09:13:32 WARNING juju.core.raftlease response timeout waiting for Command(ver: 1, op: claim, ns: application-leadership, model: 882f89, lease: ingress, holder: ingress/0) to be processed
controller-0: 09:13:32 TRACE juju.core.raftlease runOnLeader claim, elapsed from publish: 5.005s
controller-0: 09:13:32 TRACE juju.worker.lease.raft [049ade] timed out handling claim by ingress/0 for lease ingress, retrying...
controller-0: 09:13:32 TRACE juju.worker.lease.raft [049ade] ingress/0 asked for lease ingress, no lease found, claiming for 1m0s

Based on the “ingress/0 asked for lease ingress, no lease found, claiming for 1m0s” message that appears continuously, it seems the lease claim is being made for the application leader, but somehow it falls off the rails… Judging by what happens after that, the claim is never acknowledged (there’s no response, so it times out), and when it retries it doesn’t find the lease that was supposedly just claimed, so the claim isn’t being retained either.

I want to do some more digging in this area to see if there’s anything else in the log file that helps explain this sequence of events, but even with TRACE this is all the detail I’m getting out of these modules.
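
In case it helps anyone else, this is roughly how I scoped the extra logging to the lease machinery on the controller model rather than putting the whole root at TRACE (module names taken from the log lines above):

juju model-config -m controller logging-config="<root>=WARNING;juju.worker.lease=TRACE;juju.core.raftlease=TRACE"
juju debug-log -m controller --include-module juju.worker.lease --include-module juju.core.raftlease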

Connect to the controller’s container. If you source /etc/profile.d/juju-introspection.sh, you should then be able to run juju_engine_report.

Paste it here. I don’t believe it will contain any sensitive information.
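
Something like this should get you there (on a k8s controller, juju ssh to machine 0 of the controller model should drop you into the controller pod):

juju ssh -m controller 0
source /etc/profile.d/juju-introspection.sh
juju_engine_report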

That’s a good thing to know, thanks for the suggestion.

I did, however, figure out what the source of all my problems was… To deploy Juju on my Kubernetes cluster, I had to create a storage class, which I’d done using nfs-subdir-external-provisioner.
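
For reference, the provisioner was installed roughly like this (the server address, export path, and storage class name below are placeholders rather than my exact values):

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=10.0.0.5 \
  --set nfs.path=/srv/nfs/juju \
  --set storageClass.name=nfs-storage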

However, I made a mistake when I set up the NFS volume that would be used for this, so the ownership on the files created by/for the juju deployments didn’t get preserved.

I thought I’d fixed it, but when I looked at the “raft” subdirectory for the controller, I realized it hadn’t been updated in days, so I figured something was still off with the way the NFS volume was being used. After I opened up the volume, things started moving again, and I finally saw more information in the logs that let me resolve the issue completely.

Once I got there, the pods/containers for the juju deployments came up and started acting like they should.
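
For anyone who hits the same thing: the usual suspect when file ownership isn’t preserved on an NFS-backed volume is root squashing on the export. The snippet below is only an illustration of the kind of export settings involved (the path and subnet are placeholders, not my exact values):

# /etc/exports on the NFS server; no_root_squash keeps ownership as written by the pods
/srv/nfs/juju  10.0.0.0/8(rw,sync,no_subtree_check,no_root_squash)

# re-export after editing
sudo exportfs -ra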

So this issue is resolved; thanks for the help and suggestions!
