Freezing Leadership During Series Upgrade

manadart · 13 September 2018 11:25

Context

At present, when performing an upgrade of Ubuntu underneath Juju agents, it is possible for application leadership to change while unit agents are stopped for preparation.

There is a requirement from the field that this does not happen - effectively a request to “freeze” application leadership for the duration of such upgrades.

Approach

It has been suggested that we only need to consider acting in this capacity when there are leader units on a particular machine being upgraded. However, an operator may wish to work outside the scope of a single machine upgrade; for example, preventing leadership elections while all non-leaders are upgraded, so as to avoid handing leadership up or down versions.

With this in mind it has been suggested that we expose such a facility as a separate client command in order to ensure that operators can exercise the control they need over such changes.

During discussions of this requirement, the consensus was that leadership logic should not be modified with code particular to series upgrade concerns; rather, a general lease “pinning” feature should be introduced, which series upgrade and future features can make use of.

Design

The immediate focus will be on the general lease pinning facility, not on a new client command.

The leadership layer above the lease implementations does not present an API that gives enough control to do pinning. It is in the lease layer itself that this logic needs to reside. There are currently two lease implementations:

State-based (legacy)
Raft

In the develop branch of Juju, Raft is used by default for leases. There is a feature-flag that allows falling back to the legacy state-based lease management.

Possible Method: Pin by Invoking a Long Lease Extension
Under this method, the call to pin the current leader for an application would involve extending the current leader’s lease by some long period, say 10 years. Unpinning would involve setting the expiry back to a standard duration.

Possible Method: Pin by Storing Information in State
This method would involve creating a new collection into which pinned leader units are written and from which they are later removed. It would be consulted and used to circumvent the expiry workflow. This presents challenges in the Raft implementation, which does not use a connection to MongoDB.

Open Questions

Should we introduce the pinning capability to the legacy lease logic in addition to the Raft implementation?
What will be the mechanism for pinning?

rick_h · 13 September 2018 18:55

How is the lease itself defined? Is there something we can add to the lease, such that when pinned a new lease is created with an annotation on it? Then when the lease expires does that become available/visible? I’m guessing if the lease times out the previous one is just dropped and the participating parties come together requesting a new one, and the lack of carry over makes this not workable.

I guess I’m wondering if there’s a lighter method vs creating a new state collection we can annotate already existing details in state to indicate it’s pinned.

wallyworld · 14 September 2018 04:37

We also come across the same issue with model migration. There’s a hacky solution in place IIANM where the lease expiry times are extended by an arbitrary amount to allow for the migration to complete. But what’s there is not acceptable long term. So this is a good chance to come up with a holistic approach that caters for these different scenarios.

manadart · 14 September 2018 07:25

Thanks @wallyworld. Good tip.

Relevant logic looks to be here.

manadart · 14 September 2018 07:54

@rick_h as I understand it, with Raft we maintain an in-memory store that contains the results of Raft’s consensus determination. We can control the data handed around in commands and stored as lease entries, but when Raft decides that something expires our only action is to send a notification that it happened.

@babbageclunk can correct me if I am barking up the wrong tree here.

babbageclunk · 17 September 2018 12:59

(We talked about this at the sprint, but posting here for posterity.) Raft has its concept of leadership that we can’t really control (the raft nodes will elect a new leader when they decide that the current leader’s gone away), but leases aren’t tied to that. Leases are only expired by our code running in the FSM, in response to time updates from the clock updater worker, so tweaking that logic to check for pins or long leases (rather than just checking expiry time) is definitely feasible.

jameinel · 25 September 2018 09:42

I wanted to discuss why I think it is ok to pin by extending the lease and then setting the lease short again.

Our contract with clients is that they ask for a lease of “at least X more seconds” and we return that we have given them what they requested. As long as we don’t tell them how much longer we won’t elect someone else, then at any point we should be able to set the remaining time to X seconds. That way we know we haven’t ever told them a lie and taken away their leadership in less than X seconds.

I do think this has the potential for abuse (people locking leadership for other reasons and then forgetting to ever unlock it). But I think it is a better way to do it than the other things we’ve discussed.

(sorry for the lateness, I had written this a while ago, but forgot to hit send)

manadart · 1 October 2018 15:09

Recent discussions around the scope of freezing leadership as it relates to series upgrades have brought up the following considerations:

The freezing of leadership might not be desirable as a blanket behaviour - it may be perfectly reasonable for re-election to happen while a leader is down for upgrade, or when the leader is not currently involved in an upgrade.
The scope of leader freezing may not be effective over a single machine upgrade - it may be necessary to freeze leadership while all machines hosting units of an application are upgraded.

This would tend toward changing the approach outlined in the original post, to giving control via an independent client command as the first implementation of “pinning”.

In turn, status reporting should include information on any currently frozen leadership, so that every user-driven pinning can be matched with a corresponding un-pinning.

rick_h · 1 October 2018 18:12

I’ve been trying to think through the issues of a manual pinning step and it comes down to a few fundamentals.

If we have a manual pinning step then the operator needs to know, per application, what requires pre-pinning/release vs not. That’s something for the charm author to know best about failover/upgrades imo.
If the operator misses pinning leadership can it cause issues? Any time you’ve got multiple steps to get one flow done there’s the chance for mistakes and the cost of those mistakes we want to keep as low as possible. In thinking through things that might work this way storage is the only one I can really come up with. It requires you to first create pools then to use them/etc. There’s a lot of work to try to make sure that if you forget or want to do it after it’s ok though.

In this light I’d like to propose that we go the route of having an opinionated default path but enable the manual pinning as a get out of jail card for optional use by folks that know they need it.

In this flow we encourage charm authors to have an expected default behavior. This is also true of the managed updates (generations) work in the future. Juju has an opinion and best practice but allows you to get around the rails when needed.
While an upgrade is in progress you’re protected from leadership changes, however once a machine is upgraded (and note that all of the other units might not be on upgraded machines) you can hit leadership issues/fail over.
If an issue comes up during an upgrade you have an active operator at the helm because the upgrade process is interactive. This provides some safety for the default pinned leader issue.

To do this we still need the cli command and methods of indicating leadership that’s pinned. We know we want this behavior/control for the managed upgrades work as well and during model-migrations. We also need a flag to the upgrade-series command to indicate that no leadership pinning is requested. The issue here is that the decision needs to be application based vs machine based. So this might need to be a list of applications to not automatically pin leadership on during the upgrade process.

jameinel · 7 October 2018 14:46

I think in the grand scheme of things it is very appropriate for it to be determined by the charm.

However, we don’t currently really expose to the charm the overall life cycle of upgrading all machines that the application is deployed on.

My personal concern is that the lifetime of the pin for an application really does span more than what is being done on any one machine. (If the failure mode is that changing leader from an upgraded unit to an unupgraded one would break things, then the leadership needs to be pinned until everything has been done.)

A thought is that a charm could nominate what units it wants eligible for leadership. When the Leader gets the “pre-upgrade-series” hook, it could nominate that it is the only one allowed. And then when a secondary runs its post-upgrade-series it could note that it should now be considered a viable alternative.

Maybe by default we nominate that only units at the same charm version as the current leader are valid candidates for leadership?
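A minimal Go sketch of that default, purely as an illustration: the `Unit` type, its `CharmRevision` field and the `Candidates` function are hypothetical, and nothing like this exists in the leadership code today.

```go
package main

import "fmt"

// Unit is an illustrative stand-in for a unit of an application.
type Unit struct {
	Name          string
	CharmRevision int // hypothetical field standing in for "charm version"
}

// Candidates returns the names of units that would be eligible for
// leadership under the proposed default: only units at the same charm
// version as the current leader.
func Candidates(leader Unit, units []Unit) []string {
	var eligible []string
	for _, u := range units {
		if u.CharmRevision == leader.CharmRevision {
			eligible = append(eligible, u.Name)
		}
	}
	return eligible
}

func main() {
	leader := Unit{Name: "app/0", CharmRevision: 7}
	units := []Unit{
		leader,
		{Name: "app/1", CharmRevision: 6},
		{Name: "app/2", CharmRevision: 7},
	}
	fmt.Println(Candidates(leader, units))
}
```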