Raft API Leases

simonrichardson · 15 October 2021 09:54

Starting with 2.9.17, Juju will offer a new feature flag for handling lease transport. This experimental feature applies backpressure to callers attempting a lease operation. It should prevent raft from being flooded with requests to append to the raft log, whilst maintaining reasonable throughput.

As this is an experimental feature and more work is required to ensure it can cope with larger setups, it is off by default and can be turned on using a controller feature flag:

Either when bootstrapping:

$ juju bootstrap lxd test --config features="[raft-api-leases]"

Or via the controller config. Unlike bootstrapping, this requires restarting the controller agents on all controllers so that each agent picks up the change:

$ juju controller-config features="[raft-api-leases]"

Additional information

The current implementation forwards all messages through Juju’s internal central eventing system (the juju/pubsub repo). There are a few flaws with this system:

If no connection to a remote server is found, all messages are dropped. Worse still, no errors are returned in this case, so in certain situations the lease system can believe that a claim, extension, or revoke may have succeeded.
If raft cannot keep up with demand when writing lease requests to the log, no backpressure is applied to the system. The events are placed in a queue and processed at some point in the future, except that by the time they’re ready for processing, they’ve already expired! You can’t cancel an event once it has been placed in the queue; you can only tell the caller of the event (via a subscription) that the event they’re waiting for will never come, and move on.

The work I’ve been doing is to fix both of these problems. Instead of sending an event over the central hub, we’ll use the API server for all lease requests. That allows us to use the battle-tested facade code, along with cancellable requests from the caller. Any API client request that is not backed by a remote server connection will return an error to the lease manager, allowing it to apply an appropriate retry strategy.

Raft is now fronted by a queuing system that can coalesce command operations for batching and can exert the correct backpressure where appropriate.

Known issues

Dealing with backpressure

Currently the lease manager doesn’t deal with backpressure efficiently. With lots of applications, each with lots of units, restarting one or more agents causes a thundering herd. The controller will struggle to come back up; fortunately, it does after some time. This is because the retries cause jitter, and it can be concerning that the controller isn’t responding in the meantime.

Work is ongoing to deal with backpressure by batching logs.

Additional information:

jamesbeedy · 16 October 2021 14:43

It’s very exciting to see this.

This looks like something that could potentially be a solution for the behavior we experience where unit logs don’t make it into the system. Thank you for posting this. Looking forward to seeing how things turn out here.

~James