Raft API Leases - Part II

Thank you for going through all this work. It is very good to see direct analysis. It would be interesting to go through the dashboards with you and see if we can pull some clear signals out of the data.

For example, it is interesting to see how in all cases (even at the 100x load) we are seeing client-side timeouts. It actually appears that if you don’t get a response in 1s, you just don’t get a response.

The second set of graphs, where both are on the same screenshot, is a bit harder to interpret. It is very nice to get them on the same axis, though. Do you have ideas on why the number of goroutines is so much higher, but still ‘steady’? Memory consumption seems to correlate with that.

If I’m reading it correctly, the Pubsub implementation only achieves about 210 successful claim-extends per second, while the API does have higher memory and goroutines, but gets to over 300 claims per second. I’m also trying to understand why we see ~300 ‘controller-0 extend success’ messages, but nearly 1000 ‘controller-0 fsm apply count’. Is that because all 3 controllers are successfully extending leases, so we are applying significantly more?
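To make the arithmetic behind that guess concrete, here is a rough sketch. It assumes that each of the 3 controllers extends at roughly the ~300/s rate seen for controller-0, and that every successful extend becomes one log entry that is applied on every node, controller-0 included. Both are my assumptions rather than something the graphs confirm, and `expectedApplies` is a made-up helper, not anything from the codebase.

```go
package main

import "fmt"

// expectedApplies estimates how many FSM applies per second a single
// node would record, assuming every successful extend on any controller
// is replicated and applied once on that node.
// Hypothetical helper for illustration only.
func expectedApplies(controllers, extendsPerController int) int {
	return controllers * extendsPerController
}

func main() {
	// ~300/s is the 'controller-0 extend success' rate read off the graph;
	// the other two controllers are assumed to extend at a similar rate.
	fmt.Println(expectedApplies(3, 300)) // prints 900
}
```

Under those assumptions we would expect about 900 applies per second on controller-0, which is in the same ballpark as the nearly 1000 on the graph.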

You stated:

Either the graph is missing, or it is the second group of graphs and we are doing quite well.

Looking back at the first two screenshots, one has a dashboard for “Raft lease operation 99th percentile”, but the second one has “Raft lease application 99th percentile”. Is that just a typo and they are the same data? (They seem to have the same trend labels.) If it is, it is interesting that there are timeouts in the PubSub case but none in API, and that API has much faster failure response times (API 99th percentile is hitting ~35ms for a failure, vs pubsub hitting 183ms for a failure and 5s timeouts). The slowest 99th percentile for API was maybe 80ms, while the minimum for pubsub success was 167ms.

It would be good to understand if there is a missing screenshot here, because I’m not seeing pubsub outperforming API, but it is plausible the failure modes are very different.