Intro and purpose of this post
Before reading any further, I would like to clarify that this post is not a proposal for replacing mongo. It is rather an attempt to question whether the original design decisions (made a long time ago) behind the choice of mongo as our de-facto backing store for Juju’s state are still applicable to the shape Juju has today, given the push to run juju not just in the data-center but also at the edge (think IoT, k8s clusters for local development etc.).
The purpose of this post is to explore the technical feasibility and performance characteristics of an alternative approach that relies on the use of raft as the primary mechanism for storing and replicating the contents of Juju’s database.
Raft is one of the prominent distributed consensus algorithms. In a nutshell, raft implements a distributed log of user-defined commands that is replicated across the nodes that are part of the cluster and is used to drive a user-defined finite state machine (FSM).
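To make the log-driven FSM idea concrete, here is a dependency-free toy sketch (a stand-in for illustration, not the actual hashicorp/raft API): a replicated log of commands applied in order drives every node's FSM to the same state.

```go
package main

import "fmt"

// Command is a user-defined entry in the replicated log. In hashicorp/raft
// this would be the payload of a raft.Log; here it is just a string.
type Command string

// FSM is a toy finite state machine driven by the log: it counts how many
// times each command has been applied.
type FSM struct {
	state map[Command]int
}

// Apply feeds one committed log entry to the FSM. Raft guarantees that every
// node applies the same entries in the same order, so all nodes converge.
func (f *FSM) Apply(c Command) {
	if f.state == nil {
		f.state = make(map[Command]int)
	}
	f.state[c]++
}

func main() {
	// The same (replicated) log applied on two independent nodes...
	log := []Command{"add-machine", "deploy-charm", "add-machine"}
	a, b := &FSM{}, &FSM{}
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	// ...yields identical state on both.
	fmt.Println(a.state["add-machine"] == b.state["add-machine"]) // prints "true"
}
```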
So, why use raft in the first place? Well, Juju already uses raft (hashicorp/raft) for keeping track of leadership leases. Note that we don’t rely on the HA features provided by raft as the leadership state can be rebuilt from the DB in case the raft log gets deleted from the controllers.
The idea of using raft is based on the observation that, even for controllers with a large number of models, the actual amount of data that describes Juju’s state is quite small (on the order of a few megabytes). Note that the above definition of state data does not include logs and charm blobs which, one could argue, shouldn’t be stored in the database in the first place. The following sections go into more detail about how blob data can be handled by such a system.
How would it work?
Such a system would be a text-book application of the Command Query Responsibility Segregation (CQRS) pattern and would consist of two components: a serialized transaction log and an in-memory FSM-driven view on the database state.
The raft log provides the means for implementing a serialized transaction log where each entry corresponds to a list of
mgo/txn.Op values that are logically associated with an individual transaction generated by the juju controller.
Once the log entries have been successfully replicated by the raft subsystem, nodes submit them to an FSM which applies them (subject to all assertions being satisfied) to the in-memory representation of the DB state. The in-memory state is only accessed when clients perform read queries.
Note that while an in-memory implementation is discussed here, we are effectively describing an interface for applying
txn.Op and performing read queries. As such, it can be substituted with a different back-end (e.g. write to sqlite, badger or even plain files).
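A hypothetical sketch of that seam might look as follows. The Op fields mirror the shape of mgo/txn.Op, but the Backend interface and its method names are illustrative assumptions, not existing Juju APIs:

```go
package main

import "fmt"

// Op mirrors the shape of mgo/txn.Op: a single operation against one
// document, optionally guarded by an assertion.
type Op struct {
	C      string      // collection name
	Id     interface{} // document _id
	Assert interface{} // assertion that must hold for the txn to apply
	Insert interface{} // document to insert, if any
	Update interface{} // mongo-style update document, if any
	Remove bool        // whether to remove the document
}

// Backend is the hypothetical seam: anything that can atomically apply a
// transaction's ops and serve reads could back the FSM -- an in-memory
// store, sqlite, badger, or plain files.
type Backend interface {
	ApplyTxn(ops []Op) error
	FindId(collection string, id interface{}) (interface{}, bool)
}

func main() {
	var _ Backend // any implementation can be swapped in behind this interface
	op := Op{C: "machines", Id: "0", Remove: true}
	fmt.Println(op.C) // prints "machines"
}
```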
The in-memory store
You might be wondering why the system works with
txn.Op values, which effectively contain mongo-specific information (e.g. bson.D, bson.M values, mongo operators like
$push etc.). Remember that, due to the way Juju has been implemented, the model, business logic and persistence layers are inherently coupled. Consequently, any solution that could potentially replace mongo should ideally work as a drop-in replacement.
To this end, the in-memory implementation would internally maintain a
bson.D value for each document (plus a map to accelerate
_id lookups) and implement a small layer for handling the most common mongo update operations. Given that transactions are serialized, the in-memory store would simply need a global
sync.RWMutex lock to control access.
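A minimal sketch of such a store might look like this, using plain Go maps as a dependency-free stand-in for bson.D documents and supporting only $set and $push (all names here are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// Doc stands in for a bson.D document; plain maps keep the sketch
// dependency-free.
type Doc map[string]interface{}

// memStore sketches the proposed in-memory store: one _id-keyed map per
// collection plus a single global RWMutex -- sufficient because the raft
// log already serializes all transactions.
type memStore struct {
	mu    sync.RWMutex
	colls map[string]map[string]Doc
}

func newMemStore() *memStore {
	return &memStore{colls: make(map[string]map[string]Doc)}
}

// applyUpdate handles a (tiny) subset of the mongo update operators.
func applyUpdate(doc, update Doc) {
	if set, ok := update["$set"].(Doc); ok {
		for k, v := range set {
			doc[k] = v
		}
	}
	if push, ok := update["$push"].(Doc); ok {
		for k, v := range push {
			arr, _ := doc[k].([]interface{})
			doc[k] = append(arr, v)
		}
	}
}

func (s *memStore) Insert(coll, id string, doc Doc) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.colls[coll] == nil {
		s.colls[coll] = make(map[string]Doc)
	}
	s.colls[coll][id] = doc
}

func (s *memStore) Update(coll, id string, update Doc) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if doc, ok := s.colls[coll][id]; ok {
		applyUpdate(doc, update)
	}
}

func (s *memStore) FindId(coll, id string) (Doc, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	doc, ok := s.colls[coll][id]
	return doc, ok
}

func main() {
	s := newMemStore()
	s.Insert("machines", "0", Doc{"series": "focal"})
	s.Update("machines", "0", Doc{"$set": Doc{"series": "jammy"}})
	doc, _ := s.FindId("machines", "0")
	fmt.Println(doc["series"]) // prints "jammy"
}
```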
As the access patterns used by juju are well known and understood, additional indexes can be implemented on top of the store to further accelerate specific types of queries (e.g. model or machine UUID indices).
Snapshots, adding and recovering crashed nodes
Occasionally (a tunable option of raft), the raft subsystem asks the leader’s store to produce a snapshot which then replaces a subset of the transaction log. This allows nodes to quickly catch up when they come online by loading the latest snapshot and replaying the remaining transactions from the log.
When a controller node comes online, it would wait until it has applied all pending log entries before allowing any R/W access to the database.
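The snapshot-plus-replay catch-up mechanism can be illustrated with a dependency-free sketch (gob-encoded state stands in for a real raft snapshot; all names are hypothetical):

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Entry is one committed transaction in the log; State is the FSM's view.
type Entry struct{ Key, Value string }
type State map[string]string

// snapshot serializes the full state; the log entries it covers can then
// be truncated from the raft log.
func snapshot(s State) []byte {
	var buf bytes.Buffer
	gob.NewEncoder(&buf).Encode(s)
	return buf.Bytes()
}

// restore rebuilds state from the latest snapshot and then replays only
// the log entries appended after it -- how a restarted node catches up
// before serving any reads or writes.
func restore(snap []byte, tail []Entry) State {
	s := State{}
	gob.NewDecoder(bytes.NewReader(snap)).Decode(&s)
	for _, e := range tail {
		s[e.Key] = e.Value
	}
	return s
}

func main() {
	live := State{"model": "default", "machines": "3"}
	snap := snapshot(live)
	// Entries committed after the snapshot was taken.
	tail := []Entry{{"machines", "4"}}
	s := restore(snap, tail)
	fmt.Println(s["machines"]) // prints "4"
}
```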
When a client attempts to apply a transaction, the transaction system checks whether the node is currently the raft leader. If it is, it appends the transaction to the log and blocks until the entry has been replicated and either applied or rejected (e.g. due to a failed assertion).
On the other hand, if the node is not the leader, the transaction system would transparently forward the transaction to the raft leader (all controllers are connected in a mesh topology) and block until the leader responds with the transaction outcome (success or failure).
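The leader-check and forwarding path could be sketched roughly as follows; in-process calls stand in for the actual mesh RPC, and all names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// node sketches the client-facing transaction path: the local node either
// appends to the log itself (leader) or transparently forwards the
// transaction to the leader and blocks for the outcome.
type node struct {
	isLeader bool
	leader   *node // direct connection in the controller mesh (assumed)
}

// applyTxn returns once the transaction has been replicated and either
// applied or rejected by the FSM.
func (n *node) applyTxn(ops []string) error {
	if !n.isLeader {
		if n.leader == nil {
			return errors.New("no leader known")
		}
		return n.leader.applyTxn(ops) // forward and block on the reply
	}
	// Leader path: append to the raft log, wait for commit, then report
	// the FSM's verdict (assertion failures would surface as errors here).
	return nil
}

func main() {
	leader := &node{isLeader: true}
	follower := &node{leader: leader}
	fmt.Println(follower.applyTxn([]string{"op"}) == nil) // prints "true"
}
```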
Instead of using a polling system like we do now, this approach would enable push-driven watchers. Watchers can register directly on the store or use a distribution system like the hub watcher to fan out change events.
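A push-driven watcher hub along those lines could be sketched like this: a bare-bones stand-in for the hub watcher, with hypothetical names (a real implementation would also need locking and unsubscription):

```go
package main

import "fmt"

// changeHub sketches push-driven watchers: the FSM publishes every applied
// change and registered watchers receive it, instead of polling the DB.
type changeHub struct {
	watchers []chan string
}

// Watch registers a new watcher and returns its event channel.
func (h *changeHub) Watch() <-chan string {
	ch := make(chan string, 16) // buffered so a slow watcher doesn't block the FSM
	h.watchers = append(h.watchers, ch)
	return ch
}

// publish is called by the FSM after each transaction commits.
func (h *changeHub) publish(change string) {
	for _, ch := range h.watchers {
		ch <- change
	}
}

func main() {
	hub := &changeHub{}
	w := hub.Watch()
	hub.publish("machine-0 changed")
	fmt.Println(<-w) // prints "machine-0 changed"
}
```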
Juju access patterns and eventual consistency
As all raft nodes will eventually execute the same transaction operations in the same order, all in-memory store instances are bound to be eventually consistent. Let’s see how this aspect of the system affects the typical Juju query patterns in a multi-controller (HA) scenario.
As far as watchers are concerned, there are no issues with eventual consistency. When a watcher starts, it first receives (event 0) a snapshot of the current state and then notifications for any future model mutations. As a result, watchers do not necessarily need to run on the leader but can potentially run on any node in the cluster.
What about API clients and business logic running on the controllers themselves (workers)? This is where the concept of tunable consistency comes into play. Most Juju state mutations follow a pretty straightforward pattern: read a bunch of models (backed by mongo docs), apply a set of mutations and commit them inside a transaction. In addition, the majority of generated transactions also include a series of assertions against the documents’ revision numbers. If the transaction fails (typically due to a mismatch in a doc revision number assertion), Juju will generally reload the mutated objects from the DB and retry the transaction until it succeeds or eventually bails out.
For this kind of access pattern, we can use eventual consistency to our advantage and generate these transactions on any of the controller nodes. Remember that the transactions are appended to the log by the leader and are applied in exactly the same order by all nodes. If a prior transaction in the log has already bumped the revision number for a doc, the assertions for transactions working on a stale read will fail and the transaction will be rejected by all FSMs.
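The revision-number mechanism can be illustrated with a tiny example (hypothetical names; real Juju transactions express this as txn-revno assertions via mgo/txn):

```go
package main

import "fmt"

// doc carries a revision number, bumped on every successful mutation, the
// way Juju's documents carry txn-revno.
type doc struct {
	value string
	revno int64
}

// applyAsserted applies a mutation only if the caller's expected revision
// still matches. Because every FSM applies the log in the same order, a
// transaction generated from a stale read is deterministically rejected by
// all of them.
func applyAsserted(d *doc, expectRev int64, newValue string) bool {
	if d.revno != expectRev {
		return false // stale read; the caller reloads and retries
	}
	d.value = newValue
	d.revno++
	return true
}

func main() {
	d := &doc{value: "a", revno: 1}
	fmt.Println(applyAsserted(d, 1, "b")) // prints "true"
	fmt.Println(applyAsserted(d, 1, "c")) // stale read: prints "false"
}
```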
As a result, we can additionally load-balance API calls across controllers and make better use of the available resources.
What about the blob store?
The blob store (and logs) is the only thorny issue that this system would not be able to directly handle: we would want the raft log to remain compact, and storing binary blobs in it would defeat that purpose.
An alternative approach would be to have a watcher per controller that monitors the charm collection and downloads, caches (to the local file-system) and removes charm blobs as the collection gets updated. A small caveat here is that each controller would download its own copy of the charm, but that’s usually not an issue as charms tend to be rather lightweight.
For local charm use-cases, controllers could ask other controllers in the cluster and grab a copy of the charm either from the controller that the charm was originally uploaded to by the juju client or from any controller that has already cached it.
Aren’t you re-inventing the wheel?
A common argument against such an approach would be something along the lines of “so, you are going to re-implement mongo in-mem db/$yourdb from scratch?”.
The short answer is yes… but! Even though we would be effectively implementing an in-memory DB, the implementation itself would be smaller and constrained to the features we need for our particular use-case.
Another argument could be “if you are going to use another DB, why not use dqlite and hide the raft details under the carpet?”. Obviously, using an externally maintained and tested DBMS would be a better solution; what prevents us from doing so is, once again, the way that Juju is implemented. While the model state would (in my opinion) be a much better fit for an RDBMS rather than a document DB, it would take significantly more effort to introduce an abstraction layer at the
state package to allow us to gradually migrate to a relational model.
What are the potential gains of such a system?
The biggest benefit of such a system would be that the amount of resources required by controllers would be significantly smaller, since we would not have the extra overhead of having to spin up a mongo instance. As a result, juju controllers could run on much smaller instances (e.g. rPI or low-resource k8s clusters).
In addition, having an in-memory representation of the DB makes queries and transaction application very fast. Network hops are only needed when applying a transaction from a non-leader node (reads are handled directly by the local in-memory store), and the in-memory store removes the need for a separate caching layer.
Finally, backup and restore becomes much simpler. The store snapshot itself serves as a backup and raft replication ensures that we can reliably install a snapshot to all nodes.
It sounds too good; what are the caveats?
Of course, the proposed approach comes with a series of interesting caveats, which is the main reason for the this-is-not-a-proposal-to-replace disclaimer at the top of the post.
The biggest caveat is the devops cost associated with such a change. Using mongo means that we get to use its battle-tested set of tools (e.g. mongotop) and to access and modify the database at run-time. This is quite important as controllers sometimes get stuck and we need to resort to DB surgery (delete stuff, add indices etc.).
Moving to a custom in-memory implementation would mean that similar tools would need to be implemented. Now, this is not that hard but it would require additional effort on our end. For example, adding a small REPL (fun fact: Juju already has one for supporting the dashboard) to run simple find-by-X queries, modify indices or run pre-canned queries would not be too hard. For more complicated queries (while debugging a broken controller, for instance), we could always grab a snapshot (it’s just raw bson documents) and use a small helper script to unpack it and insert the documents into a local mongo instance that we can then inspect.
Another caveat has to do with transaction throughput. As transactions are serialized to the raft log, chatty models (generating a large volume of transactions) could introduce extra latency to the time it takes for transactions generated by other, less chatty, models to be applied. However, this is something that can be measured (see next section) and raft itself can be scaled up (e.g. using multi-raft with different groups/logs per model).
Finally, note that some of the things discussed in this post (e.g. allowing stale reads, and load-balancing controllers) can be implemented with our current mongo setup as well (e.g. targeting mongo replicas for the reads).
Show us some code!
While this post serves mostly as food for thought, I am quite keen on getting people discussing this idea further. To this end, I have created a small self-contained proof-of-concept repository that can serve as a benchmark platform for evaluating such a system.
At the moment, the provided code allows you to set up a multi-node raft cluster with an in-memory or disk-based raft log and import the contents of a live juju DB. For each mongo collection, the DB importer generates and applies (through the leader) a transaction containing all the documents in the collection. After importing a collection, you can restart (or add new) nodes and observe how they rebuild the in-memory DB state by replaying the transaction log.
With the basic plumbing in place, the next step is to come up with a series of synthetic transaction benchmarks (with different mixes of read/write workloads), execute them against an imported (anonymized) dump of a production controller’s DB and use a tool like prometheus to capture and analyze various performance-related metrics.