Hello everybody, my team and I at Katharos Technology wanted to share our critical review of what Juju is like today. None of this is meant to senselessly bash Juju and its problems; it is an honest evaluation of Juju based on our experiences with it. We want to share what we have learned, both the good and the bad, so that we can help Juju grow and become better.
We have the utmost respect for Juju and the people involved, both from Canonical and from the community. We aren’t trying to point fingers or tell people that “you should do better”. Some of these points, such as better documentation, are already well on their way to being addressed in one way or another. This is all meant to be constructive criticism so that we can collaborate and make stuff happen!
Just a warning, this is going to be a long one, so, here we go!
Juju’s Strengths
Holistic Approach to Automation
Juju takes a unified approach to the enterprise automation problem. It doesn’t cover just one of provisioning, configuration management, deployment, or operations; it covers all of them. This is a requirement to truly automate things like scalable deployments of stateful applications such as databases. You can achieve the golden “push button scalability”, where the only thing you have to do to add a server to your application cluster and get it fully configured and user-facing in production is to run a single command or click a button.
This is not something that is provided anywhere else.
You have tools such as Terraform, Rancher, Docker Swarm, Kubernetes, Chef, Ansible, and others, but none of them cover the entire landscape of infrastructure, application, and operations automation at the same time.
Declarative Configuration + Programmable Configuration
Juju allows for a declarative application stack definition, with YAML files that define Juju Models. Declarative configuration is a key component of making systems more reproducible and source controllable, but having only declarative configuration can make a system too opinionated to adapt to different users’ use cases and applications.
Juju combines the declarative configuration of Models with the programmable configuration of Charms. This combination allows you to hook into the application’s automation life-cycle and share information with other charms to coordinate complicated application deployments in a way that other orchestrators like Swarm and Kubernetes do not allow.
This, too, is something not found anywhere else, and its flexibility is what made it possible for us to make the Lucky Charming Framework.
Juju’s Weaknesses
Barriers to Charm Deployment & Management
These are some of the obstacles we’ve noticed when it comes to using Juju and deploying/managing charms with it:
Phasing Out the Interactive Juju GUI
Lately the Juju team has been working on a new Juju GUI that focuses on wide-scale visibility while letting you control the Juju cluster through a CLI embedded in the web interface. This contrasts with the old Juju GUI, which allowed graphical drag-and-drop/point-and-click control of your Juju cluster in addition to having a built-in web CLI.
We believe the value of the old Juju GUI has been underestimated, and we would much rather preserve the ability to control the cluster through the GUI, not just with an embedded CLI.
To clarify, we are not foreign to the command-line in any respect. We spend all day on the command-line in our work, but that does not diminish the value of a performant, responsive GUI for making wide-reaching changes faster than you can on the command-line. There are a number of reasons:
Operator Friendliness
In most enterprises, it is not the person who develops the app that manages the operations for that app in production, it is the operators. The operators are handed the controls for an application that they do not have an in-depth knowledge of and tasked with being able to handle the possible problems that could happen with an application at scale in production.
Intuitiveness is going to be key to making sure that hand-off from the developer to the operator goes smoothly and properly empowers the operator to take on the tasks that will be necessary to manage the application.
Having a GUI as obvious and intuitive as the old Juju GUI would be a huge advantage. If a new server needs to be created with specific resource constraints and a new unit for a specific application added to that machine, it could hardly be easier than to say “add machine”, “set CPUs/Memory”, “add this unit over there”, “oh, and maybe this unit, too”. This is demonstrated in the LizardFS tutorial that I made with the Juju GUI:
There are improvements that could still be made to the old Juju GUI, such as a way to run charm actions, or a better “zoomed out” view that shows the status of multiple models at a time, but the core functionality there is invaluable. It is something my team and I have not experienced since leaving Rancher, the only software we have yet seen that delivers a truly amazing real-time, interactive orchestration dashboard.
Speed and Context During Dynamic Operations
Another advantage of the GUI is speed and context. When charms can change status every second and something goes into an error state, you want to know why and you need to find out quickly, or the status could change again before you can see what caused the problem.
You need to be able to see that a charm or unit is unhealthy or errored and click on that charm, or a button next to it, to get its logs instantly. Then you need to be able to close those logs and click the log button of another charm to see its logs. If I have to type out a long-winded command to get the logs on the command-line, and then type it out again to get the logs for the other charm, I’m not going to get to it soon enough. I then have to scour the verbose debug logs of the charm to find out where it went wrong, after the fact.
Rancher was an example of an amazing workflow: all of the application units were represented by little circles colored according to their status. You could right-click on any of them to get the logs or open a shell in them, right from the web GUI, instantly. You could also stop/restart the units by clicking them, or modify the charm configuration.
The icons were small and organized by model and charm ( converting into Juju terms, anyway ). You could quickly find what you were looking for and because the unit icons were small you could see the status of tens of units at a glance. You could also click on the unit to get unit-specific information in a pop-out drawer. It even had graphs for the CPU, Memory, Disk, and Network usage!
The Juju GUI wasn’t quite there yet, but it was almost there. That workflow is amazing, and something we used extensively when managing our applications in Rancher. We were confident because we could control the cluster quickly and view and respond to changes without having to type long-winded commands or look up the help messages for the CLI just to figure out how to use it before we missed what we were looking for.
Wider User Base
Juju has a smaller user base than we want it to have. It has been reported that Juju kind of needs an expert to use or teach people to use it:
That being said, who do we expect the current users of Juju to be today? They are most likely DevOps experts or else very adventurous devs. If we want to improve and extend our user base, that will involve many people who don’t spend all day on the command-line: users who would greatly appreciate an interactive GUI, or who might not even be able to use Juju without it. We can’t cater just to the people who are using Juju now, but also to the people we want to be using Juju, which is a larger audience than Juju is currently reaching.
Application Logs Need to Be Accessible ( through GUI and CLI )
Juju allows charms to log output that is accessible from the Juju GUI and from the Juju CLI, but it doesn’t have any facility for accessing the application’s logs. This is essential to understanding why your application is behaving a certain way.
We run all of our application workloads in containers, where the standard output of the container is always available as the source for application logging. It provides a standardized way to get to the application logs without having to know the application, how it was configured, and where it keeps its logs on disk. It’s always in the docker logs.
Even if you know where to find the logs for your application, it is still a two-step process to get them with Juju: you have to juju ssh into the unit that you want to investigate, then tail the log that you are interested in. That can be a lot of typing and, if you have a lot of units to check on, it can be very cumbersome and difficult to get the big picture.
Juju needs a mechanism not just to get the Charm logs for debugging the charm, but also to get the application logs for debugging the application.
Implementation idea: This could possibly be implemented as a UDP endpoint, or maybe just a file descriptor, that you could ncat or pipe the application logs to. There needs to be a way to capture the logs for the application without needing to call a command like juju-log for every line.
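As a rough sketch of the UDP variant of that idea, assume a hypothetical log collector listening on a local UDP port (Juju has no such endpoint today; the address and the forward_lines helper are made up for illustration). The application’s output is piped into a small forwarder that sends each line as one datagram, instead of one juju-log call per line:

```python
import socket
from typing import Iterable

# Hypothetical collector address: Juju has no such log endpoint today.
DEFAULT_COLLECTOR = ("127.0.0.1", 5140)


def forward_lines(lines: Iterable[str], unit: str,
                  collector: tuple[str, int] = DEFAULT_COLLECTOR) -> int:
    """Send each non-blank log line to the collector as one UDP datagram.

    Every datagram is prefixed with the unit name so the collector can tell
    units apart. Returns the number of datagrams sent.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            sock.sendto(f"{unit} {line}".encode("utf-8"), collector)
            sent += 1
    return sent
```

A wrapper would pipe the application’s output into it, for example by calling forward_lines(sys.stdin, unit="grafana/0"). UDP keeps the application from blocking if the collector is down, at the cost of possibly dropping lines.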
Cascading Irrecoverable Error States and Automatic Machine Removal
For safety, Juju charms behave rather conservatively when any charm hook fails. The charm goes into an error state and essentially freezes all operations ( other than retrying the previous operation ) to prevent data loss. The issue is that this tends to result in cascading, irrecoverable error states when there is a bug in a charm.
For example, I ran into a situation where I had an HTTP proxy charm with a bug that caused a hook failure under certain conditions. I related this HTTP proxy charm ( not knowing about the bug ) to a Grafana charm. When the proxy charm went into an errored state, the only way to fix it was to force remove the charm. I force removed the proxy charm, but then Grafana’s hook errored out, not because of a bug in Grafana, but because Grafana’s Juju agent was eternally trying to respond to a relation hook event on a relation that no longer existed after the force removal of the proxy.
My only recourse was to force remove Grafana. If that Grafana charm had been related to a database, the database charm probably would have errored and had to be force removed as well, and so on.
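To make the worst case concrete, it can be modeled as a walk over the relation graph: if force removing an application can leave every application related to it stuck on a hook for a relation that no longer exists, then everything reachable from the first failure may have to be force removed. A minimal sketch, using a made-up model:

```python
from collections import deque


def cascade(relations: list[tuple[str, str]], start: str) -> set[str]:
    """Return every application reachable from `start` over relations.

    Models the worst case described above, where each force removal leaves
    the related applications in an error state that can only be cleared by
    force removing them too.
    """
    neighbours: dict[str, set[str]] = {}
    for a, b in relations:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    affected = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the relation graph
        app = queue.popleft()
        for other in neighbours.get(app, ()):
            if other not in affected:
                affected.add(other)
                queue.append(other)
    return affected


# Hypothetical model: proxy <-> grafana <-> postgresql, plus an unrelated pair.
model = [("http-proxy", "grafana"), ("grafana", "postgresql"), ("redis", "worker")]
print(sorted(cascade(model, "http-proxy")))  # ['grafana', 'http-proxy', 'postgresql']
```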
That only happened in a dev environment, but it is a very scary thing to put into production. One unforeseen situation in charm code could necessitate the removal of my entire production stack. And that’s not all.
Automatic Machine Removal
By default, when you remove a unit from a machine, Juju removes the machine as well if that was the last unit on it. Say that unit was the last unit of my database cluster. Juju’s default behavior is then to destroy all of the data that was stored on that machine’s disk. If I had to force remove an application because of an error state cascade, I could easily forget to back up my data first, and Juju would remove not just the charms, but the machine and all of the charm’s data, permanently. ( Obviously I should have backups, but we can never assume a user has them and just delete their data. )
To counteract this default, we at KatharosTech actually provision null charms on every host as a safety mechanism. These charms will reside on each machine and therefore prevent Juju from removing the machine automatically without me explicitly removing the null charm.
Juju should at least allow you to configure this, or maybe provide a way to have charms specify that they store persistent data that shouldn’t be destroyed without user confirmation.
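One way the suggested opt-in could behave, sketched with a hypothetical stores_persistent_data flag that a charm would declare (Juju has no such field today, and the Unit/Machine classes are illustrative, not Juju’s data model): once the last unit leaves a machine, the machine is only destroyed automatically if no unit that lived there held persistent data, unless the operator explicitly confirms.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str
    # Hypothetical charm metadata flag; not an existing Juju field.
    stores_persistent_data: bool = False


@dataclass
class Machine:
    id: str
    units: list[Unit] = field(default_factory=list)
    # Set once any unit holding persistent data has been removed from here.
    held_persistent_data: bool = False


def remove_unit(machine: Machine, unit_name: str, confirmed: bool = False) -> bool:
    """Remove a unit; return True if the now-empty machine may be destroyed.

    Keeps today's default (destroy the machine once it is empty) unless a
    unit on it held persistent data and the operator has not confirmed.
    """
    unit = next(u for u in machine.units if u.name == unit_name)
    machine.units.remove(unit)
    if unit.stores_persistent_data:
        machine.held_persistent_data = True
    if machine.units:
        return False  # other units still live on this machine
    return confirmed or not machine.held_persistent_data
```

This does explicitly what our null charms do implicitly: the machine survives until someone consciously decides its data can go.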
Barriers to Charm Development
If Juju is to be useful, you have to have charms to deploy. Obviously there is the charm store to go and download charms that people have made, but chances are, if you are not using Juju specifically because you want to deploy an app stack you found on the store, you will need to write your own charms for your applications at some point. If people are going to use Juju, then, we have to make it as easy as possible not just to use, but also to make charms.
Here are some obstacles we’ve noticed to writing charms:
Charming Documentation
The charming documentation, while actually fairly well filled out, is still hard for beginners to approach. Here are a couple of causes:
- A little bit of a split between the reactive framework and the hook-based model ( though lots of the older documentation has been appropriately marked as such )
- A lot of context is needed to understand charming that can’t really be found in just one place
The second point is probably the most important. There is a lot of investigation into charming that I had to do, reading through the hook tools documentation and such, before I actually understood how the charms worked. And it took a really long time ( relatively ) before I fully understood relations, something that essentially can’t make sense until you start actually using them.
The reactive framework and its Python and Bash variants further confused things. The reactive framework seems too magical. I couldn’t figure out where the database settings were coming from in the Python examples, and I couldn’t figure out why some of the commands in the Bash version were written in snake_case and some were written in kebab-case.
Also, it wasn’t clear at first that if you wanted to write Bash charms, you had to scour the source of the charms you wanted to relate to for the information needed to relate to them, because they only had documentation for reactive relations.
Charms Were Slow To Deploy
I also noticed when I first started following tutorials and getting into charm development that deploying and relating charms was slow. OK, not really that slow, but for someone used to using containers for everything, which start up very quickly, charms seemed super slow. Also, after the app was installed, just relating and un-relating the charms was taking a long time doing, according to the logs, nothing but printing trace messages. ( Thankfully, this has been fixed with a great effort by @jameinel to speed up the hook tools ).
These experiences with the documentation and slow deployment led us to make the Lucky charming framework to help address those issues.
How Lucky Tries to Address Charming Barriers
The goal of Lucky was to help provide a well documented and cohesive way to write charms that were fast to deploy.
Documentation
To help the documentation issue, Lucky consolidates everything you need to write charms into a single CLI, with all of the commands documented both online and in an embedded CLI documentation viewer. All of the commands you need to interact with Juju and the charming system are in one CLI, so that you hardly need to go anywhere other than the Lucky documentation to learn to write charms ( though we still link to the Juju documentation where appropriate ). That way, in your charm code, you can clearly tell that you are interacting with Juju and the charming system whenever you run a lucky command.
Bash
The other thing was, we didn’t want you to have to write Python. In my 5+ years as a system administrator writing containers, Bash has consistently been the tool for automating system installations and tasks. Almost everything you need to do is accomplished by running commands on the system, and you shouldn’t need anything beyond that plus some control flow with if statements and loops to install and manage system applications.
Python is not bad in any respect, and we may later add Python scripting support to Lucky, but you shouldn’t need it, and Bash should not be considered second class. Supporting Bash as a first-class language opens up the charming world to a wider audience and lowers the barrier to entry with a language that any system administrator who knows how to install an application is probably already comfortable with.
Docker
Docker is crucial to maximizing charm deployment speed and simplifying the charming system by eliminating the need for layers. By using Docker containers you provide a standardized way to package applications that works across Linux distributions. Additionally, almost all popular apps nowadays have Docker containers written by the application maintainers or another respected software organization. This offloads a major portion of the engineering for a large collection of applications onto the application’s maintainers and off of the charm developer.
This comes with benefits to the deployment speed of charms as well. Charms take only as long to start as the container takes to download ( plus some housekeeping by Juju ). This greatly improves the user friendliness of deploying charms and the developer friendliness of developing and re-deploying charms many times. It is a big productivity booster.
Docker is arguably an increased barrier to entry because you have to learn Docker, but Lucky does not require you to use Docker. It has a full integration with Docker as a tool that can boost your charm development productivity, and it provides a way to turn off the Docker integration when you don’t need it.
Helpful Utils
Because you are writing Bash instead of Python, you don’t necessarily have a massive standard library at your disposal, so Lucky comes with some utilities for generating things like random passwords, finding random available ports on the system, and storing state in a local key-value store. It also comes with a built-in system-independent cron scheduler.
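For comparison, here is roughly what three of those helpers amount to when Python’s standard library is available. This is not Lucky’s implementation or its CLI, only a sketch of the equivalent operations; the function names and the state file location are made up for illustration.

```python
import json
import secrets
import socket
from pathlib import Path


def random_password(nbytes: int = 24) -> str:
    """URL-safe random password from the operating system's CSPRNG."""
    return secrets.token_urlsafe(nbytes)


def random_free_port() -> int:
    """Ask the kernel for a currently unused TCP port by binding to port 0.

    The port is released again before returning, so another process could
    still grab it before the caller binds it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


class KVStore:
    """Tiny persistent key-value store backed by a JSON file."""

    def __init__(self, path: Path):
        self.path = Path(path)

    def _load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, key: str, default=None):
        return self._load().get(key, default)

    def set(self, key: str, value) -> None:
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data))
```

Bash has no direct equivalent of any of these, which is why bundling them into the charming tool itself matters for Bash-first charms.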
Platform Independence
Lucky does not depend on any system tools other than Bash. It does use platform-specific procedures for installing Docker on the target platform ( only Ubuntu so far ), but those are easy to extend to other operating systems such as CentOS. Additionally, if Python support is added to Lucky, Lucky will have the CPython interpreter and standard library built in, with no dependency on the system version of Python, greatly reducing the obstacles involved in porting charms to be multi-platform.
That, combined with Docker for encapsulating the application installation, should make it feasible to deploy one charm on CentOS and Ubuntu seamlessly.
Remaining Issues
We believe that Lucky has taken a large step toward addressing some of our biggest issues with Juju, and it has allowed us to start developing and using charms successfully. We now have a number of charms made with Lucky on the charm store, all of which have been put to use ( not in full production yet, but they will be ).
Still, there are things that are non-optimal, some of which may require large changes to Juju itself.
Charm Store Rejects Cross-Distro Charms
It has been discussed a few times before that the charm store policy does not allow you to push a charm that supports both CentOS and Ubuntu at the same time. This is essentially store policy working against one of Juju’s goals!
If users are innovative enough to do it, they should be allowed to deploy charms that support both distros. Just because the reactive framework doesn’t support CentOS doesn’t mean that other users can’t produce frameworks that do, or else just write plain ol’ Bash and check the distro before doing anything distro-specific.
If you are really, really set on this being a quality control problem, where people publish broken charms to the store, then don’t let them make such charms public, but let them use them privately. I couldn’t even publish a null charm that didn’t do anything for both CentOS and Ubuntu. So what did I do? I just skipped CentOS support, like almost every other charm on the charm store.
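The “check the distro first” approach does not need a framework. A sketch, assuming the charm can read /etc/os-release (the standard file that both Ubuntu and CentOS ship) and choosing a package manager from its ID and ID_LIKE fields; the helper names are made up, and the mapping only covers the two distributions discussed here:

```python
import shlex
from pathlib import Path


def parse_os_release(text: str) -> dict[str, str]:
    """Parse os-release KEY=value lines, handling shell-style quoting."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = " ".join(shlex.split(value))  # strips the quotes
    return info


def package_manager(info: dict[str, str]) -> str:
    """Pick a package manager from the distro ID and its ID_LIKE parents."""
    ids = {info.get("ID", "")} | set(info.get("ID_LIKE", "").split())
    if ids & {"ubuntu", "debian"}:
        return "apt-get"
    if ids & {"centos", "rhel", "fedora"}:
        return "yum"
    raise RuntimeError(f"unsupported distribution: {info.get('ID')!r}")


def detect_package_manager(path: str = "/etc/os-release") -> str:
    return package_manager(parse_os_release(Path(path).read_text()))
```

The same charm can then branch on the result instead of being split into per-distro charms.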
Versioned And Formally Defined Charm Interfaces
A major problem with the charming ecosystem is the fact that interfaces are not versioned nor are they formally defined.
For instance, I created a charm for the RethinkDB database, and I developed the reql relation interface as a way to get the info charms need to connect to the database, such as username and password. Now there are two issues: how to document it, and how to change it.
Documenting Interfaces
Until very recently, interfaces had absolutely no standard way of being documented. It was expected that people would write reactive layers to officially support an interface, but that restricts their use to Python and Reactive charms ( which don’t run on CentOS! ). That closes off the community, and now not every charm can talk to every other charm!
A recent attempt at addressing this has been the #docs:interfaces category on the forum. This is the best thing for interface documentation yet, but I still think we need a formal definition of a charm interface, maybe a YAML interface definition or something similar, to officially establish how charm interfaces should be documented.
Related to this, it would be very good to have some form of type checking on this formal definition that could warn you when you are breaking the rules for that relation interface, helping the charm developer and preventing bugs.
This is a large topic as there are a lot of ways to approach it, but I think that solving the issue is essential to a solid charm ecosystem.
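To illustrate what a formal, checkable definition might look like, here is a sketch in which an interface is described as data (the YAML idea, written as a Python dict to keep the example self-contained) and relation data is validated against it. The key names for reql below are hypothetical, not the real interface:

```python
# Hypothetical definition of the providing side of the reql interface:
# which keys the provider must publish on the relation, and their types.
REQL_V1 = {
    "name": "reql",
    "version": "1.0.0",
    "provides": {
        "host": str,
        "port": int,
        "user": str,
        "password": str,
    },
}


def check_relation_data(schema: dict, side: str, data: dict[str, str]) -> list[str]:
    """Return human-readable problems with `data` for the given side.

    Juju relation data is string key/value pairs, so typed fields are
    checked by trying to convert the string.
    """
    problems = []
    fields = schema[side]
    for key, typ in fields.items():
        if key not in data:
            problems.append(f"missing key {key!r}")
            continue
        try:
            typ(data[key])
        except ValueError:
            problems.append(f"{key!r} should be {typ.__name__}, got {data[key]!r}")
    for key in sorted(data.keys() - fields.keys()):
        problems.append(f"unexpected key {key!r}")
    return problems
```

A charm tool could run a check like this before each relation-set and warn the developer, instead of the mismatch surfacing as a hook error in the other charm.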
Versioning Interfaces
The next issue is that, even if you document your interfaces, you cannot version them. Juju and the charm store essentially build a sort of “package manager” for full-blown applications, but what would a package manager be without package versions?
With the reql interface I designed, there is currently no way to join as any user other than the admin user. That is something I want to add to the interface, but what if that feature necessitated a breaking change to the relation key-value procedure? If I just up and changed the way the interface worked, other charms on the store would simply stop working with my charm, with absolutely no explanation of what went wrong.
The only way to avoid this today is to manually put a version in the interface name, such as reql2, which is not a pattern I have seen suggested anywhere, and yet it is the only way I can imagine to make sure you don’t arbitrarily break relations with the way interfaces are set up today. This is another point where the community has no clear documentation or guidelines, and where Juju has no constructs to enforce structure and compatibility.
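If interfaces carried versions, Juju or the charm store could refuse an incompatible relation up front instead of letting it fail at hook time. A sketch of such a check, assuming semantic-versioning rules where a major-version bump signals a breaking change (the version numbers and the compatible helper are hypothetical; Juju has neither today):

```python
def parse_version(v: str) -> tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into comparable integers."""
    major, minor, patch = (int(p) for p in v.split("."))
    return major, minor, patch


def compatible(provided: str, required: str) -> bool:
    """True if a provider of `provided` satisfies a requirer built against `required`.

    Same major version, and the provider is at least as new: it may have
    added optional keys, but must not have removed or changed existing ones.
    """
    p, r = parse_version(provided), parse_version(required)
    return p[0] == r[0] and p >= r  # tuples compare element by element
```

Under this scheme, the breaking change to reql would simply ship as reql 2.0.0, and a relation between a 2.x provider and a 1.x requirer would be rejected with a clear message rather than failing silently.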
Key-Value Relation System Seems Flawed
That brings us to the way relations are structured. I think most can agree that relations are both the most powerful and the most confusing aspect of charms. They are amazing and the key to Juju’s ability to organize application deployments, but they are a bitter pill to swallow when you are trying to use them in your own charms for the first time.
Relations Are Difficult to Learn
The documentation for them suffers from maybe ( in overly figurative terms ) the worst plague in project documentation: all of the information is there, you just can’t understand how it affects you practically.
The relation documentation is quite complete, actually ( not counting app relations which just came out recently ), but the problem is that it is very hard from that documentation to figure out how relations pan out practically in a project.
Only when you start messing with relations, probably doing some experimenting with juju-run, and start writing charms that both provide and use relations, do you actually start to figure out how they work.
Also, with the advent of app-scope relations, which solve a great practical use-case, things get even more complicated. There are even more rules for what each side of a relation can do, what information it can get to and when, which hooks are triggered, etc. It was difficult for me to distill, and I am an experienced software engineer.
Yes, I figured it out, I taught my colleague, and I wrote a tutorial about how to do it all with Lucky and it actually wasn’t super complicated, but it was hard to get to that point.
Shared Key-Value Pairs Seems Like a Bad Practice
The other thing about relations is that the concept of a shared key-value store with hook-based discovery of changes feels like a bad practice. Not only is it confusing to learn, it also increases the chances that your charm will run into an unforeseen situation that causes a charm error and, potentially, the irrecoverable error cascade discussed earlier.
Even though both sides of the relation can only change their own side of the relation data, they are still sharing state and expecting each-other to modify the state in a predictable manner in response to each-other’s modifications. That setup, subjectively, sounds suspiciously like a concurrent programming anti-pattern that is prone to mistakes that can only be discovered by running the application. Hopefully you run the application and work out all of the mistakes before pushing it to production!
That is a little overstated, though, because that is to a certain, unavoidable extent, the nature of distributed system automation.
A Way Out? - The Actor Model
So, here is an idea that would take potentially monumental changes to Juju to support fully, but I think the idea has merit.
The Actor model is a concurrent programming model in which each actor is solely responsible for its own state, which it shares with no one. Actors accomplish work by communicating with each-other by sending messages. The Actor model is a powerful foundation for building reliable distributed systems and it is, importantly, very easy to understand.
If we structured charms as actors, the Juju controller would send the charms messages whenever the controller needed to notify the charm of something such as changed charm configuration. If the charm needed to communicate to another charm over a relation, it would send a message. If the other charm needed to respond back, it would send another message. It is very easy to think about and explain to new users.
As for the messages, they could probably be simple key-value documents with a required type field that would be a unique identifier for the type of message being sent. Maybe they could be JSON documents that store more nested data. Also, the interface would be simple to document and type-check. Each interface would define the types of messages that it needs to handle on both the providing and the requiring sides, and the types of fields that each type of message would have. We could define a simple YAML schema definition with versions for the schema.
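A minimal sketch of the idea, assuming messages are plain dicts with a required type field (the Actor, Database, and WebApp classes and the message types are invented for illustration, not a proposed API): each actor keeps its state private and changes it only in response to messages pulled from its own mailbox, and it communicates only by sending new messages.

```python
from collections import deque


class Actor:
    """Owns private state and changes it only in response to its own messages."""

    def __init__(self, name: str):
        self.name = name
        self.mailbox = deque()
        self._state = {}

    def send(self, message: dict) -> None:
        if "type" not in message:
            raise ValueError("every message needs a 'type' field")
        self.mailbox.append(message)

    def process_all(self) -> None:
        """Handle queued messages in order, dispatching on the 'type' field."""
        while self.mailbox:
            msg = self.mailbox.popleft()
            handler = getattr(self, "on_" + msg["type"].replace("-", "_"), None)
            if handler is None:
                raise ValueError(f"{self.name}: unhandled message type {msg['type']!r}")
            handler(msg)


class Database(Actor):
    def on_request_credentials(self, msg: dict) -> None:
        password = "generated-password"  # placeholder; a real charm would generate one
        self._state.setdefault("users", {})[msg["user"]] = password
        # Reply with a new message instead of writing into shared state.
        msg["reply_to"].send({"type": "credentials", "user": msg["user"], "password": password})


class WebApp(Actor):
    def __init__(self, name: str, database: Actor):
        super().__init__(name)
        self.database = database

    def on_start(self, msg: dict) -> None:
        # reply_to is a direct object reference here; a real system would use an address.
        self.database.send({"type": "request-credentials", "user": self.name, "reply_to": self})

    def on_credentials(self, msg: dict) -> None:
        self._state["password"] = msg["password"]


db = Database("rethinkdb")
app = WebApp("grafana", db)
app.send({"type": "start"})  # e.g. sent by the controller
app.process_all()  # app asks the database for credentials
db.process_all()   # database records the user and replies
app.process_all()  # app stores the credentials it received
print(app._state)  # {'password': 'generated-password'}
```

Neither side ever reads the other’s state; the only coupling is the set of message types, which is exactly what an interface schema would pin down.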
Also, we would most likely need to create a simple “schema package manager”: a centralized ( yet optionally self-hostable ) location where you push your schema definitions with their names and versions, very similar to npm, PyPI, crates.io, etc. Whoever pushes a package first gets the name, and we can collaborate on the schemas like a typical Open Source package. That way we can unify the community around formal definitions that set a standard for which charms can reliably communicate with each-other.
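The “first to push owns the name” rule with immutable, per-name versions could be as simple as the following in-memory sketch. A real registry would need authentication, persistence, and hosting, none of which is shown, and every name here is hypothetical:

```python
class SchemaRegistry:
    """In-memory sketch: first publisher owns a name; published versions are immutable."""

    def __init__(self):
        self._owners: dict[str, str] = {}
        self._versions: dict[str, dict[str, dict]] = {}

    def push(self, publisher: str, name: str, version: str, schema: dict) -> None:
        owner = self._owners.setdefault(name, publisher)  # first push claims the name
        if owner != publisher:
            raise PermissionError(f"{name!r} is owned by {owner!r}")
        versions = self._versions.setdefault(name, {})
        if version in versions:
            raise ValueError(f"{name} {version} is already published; versions are immutable")
        versions[version] = schema

    def fetch(self, name: str, version: str) -> dict:
        return self._versions[name][version]

    def latest(self, name: str) -> str:
        # Compare numerically so that 1.10.0 sorts after 1.2.0.
        return max(self._versions[name], key=lambda v: tuple(int(p) for p in v.split(".")))
```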
Also, the actor model gets rid of the limitations of having to fit into Juju’s own hook system and only operate within those hooks. That was something that was a little difficult to break out of with Lucky, when we needed to implement a cron scheduler.
Footnote: Another potentially important aspect of the Actor model is that actors should be able to skip messages for later so that they can act as a Finite State Machine ( FSM ). This helps to make the charm life-cycle easier to reason about and make more reliable.
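Skipping messages for later is often called stashing. A sketch under the same assumptions as the actor idea (dict messages with a type field; the states and message types are made up for illustration): a message that is not valid in the current state is set aside and replayed, in its original order, after the next state transition.

```python
from collections import deque


class StashingActor:
    # For each state: which message types it accepts and the state each leads to.
    TRANSITIONS = {
        "installing": {"installed": "waiting-for-db"},
        "waiting-for-db": {"db-available": "active"},
        "active": {"db-available": "active", "config-changed": "active"},
    }

    def __init__(self):
        self.state = "installing"
        self.mailbox = deque()
        self.stash = deque()
        self.handled = []  # (state, message type) log, for inspection

    def send(self, msg: dict) -> None:
        self.mailbox.append(msg)

    def process_all(self) -> None:
        while self.mailbox:
            msg = self.mailbox.popleft()
            next_state = self.TRANSITIONS[self.state].get(msg["type"])
            if next_state is None:
                self.stash.append(msg)  # not valid in this state yet; keep for later
                continue
            self.handled.append((self.state, msg["type"]))
            if next_state != self.state:
                self.state = next_state
                # Replay stashed messages, in order, ahead of anything still queued.
                self.mailbox.extendleft(reversed(self.stash))
                self.stash.clear()


actor = StashingActor()
for t in ["config-changed", "db-available", "installed"]:
    actor.send({"type": t})
actor.process_all()
print(actor.state)    # active
print(actor.handled)  # [('installing', 'installed'), ('waiting-for-db', 'db-available'), ('active', 'config-changed')]
```

Messages arriving “too early” are no longer error cases the charm author has to anticipate; they simply wait until the actor is in a state that can handle them.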
Not a Silver Bullet
Obviously the Actor model would not just magically solve our problems. A big issue right now is that any attempt at something like that would inherently prevent charms not built with the actor model from communicating with charms that are.
I’m not sure if I understood correctly, but I think that the Operator framework actually has this problem? That it only communicates well with other Operator charms? I may have misunderstood the work-in-progress getting-started doc I was reading on it. I would like some clarification on this.
Either way, I think that the way relations are structured today is sort of a problem from both a technical and a good-for-beginners perspective, but I don’t know that Juju can necessarily adapt to a new model without alienating the other charming frameworks and dividing the charm community, which would be tragic ( if it isn’t already going to happen with the Operator framework ).
Also, maybe the Actor model isn’t even a good idea. We think it is, but it hasn’t been tested. In the time that we are able to allocate for it, we are experimenting with our own actor model implementation that may one day find itself in Lucky or some other form of experimental charming framework, but we don’t know what the future looks like for that. We want to get some example of what Actor charms could look like and whether or not it is an effective strategy for writing charms.
Closing Thoughts
Forgive me for such a long post, but I wanted to be thorough. If you got this far, thank you for reading.
To reiterate, this was not written just to bash on Juju. This was written so that Juju and the community could benefit from hearing our honest opinion about the shortcomings that we have found in what is really a one-of-a-kind and amazing tool. If we didn’t think so, we actually wouldn’t spend the time to break this down like this!
I would encourage anybody who has any thoughts to quote the relevant pieces of this post, if any, and reply with their honest opinion. That is how Juju will improve, not by pretending that it is perfect when it isn’t, but by addressing what needs help. I fully hope that this will be helpful to Juju.