Minimizing CaaS operator pod resource usage

Copied over from https://bugs.launchpad.net/juju/+bug/1840841, since this is a better venue for it.

The motivating use case is this: the Kubeflow bundle comprises approximately 30 charms, with more slated to be included. Right now, Juju creates one operator pod per charm and a PV/PVC per operator pod. This is a problem because many providers, such as AWS, limit how many volumes can be mounted on a single node instance. Concretely, when I spin up a Charmed Kubernetes stack on AWS and deploy Kubeflow to it, any one node can only have 26 volumes mounted on it.

The end result is that I’m unable to scale up with Charmed Kubernetes / Juju; the only option I have is to scale out, since the charms can’t all fit onto a single node instance. This isn’t the worst thing in the world, but in a microservices world, that can translate to a lot of wasted capacity if you can’t place enough microservices onto a node to maximize utilization.

There are two solutions that seem like they would solve this issue for me. Neither one seems particularly easy, hence this being a feature request:

  1. The ability to run operator pods in a stateless manner, with no volumes.

    This would probably require some work put into making the operator code idempotent, which won’t be easy, but there are some pretty great benefits in fault-tolerance that this would bring.

  2. The ability to coalesce multiple operators into fewer stateful pods.

    This would probably be easier than the above, but may run into complexities around logging, multi-threading, etc.

Hey @knkski

I’ve had a quick look into how Kubernetes handles volume limits.

This looks like it may be an issue with CDK not adding the node label beta.kubernetes.io/instance-type on AWS (I haven’t confirmed this). For in-tree provisioners, getMaxVolumeFunc shows that it uses beta.kubernetes.io/instance-type for EBS volumes.

The Kubernetes scheduler should take volume limits into account when selecting a node for a pod, but in this case it’s just selecting the wrong volume limit due to misconfiguration.

A possible temporary solution is to use a custom StorageClass with the annotation “juju.io/operator-storage” set to “true” and choose an alternative storage mechanism.
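As a rough illustration of that workaround, here is a minimal Python sketch that builds such a StorageClass manifest and prints it as JSON. The annotation key and value are the ones mentioned above; the class name and the provisioner string are hypothetical placeholders, and you would need to substitute whatever alternative storage mechanism your cluster actually offers.

```python
import json


def operator_storage_class(name, provisioner):
    """Build a StorageClass manifest annotated for Juju operator storage."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {
            "name": name,
            # Annotation named in the post above.
            "annotations": {"juju.io/operator-storage": "true"},
        },
        # Hypothetical placeholder: replace with a real provisioner.
        "provisioner": provisioner,
    }


manifest = operator_storage_class(
    "operator-storage-example",              # hypothetical class name
    "example.com/hypothetical-provisioner",  # hypothetical provisioner
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -` once the placeholders are filled in.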

I’m going to bump this again with a slightly different use case (and edit the title): I’ve got Kubeflow deploying to microk8s, and it comprises ~30 microservices. Since each service needs an operator pod, that’s 60-70 pods total to deploy Kubeflow. That doesn’t leave a lot of room within the default pod limits for things like Istio, or other microservice-based deployments. I had someone deploying Kubeflow to microk8s hit the 110 pod limit. Although you can manually edit some args in microk8s to bump up the pod limit per node, having Juju require O(2N) pods (where N is the number of services) is not great, particularly because charming Istio is going to require a dozen or so more microservices, or roughly a quarter more of the default pod limit.
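To make the pod arithmetic concrete, here is a small Python sketch using the rough figures from this post: about 30 Kubeflow services, about 12 more for Istio, and a 110 pod limit per node. It assumes exactly two pods per service (one workload pod plus one operator pod), so the real counts will differ somewhat. The shared-operator figure at the end is only an illustration of the idea, assuming a single operator pod could serve every service.

```python
POD_LIMIT = 110  # per-node pod limit mentioned in this post


def pods_needed(services, pods_per_service=2):
    """Pods required when each service has a workload pod and an operator pod."""
    return services * pods_per_service


kubeflow = pods_needed(30)  # ~30 Kubeflow services
istio = pods_needed(12)     # a dozen or so Istio services
total = kubeflow + istio

print(f"Kubeflow: {kubeflow} pods")
print(f"Istio: {istio} pods ({istio / POD_LIMIT:.0%} of the limit)")
print(f"Total: {total} of {POD_LIMIT} pods, {POD_LIMIT - total} left")

# Hypothetical: one shared operator pod for all 42 services.
shared = pods_needed(42, pods_per_service=1) + 1
print(f"With a shared operator: {shared} pods")
```

Under these assumptions, Kubeflow plus Istio alone would take 84 of the 110 pods on a single node, while a shared operator would bring that down to 43.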

I have no idea what engineering challenge might lie underneath this, but it would be really nice to allow running multiple Juju operators per operator pod, whether it be grouped together in an M:N situation, or managing multiple services from a single operator, or some other way of minimizing resource usage.

@knkski I’ve had some chats with @wallyworld regarding this and I think we might have some ideas. One possible solution we can look into is a shared operator for k8s charms. It has its own problems, but would heavily reduce the overhead. I’ll talk with him more about it soon to see what would be involved from an engineering perspective.