Advanced scheduling on Kubernetes with Juju and charmed operators

Abstract

Operating production-ready Kubernetes clusters and making them efficient at scale is no small task. Balancing governance, performance, and costs requires careful planning, especially when user-driven workloads come into the equation. In this post, we will present how we address part of this complex challenge through the prism of advanced scheduling: keeping control of where pods are scheduled and executed in the cluster, and how Juju and charms can make this easy.

Introduction

Kubeflow and Apache Spark are two popular analytics tools commonly found in production Kubernetes clusters, especially with the rise of AI and ML workloads. By their very nature, they exacerbate the challenges of operating Kubernetes at scale:

  • Analytics workloads benefit from specialized hardware such as high-memory nodes, custom hardware resources such as GPUs, or specific architectures, while we want control-plane operations to remain stable on cost-effective instances.
  • They require the flexibility to scale down idle resources to keep costs constrained, especially if they need costly specialized hardware. On the other hand, parallel processing helps to achieve high velocity and low latency: this is where scaling out resources is key.
  • Resource utilization is highly volatile: a single job completion can release a considerable amount of resources, while cold starts impair interactive analyses. Flexibility is not enough; speed in resource allocation and deallocation is also required.
  • Decoupling those heavy workloads from control-plane operations is crucial so that system services are not starved of CPU, memory, and I/O, keeping them stable.

For all the reasons above, it is recommended to run analytics workloads in dedicated worker pools. Today I’m joined by @mattiasarti to present a few strategies for advanced scheduling on Kubernetes clusters.

Advanced scheduling

Prerequisites

Let’s create a K8s cluster with multiple node pools, each one meant for different workloads and each carrying its own set of node labels and taints, except the pool for Juju-system workloads:

  • One for K8s-control-plane workloads
  • One for Juju-system workloads (such as controllers and model operators), the only unlabeled and untainted one
  • One for Kubeflow-platform workloads
  • One for Spark-platform workloads (for arm64 charms)
  • One or more for default Kubeflow-user workloads (defaults selected for each user Profile)
  • One or more for customized Kubeflow-user workloads (non-default, explicitly configured)
  • One or more for Spark-user workloads (arm64)

For instance, we may end up with the following node state, with one node per pool (two for the K8s control plane):

kubectl get nodes
NAME                                STATUS   ROLES    AGE     VERSION
aks-a-33719789-vmss000000           Ready    <none>   7d8h    v1.32.10
aks-b-75544582-vmss000000           Ready    <none>   7d7h    v1.32.10
aks-juju-38601234-vmss000000        Ready    <none>   7d8h    v1.32.10
aks-k8s-27176324-vmss000000         Ready    <none>   7d8h    v1.32.10
aks-k8s-27176324-vmss000001         Ready    <none>   7d8h    v1.32.10
aks-kubeflow-38183491-vmss000000    Ready    <none>   7d8h    v1.32.10
aks-spark-11830284-vmss000000       Ready    <none>   25h     v1.32.10
aks-sparkjobs-28125035-vmss000000   Ready    <none>   7m21s   v1.32.10
aks-x-76501254-vmss000000           Ready    <none>   7d7h    v1.32.10

Node name prefixes reflect each pool’s purpose; for Kubeflow users, “a” and “b” are two different default pools and “x” is a special-hardware one.
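Concretely, each dedicated pool carries a matching node label and taint. The exact keys and values below are assumptions made for this walkthrough; for the Kubeflow-platform pool, the relevant node fields would look like:

```yaml
# Sketch of the node fields assumed for the Kubeflow-platform pool
# (label/taint keys are illustrative and must match the webhook configuration used later)
metadata:
  labels:
    kubeflow-platform: "true"
spec:
  taints:
    - key: kubeflow-platform
      value: "true"
      effect: NoSchedule
```

The other labeled pools follow the same pattern with their own keys (e.g. kubeflow-default-node-pool, spark-platform, spark-workloads, special-hardware).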

For Charmed Kubeflow, the depicted schema is generally achievable, although in this article we will keep Kubeflow-platform workloads on a single node pool.

Deploying CKF

Once all node pools are set up, we bootstrap our Juju controller, which will automatically land on the node pool for Juju-system workloads (being the only untainted one).

To control where pods are scheduled, we make heavy use of our Namespace Node Affinity Operator: a charm operating a MutatingAdmissionWebhook that injects pods with the desired node affinity and tolerations, configurable by the charm user for each namespace, so that any Juju model of interest can target the desired node pools. First, we create a Juju model to deploy the Namespace Node Affinity Operator, and we configure it (via its charm config option) to inject pods scheduled to the Kubeflow-platform namespace with node affinity and tolerations matching the node labels and taints of the Kubeflow-platform node pool. For example:

juju add-model namespace-node-affinity
juju switch namespace-node-affinity

juju deploy --trust --channel 2.2/stable --revision 248 namespace-node-affinity
juju wait-for application namespace-node-affinity

namespace_node_affinity_settings=$(cat << EOF
kubeflow: |
  excludedLabels:
    exclude-me-from-namespace-node-affinity-operator: "true"
  nodeSelectorTerms:
    - matchExpressions:
      - key: kubeflow-platform
        operator: Exists
  tolerations:
    - effect: NoSchedule
      key: kubeflow-platform
      operator: Exists
EOF
)
juju config namespace-node-affinity settings_yaml="$namespace_node_affinity_settings"
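With this configuration in place, a pod admitted to the kubeflow namespace would be mutated to carry scheduling fields along these lines (a sketch of the effective pod spec; the exact patch produced by the webhook may differ in shape):

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubeflow-platform
                operator: Exists
  tolerations:
    - effect: NoSchedule
      key: kubeflow-platform
      operator: Exists
```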

The pods of this operator will themselves be scheduled to any node pool free of taints, in this case the pool for Juju-system components, being the only untainted one. If we prefer to have them alongside the Kubeflow platform, we can deploy a second instance of the same operator, which the first instance will inject so that it lands on the desired node pool; once the second instance is deployed and configured, the first one can be removed. This keeps the Juju-system node pool free from anything other than Juju controller and model-operator pods.

At this point, we create the Juju model for the Kubeflow platform and we label its corresponding namespace to allow the Namespace Node Affinity Operator to affect pods in that same namespace by running:

juju add-model kubeflow
kubectl label namespaces kubeflow namespace-node-affinity=enabled
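After labeling, the namespace object carries the flag that lets the operator act on pods in it (presumably via the webhook’s namespace selector); the relevant fields are simply:

```yaml
# Relevant fields of the labeled namespace
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
  labels:
    namespace-node-affinity: enabled
```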

Mind that model-operator pods, responsible for managing the lifecycle of a particular model, are always created before the model’s namespace is labeled to enable the Namespace Node Affinity Operator, so they are never injected with node affinity and tolerations and end up on the unlabeled and untainted Juju-system node pool, as desired.

At this point, we deploy Charmed Kubeflow in any way we prefer, without specifying any extra constraints, and all Kubeflow-platform pods will automatically be injected with node affinity and tolerations to land on the Kubeflow-platform node pool.

It is worth noting that one of the Kubeflow-platform charms is in charge of creating namespaces for Kubeflow users (technically, Kubeflow “Profiles”), where user workloads (the actual machine learning work) run. These namespaces are already created with the label that enables the Namespace Node Affinity Operator to modify pod requests in them. We can therefore extend the operator’s configuration with node affinity, tolerations and exclusion labels for the namespaces of (not-yet-created) Kubeflow Profiles, so that each Profile targets its desired default node pool for user workloads. For example:

namespace_node_affinity_settings=$(cat << EOF
kubeflow: |
  excludedLabels:
    exclude-me-from-namespace-node-affinity-operator: "true"
  nodeSelectorTerms:
    - matchExpressions:
      - key: kubeflow-platform
        operator: Exists
  tolerations:
    - effect: NoSchedule
      key: kubeflow-platform
      operator: Exists
profile-i: |
  excludedLabels:
    exclude-me-from-namespace-node-affinity-operator: "true"
  nodeSelectorTerms:
    - matchExpressions:
      - key: kubeflow-default-node-pool
        operator: In
        values:
          - a
  tolerations:
    - effect: NoSchedule
      key: kubeflow-default-node-pool
      operator: Equal
      value: a
profile-j: |
  excludedLabels:
    exclude-me-from-namespace-node-affinity-operator: "true"
  nodeSelectorTerms:
    - matchExpressions:
      - key: kubeflow-default-node-pool
        operator: In
        values:
          - a
  tolerations:
    - effect: NoSchedule
      key: kubeflow-default-node-pool
      operator: Equal
      value: a
profile-k: |
  excludedLabels:
    exclude-me-from-namespace-node-affinity-operator: "true"
  nodeSelectorTerms:
    - matchExpressions:
      - key: kubeflow-default-node-pool
        operator: In
        values:
          - b
  tolerations:
    - effect: NoSchedule
      key: kubeflow-default-node-pool
      operator: Equal
      value: b
EOF
)
juju config namespace-node-affinity settings_yaml="$namespace_node_affinity_settings"
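For the configuration above to take effect, the default user pools must carry matching labels and taints. Under our assumptions, nodes in pool “a” would look like this (and analogously with value b for pool “b”):

```yaml
# Assumed node fields for default user pool "a"
metadata:
  labels:
    kubeflow-default-node-pool: a
spec:
  taints:
    - key: kubeflow-default-node-pool
      value: a
      effect: NoSchedule
```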

Running CKF workloads

We can finally create Kubeflow Profiles, and Kubeflow users can start using the Kubeflow platform to run workloads, automatically seeing them scheduled to the default node pools of their respective Profiles. For instance, creating a Notebook from the Dashboard results in it being scheduled to node pool “a” when done from Profile I and to node pool “b” when done from Profile K:

kubectl get pods -o wide -n profile-i
NAME                                               READY   STATUS    RESTARTS   AGE   IP             NODE                        NOMINATED NODE   READINESS GATES
ml-pipeline-ui-artifact-6d575b7484-l7k64           2/2     Running   0          8h    10.244.4.100   aks-a-33719789-vmss000000   <none>           <none>
ml-pipeline-visualizationserver-68fd78454f-58jcc   2/2     Running   0          8h    10.244.4.118   aks-a-33719789-vmss000000   <none>           <none>
notebook-in-default-profile-node-pool-0            2/2     Running   0          27s   10.244.4.56    aks-a-33719789-vmss000000   <none>           <none>
kubectl get pods -o wide -n profile-k
NAME                                               READY   STATUS    RESTARTS   AGE     IP             NODE                        NOMINATED NODE   READINESS GATES
ml-pipeline-ui-artifact-6d575b7484-jv8lm           2/2     Running   0          11h     10.244.5.117   aks-b-75544582-vmss000000   <none>           <none>
ml-pipeline-visualizationserver-68fd78454f-48d8j   2/2     Running   0          11h     10.244.5.60    aks-b-75544582-vmss000000   <none>           <none>
notebook-in-default-profile-node-pool-0            2/2     Running   0          2m41s   10.244.5.153   aks-b-75544582-vmss000000   <none>           <none>

If we want to target a node pool other than the default for specific workloads, for example to select special hardware, we can define individual workloads with: i) labels that selectively disable the configured behavior of the Namespace Node Affinity Operator; ii) node affinity to the desired node pool’s labels; and iii) tolerations for the desired node pool’s taints. In other words, we can always opt out of the default manipulation of pod requests and define our own node affinity and tolerations at the level of individual workloads. For example, to create a Notebook from the Dashboard, but this time on a node pool labeled and tainted for special hardware (here, the arm64 CPU architecture), we can select preconfigured PodDefaults (to add the labels), node affinity and tolerations for the Notebook’s workload:

kubectl get pods -o wide -n profile-i
NAME                                               READY   STATUS    RESTARTS   AGE   IP             NODE                        NOMINATED NODE   READINESS GATES
ml-pipeline-ui-artifact-6d575b7484-l7k64           2/2     Running   0          8h    10.244.4.100   aks-a-33719789-vmss000000   <none>           <none>
ml-pipeline-visualizationserver-68fd78454f-58jcc   2/2     Running   0          8h    10.244.4.118   aks-a-33719789-vmss000000   <none>           <none>
notebook-in-special-hardware-node-pool-0           2/2     Running   0          32s   10.244.6.83    aks-x-76501254-vmss000000   <none>           <none>

Where “Configurations” (PodDefaults), “Affinities” and “Tolerations” could be preconfigured for the Dashboard by running:

affinity_options=$(cat << EOF
- configKey: special_hardware_x
  displayName: "Special Hardware X"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: special-hardware
                operator: In
                values: [x]
EOF
)
tolerations_options=$(cat << EOF
- groupKey: special_hardware_x
  displayName: "Special Hardware X"
  tolerations:
    - key: special-hardware
      operator: Equal
      value: x
      effect: NoSchedule
EOF
)
juju config jupyter-ui affinity-options="$affinity_options"
juju config jupyter-ui tolerations-options="$tolerations_options"

for profile_name in profile-i profile-j profile-k;
do
kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: disable-namespace-node-affinity-operator
  namespace: $profile_name
spec:
  desc: "Disable namespace-node-affinity-operator"
  selector:
    matchLabels:
      disable-namespace-node-affinity-operator: "true"
  labels:
    exclude-me-from-namespace-node-affinity-operator: "true"
EOF
done
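Putting the pieces together, a workload that opts out of the webhook and targets the special-hardware pool ends up with a spec along these lines (a sketch; the pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: special-workload            # hypothetical
  namespace: profile-i
  labels:
    # label added via the PodDefault above (or set directly); makes the webhook skip this pod
    exclude-me-from-namespace-node-affinity-operator: "true"
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: special-hardware
                operator: In
                values: [x]
  tolerations:
    - key: special-hardware
      operator: Equal
      value: x
      effect: NoSchedule
  containers:
    - name: main                    # hypothetical
      image: ubuntu:24.04
      command: ["sleep", "infinity"]
```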

That is it for Kubeflow Profiles; let’s move on to Spark and alternative scheduling strategies.

Deploying Charmed Apache Spark charms on arm64

First, we create a new Juju model with

juju add-model spark

Thanks to the existing Kubernetes cluster setup, the modeloperator pod will again be scheduled on the Juju-system node pool. While we could reuse the previously deployed Namespace Node Affinity Operator to fully handle the scheduling of the charms, both affinities and tolerations, let’s see how we can achieve the same result using Juju itself for affinities. We still add the following namespace configuration to the Namespace Node Affinity Operator to inject tolerations, since this is not supported by Juju (after appropriately labeling the model’s namespace):

spark: |
  tolerations:
  - key: spark-platform
    operator: Equal
    value: "true"
    effect: NoSchedule

We then deploy the Integration Hub for Apache Spark charm using

juju deploy spark-integration-hub-k8s --channel 3/edge hub \
--constraints="arch=arm64 tags=spark-platform=true" --trust

On a K8s substrate, the arch constraint is translated to a nodeSelector, while the tags constraint acts as shorthand notation for affinities. Although we are keeping things simple for demonstration purposes in this blog post, tags can be used for significantly more advanced scheduling, e.g., spreading replica units of an application across different nodes or availability zones using anti-affinity rules.
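For reference, under our assumptions the constraints above translate into pod scheduling fields roughly equivalent to the following (a sketch; Juju’s exact rendering may differ):

```yaml
nodeSelector:
  kubernetes.io/arch: arm64         # from the arch constraint
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: spark-platform   # from the tags constraint
              operator: In
              values: ["true"]
```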

Let’s check that our charm was indeed deployed on the node we wanted:

kubectl get pods -n spark -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
hub-0                           2/2     Running   0          4h40m   10.244.7.52    aks-spark-11830284-vmss000000   <none>           <none>
modeloperator-bbdd4c788-9zj95   1/1     Running   0          6h22m   10.244.2.153   aks-juju-38601234-vmss000000    <none>           <none>

That’s it: our Juju model now spans nodes of different architectures.

Running Spark jobs in the worker node pool

Products and their workloads might also expose their own scheduling mechanisms on Kubernetes. Since Spark jobs are executed in the context of a namespace, we could once again have configured the Namespace Node Affinity Operator to schedule jobs where we want. Instead, let’s make use of one of Apache Spark’s features on K8s: template files that define the driver and/or executor pod configuration. This section presents how to run Spark jobs on the sparkjobs node pool. Let’s create a new pod_template.yaml file with the following content:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    spark-workloads: "true"
  tolerations:
    - effect: NoSchedule
      key: spark-workloads
      operator: Equal
      value: "true"

To run our Spark job, we will need the spark-client snap to create a service account with the appropriate permissions.

sudo snap install spark-client --channel 3.5/edge

spark-client.service-account-registry create --username spark-sa --namespace spark-workloads

spark-client.service-account-registry add-config --username spark-sa --namespace spark-workloads \
  --conf spark.kubernetes.driver.podTemplateFile=pod_template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=pod_template.yaml

We then submit a job using a bundled example jar:

spark-client.spark-submit --username spark-sa \
--namespace spark-workloads \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/spark-examples_2.12-3.5.7.jar 100

After a few moments, we can see the job pods running on the appropriate node, thanks to the template file adding the appropriate nodeSelector and toleration.

kubectl get pods -n spark-workloads -o wide
NAME                                                        READY   STATUS    RESTARTS   AGE   IP             NODE                                NOMINATED NODE   READINESS GATES
org-apache-spark-examples-sparkpi-ea91f59d25b85b92-driver   1/1     Running   0          46s   10.244.8.172   aks-sparkjobs-28125035-vmss000000   <none>           <none>
spark-pi-09bbaf9d25b8fce6-exec-1                            1/1     Running   0          6s    10.244.8.48    aks-sparkjobs-28125035-vmss000000   <none>           <none>
spark-pi-09bbaf9d25b8fce6-exec-2                            1/1     Running   0          6s    10.244.8.240   aks-sparkjobs-28125035-vmss000000   <none>           <none>

Pod templates can do much more, but the examples provided in this blog post should equip you to use them with Charmed Apache Spark.

Conclusion

Leveraging Kubernetes’ primitives and Canonical’s charmed analytics ecosystem, we demonstrated how a few tools and concepts help us efficiently operate a Kubernetes cluster and the workloads running on top of it, even when user-driven analytics workloads are involved. System services can remain stable on cost-effective instances, while more demanding pods are scheduled on dedicated, possibly auto-scaling worker pools.

The Juju orchestration engine, the Namespace Node Affinity Operator, Charmed Kubeflow and Charmed Apache Spark provide a set of tools on top of Kubernetes to help operate analytics workloads at scale.

Further reading

For more detailed information about the concepts and tools introduced in this blog post, please refer to the official documentation for Juju, the Namespace Node Affinity Operator, Charmed Kubeflow and Charmed Apache Spark.
