Pod Priority and Affinity in juju/charms

Hi,

I was wondering what the approach was with regards to affinity and anti-affinity of containers within juju. I had a couple of questions:

  • How can I set affinity/anti-affinity in juju?
  • How can I set pod priority in juju?
  • Would it be a charm option or a juju option?
  • How can I create a charm which takes advantage of these features?

Cheers for the help
Peter

One mechanism via which affinity/anti-affinity is currently supported is tag constraints.

juju deploy foo --constraints="tags=foo=a|b|c,^bar=d|e|f"

would result in a pod with a node selector expression of “foo in a|b|c” and “bar not in d|e|f”.
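
Concretely, that corresponds to node affinity along the following lines in the generated pod spec (a sketch, not verbatim Juju output):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: foo
          operator: In
          values: [a, b, c]
        - key: bar
          operator: NotIn
          values: [d, e, f]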

Zone constraints, if specified, map to a node selector with the key failure-domain.beta.kubernetes.io/zone.
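
For example (a sketch; the zone name is illustrative):

juju deploy foo --constraints="zones=us-east-1a"

would add a node selector term along the lines of:

- key: failure-domain.beta.kubernetes.io/zone
  operator: In
  values:
  - us-east-1a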

Currently, constraints are a deploy time option; Juju doesn’t support the concept of charm constraints as a thing (although we have talked about it at various times).

Pod priority is simply a value in the pod spec:

...
kubernetesResources:
  pod:
    priorityClassName: system-cluster-critical
    priority: 2000000000
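
For context, a minimal sketch of where this sits in a charm’s pod spec (the version number, container name and image are illustrative):

version: 3
containers:
  - name: myapp
    image: myapp-image:latest
kubernetesResources:
  pod:
    priorityClassName: system-cluster-critical
    priority: 2000000000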

Great, thanks @wallyworld!

Hello @wallyworld,

I’d like to ask something related to the affinity and anti-affinity rules. Apparently, the constraints allow us to implement (anti-)affinity rules at the Kubernetes node level.

If you look at this documentation, you will see that there’s another level of (anti-)affinity at the inter-pod level, which allows, for instance, different pods of the same application (units) to be allocated on different nodes.

Is this podAntiAffinity field possible in the current podSpec?

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0

At the moment, only node affinity is supported via Juju constraint tags.
eg

juju deploy myapp --constraints "tags=foo=bar|bar2,^baz=bar"

would create a pod with node affinity (hopefully no typos)

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: foo
            operator: In
            values:
            - bar
            - bar2
          - key: baz
            operator: NotIn
            values:
            - bar

An initial thought is that we could look to extend the syntax to account for “node” and “pod” affinity

juju deploy myapp --constraints "tags=node.foo=bar,pod.baz=bar2"

(default would be “node”).

We’d need to think it through to be comfortable with the syntax etc.; there are limited options available whilst preserving compatibility with what’s already supported.

@wallyworld I think that, in a k8s environment, pod anti-affinity is way more important and more common than node anti-affinity, so it’d be great if it were supported as soon as possible.

In the meantime, is there any workaround for this? Hard-coding the affinity rules inside charms, perhaps?

Juju does support both node and pod affinity - k8s constraint tags have been extended to support pod affinity as well as node affinity.

Use the “pod.” prefix for pod affinity and the “anti-pod.” prefix for pod anti-affinity. As with space constraints, the “^” prefix negates the match (it produces a NotIn expression).

No prefix, or “node.”, means node affinity.

eg

juju deploy somecharm --constraints="tags=node.foo=a|b|c,^bar=d|e|f,^foo=g|h,pod.foo=1|2|3,^pod.bar=4|5|6,anti-pod.afoo=x|y|z,^anti-pod.abar=7|8|9"

kubectl get -o json statefulset.apps/somecharm | jq .spec.template.spec.affinity
{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "bar",
              "operator": "NotIn",
              "values": [
                "d",
                "e",
                "f"
              ]
            },
            {
              "key": "foo",
              "operator": "NotIn",
              "values": [
                "g",
                "h"
              ]
            },
            {
              "key": "foo",
              "operator": "In",
              "values": [
                "a",
                "b",
                "c"
              ]
            }
          ]
        }
      ]
    }
  },
  "podAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            {
              "key": "bar",
              "operator": "NotIn",
              "values": [
                "4",
                "5",
                "6"
              ]
            },
            {
              "key": "foo",
              "operator": "In",
              "values": [
                "1",
                "2",
                "3"
              ]
            }
          ]
        },
        "topologyKey": ""
      }
    ]
  },
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            {
              "key": "abar",
              "operator": "NotIn",
              "values": [
                "7",
                "8",
                "9"
              ]
            },
            {
              "key": "afoo",
              "operator": "In",
              "values": [
                "x",
                "y",
                "z"
              ]
            }
          ]
        },
        "topologyKey": ""
      }
    ]
  }
}

Awesome, thanks.

I tried using the syntax as you mentioned and it worked great. The problem now is that updating the tags constraints for a running application does not change the corresponding affinity rules in the k8s specs. Can you look at this, please @wallyworld? I’m using juju 2.9.12 btw.

The semantics of constraints are that they are used to influence the deployment of new units of an application; they do not affect existing units.
Given that you will be experiencing a rolling update of the workload pods anyway, a workaround that could work for now is to update the constraints, scale down by 1, and then scale back up again (or vice versa). Any change in scale will cause Juju to recreate the pod spec (including affinity) and patch the statefulset pod spec template. We need to look at how best to model these sorts of dynamic pod attributes moving forward.
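
A rough sketch of that workaround, assuming an application called myapp currently at 3 units (names and tag values are illustrative):

juju set-constraints myapp "tags=anti-pod.foo=a|b"
juju scale-application myapp 2    # scale down by one
juju scale-application myapp 3    # scale back up; the pod template is re-rendered with the new affinity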

Scaling the application gave me this error:

$ kubectl -n test describe sts mongodb-k8s
...
Events:
Type     Reason        Age                  From                    Message
----     ------        ----                 ----                    -------
Warning  FailedCreate  13m (x29 over 3h5m)  statefulset-controller  create Pod mongodb-k8s-0 in StatefulSet mongodb-k8s failed error: Pod "mongodb-k8s-0" is invalid: [spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].topologyKey: Required value: can not be empty, spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].topologyKey: Invalid value: "": name part must be non-empty, spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]

The topologyKey field is an empty string (""), which is fine during creation, but yields a validation error during update.

If you need to set the topology key, you can add that to the constraint:

pod.topology-key=some-key

or

anti-pod.topology-key=another-key
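
For example (a sketch; the label key and topology key are illustrative):

juju deploy somecharm --constraints="tags=anti-pod.app.kubernetes.io/name=somecharm,anti-pod.topology-key=kubernetes.io/hostname"

which should render the podAntiAffinity term with topologyKey: kubernetes.io/hostname instead of an empty string.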

Thanks @wallyworld!

I have a microk8s cluster made up of multipass VMs, which I registered with juju via add-k8s.

$ multipass exec node-0 -- kubectl get nodes
NAME     STATUS   ROLES    AGE   VERSION
node-1   Ready    <none>   20h   v1.29.2
node-0   Ready    <none>   20h   v1.29.2

I managed to tell juju to deploy each unit of an app to a different node as follows:

  1. Use app.kubernetes.io/name for the anti-pod key (not app; cf. .spec.selector.matchLabels):

juju deploy --trust prometheus-k8s prom --num-units=2 \
  --constraints="tags=anti-pod.app.kubernetes.io/name=prom,anti-pod.topology-key=kubernetes.io/hostname"

  2. After the statefulset is created, manually edit out (kubectl -n cos edit sts prom) the seemingly empty nodeAffinity section:

-nodeAffinity:
- requiredDuringSchedulingIgnoredDuringExecution:
-   nodeSelectorTerms:
-     - {}
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
              - prom
      topologyKey: kubernetes.io/hostname

And voila:

$ multipass exec node-0 -- k -n cos get po -o wide

NAME                            READY   STATUS    NODE  
modeloperator-bfc5758ff-p96mj   1/1     Running   node-0
prom-1                          2/2     Running   node-0
prom-0                          2/2     Running   node-1

I wonder if there’s a way to tell juju not to render the nodeAffinity to begin with?

If not removed, then the units are stuck in waiting/allocating:

Unit    Workload  Agent       Address  Ports  Message
prom/0  waiting   allocating                  installing agent
prom/1  waiting   allocating                  installing agent

Worth raising a bug - should be a simple fix.