HA Kubernetes on small MAAS cluster (3 nodes)

Hello,

I’m trying to set up Charmed Kubernetes for the first time and I’ve run into some problems, which might be bugs, limitations of the charms, misunderstandings on my part, or a bit of each.

Let me give a bit of context first: presently I’m running a 2-node Kubernetes cluster using MicroK8s. It has 48 cores and 224 GB of RAM in total. I’m using MetalLB for accessing services from the LAN, Emissary Ingress with cert-manager for HTTP/S traffic dispatch, and Rook/Ceph on top of TopoLVM for distributed storage. On this cluster I’m running Jenkins, Sonatype Nexus, some databases and other tools, and a bunch of test environments for my small software development team. It works, but we’re starting to run out of resources. The other issue is that when one of the machines runs out of RAM and starts OOM-killing processes, the whole thing crashes. Running Ceph on k8s and mounting RBD devices in the kernel of the machine that is running k8s has the downside that when the Ceph processes are stopped, everything else gets wedged and power cycling the servers is the only way forward. Too bad that one of the servers in this makeshift cluster doesn’t even have a BMC, so somebody in the office needs to walk to the broom closet and press the reset button :sob:

I’ve decided to build a bigger and more robust cluster to replace the one we have. For this purpose, I’ve bought refurbished Dell PowerEdge servers: one R330 (4 vCores @ 3 GHz, 32 GB RAM, 500 GB SSD) and three R630s (64 vCores @ 2.6 GHz, 256 GB RAM, 2 TB NVMe). So far, I’ve successfully installed the MAAS and Juju controllers in LXD containers on the R330 and figured out how to configure the R630s via MAAS with the desired storage layout and network setup. Juju is able to claim/control/release the machines without problems.

The next step I intend to take is deploying Charmed Kubernetes and Ceph with Juju onto the 3-machine cluster provided by MAAS. I would like to make this cluster HA in the sense that it should tolerate taking down any one of the 3 machines for maintenance for a limited period of time.

I started out with `juju deploy charmed-kubernetes` without any overlays. The first surprise was that the charm requested 10 machines from MAAS. My plan is to run everything (control plane, storage and workloads) on the 3 large servers. I might add worker nodes in the future, but for now I need to make do with the hardware I have. Adding a physical server just to run etcd or nginx seems wasteful, and besides, I can turn only so much electricity into heat inside a broom closet before the damn thing turns into a fire hazard :wink: I quickly learned how to define my own bundle that declares a set of machines and places application units on them explicitly.
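To illustrate, this is roughly the shape of the overlay I ended up with (a sketch only: the placements are my own and the application list is abbreviated):

```yaml
# Overlay sketch: declare the 3 MAAS machines and pin units to them explicitly.
machines:
  "0": {}
  "1": {}
  "2": {}
applications:
  etcd:
    num_units: 3
    to: ["0", "1", "2"]
  kubernetes-control-plane:
    num_units: 2
    to: ["0", "1"]
  kubernetes-worker:
    num_units: 3
    to: ["0", "1", "2"]
```

deployed with `juju deploy charmed-kubernetes --overlay ./placement.yaml`.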

At this point I realized that it doesn’t seem to be possible to deploy `kubeapi-load-balancer` on the same machine as `kubernetes-control-plane` or `kubernetes-worker` units, because of TCP port conflicts. My understanding is that `kubernetes-control-plane` exposes the k8s API server on port 6443, while `kubeapi-load-balancer` also listens on that port, forwarding traffic to the control plane units. `kubernetes-worker`, on the other hand, exposes the default HTTP/S ingress on ports 80 and 443, and `kubeapi-load-balancer` also listens on port 443 and, as far as I understand, forwards that traffic to port 443 of the `kubernetes-worker` nodes. It does not listen on port 80 though, which seems inconsistent.

Then I found out that I can set the `ingress=false` config option on `kubernetes-worker` to make port 443 available to `kubeapi-load-balancer`. I’m fine with this, because I intend to use MetalLB + Emissary Ingress for handling HTTP/S anyway, as before.
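For reference, the equivalent bundle snippet is just a config option (sketch; this would go into the same overlay):

```yaml
applications:
  kubernetes-worker:
    options:
      ingress: false   # free up ports 80/443; MetalLB + Emissary will handle HTTP/S
```

or, on a live model, `juju config kubernetes-worker ingress=false`.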

The next bundle configuration I tried was `kubeapi-load-balancer` on machine 0, `kubernetes-control-plane` on machines 0 and 1, and `kubernetes-worker` on machines 0, 1 and 2. Obviously, this is not the HA configuration I’m after. To make it HA I’d need more control over the `kubeapi-load-balancer` configuration: if I could configure it to listen on a different port, say 8443, and forward to `kubernetes-control-plane` port 6443, I could deploy both control plane and load balancer units to all three machines. Then I could use keepalived, as described in HA for kubeapi-load-balancer | Ubuntu, to float a virtual IP of the load balancer among the three machines, which seems perfectly adequate for my setup.
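For the record, the VIP floating I have in mind is plain VRRP. A hand-written keepalived sketch would look roughly like this (the linked page actually sets this up via a keepalived charm; the interface name, VIP and priority below are made up):

```
# /etc/keepalived/keepalived.conf (sketch; all values are placeholders)
vrrp_instance kubeapi {
    state BACKUP            # every node starts as BACKUP; highest priority claims the VIP
    interface eno1          # hypothetical NIC name
    virtual_router_id 51
    priority 100            # give each of the three machines a different priority
    advert_int 1
    virtual_ipaddress {
        192.168.3.100/24    # hypothetical virtual IP for the API endpoint
    }
}
```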

So this is the first batch of questions I would like to ask: is it possible to configure `kubeapi-load-balancer` to use different listen and downstream ports? Also, while we’re at it, is it possible to disable forwarding port 443 to `kubernetes-worker`, which I don’t need? I understand I would also have to override the port in the public API server URL that all the components use. Can I do that?

In the meantime, I decided to proceed with a setup of 1 load balancer unit + 2 control plane units + 3 worker units to experiment with Ceph storage and other cluster add-ons. However, the cluster doesn’t seem to start up correctly. Here’s the state the model has settled into:

rafal@stagnum0:~$ juju status
Model        Controller  Cloud/Region  Version  SLA          Timestamp
k8s-on-maas  juju-3-3    maas/default  3.3.0    unsupported  13:39:09+01:00

App                       Version  Status   Scale  Charm                     Channel      Rev  Exposed  Message
calico                             waiting      5  calico                    1.28/stable  101  no       Waiting for Kubernetes config.
containerd                1.6.8    active       5  containerd                1.28/stable   73  no       Container runtime available
easyrsa                   3.0.1    active       1  easyrsa                   1.28/stable   48  no       Certificate Authority connected.
etcd                      3.4.22   active       3  etcd                      1.28/stable  748  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active       1  kubeapi-load-balancer     1.28/stable   84  yes      Loadbalancer ready.
kubernetes-control-plane  1.28.5   waiting      2  kubernetes-control-plane  1.28/stable  321  no       Waiting for auth-webhook tokens
kubernetes-worker         1.28.5   waiting      3  kubernetes-worker         1.28/stable  134  yes      Waiting for CNI plugins to become available

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0*                   active    idle   0        192.168.3.30                  Certificate Authority connected.
etcd/0                       active    idle   0        192.168.3.30    2379/tcp      Healthy with 3 known peers
etcd/1*                      active    idle   1        192.168.3.31    2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   2        192.168.3.32    2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0*     active    idle   0        192.168.3.30    443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0*  waiting   idle   1        192.168.3.31    6443/tcp      Waiting for auth-webhook tokens
  calico/2                   active    idle            192.168.3.31                  Ready
  containerd/2               active    idle            192.168.3.31                  Container runtime available
kubernetes-control-plane/1   waiting   idle   2        192.168.3.32    6443/tcp      Waiting for 3 kube-system pods to start
  calico/3                   active    idle            192.168.3.32                  Ready
  containerd/3               active    idle            192.168.3.32                  Container runtime available
kubernetes-worker/0          active    idle   1        192.168.3.31                  Kubernetes worker running.
  calico/0*                  active    idle            192.168.3.31                  Ready
  containerd/0*              active    idle            192.168.3.31                  Container runtime available
kubernetes-worker/1          waiting   idle   2        192.168.3.32                  Waiting for CNI plugins to become available
  calico/1                   waiting   idle            192.168.3.32                  Waiting for Kubernetes config.
  containerd/1               active    idle            192.168.3.32                  Container runtime available
kubernetes-worker/2*         waiting   idle   0        192.168.3.30                  Waiting for CNI plugins to become available
  calico/4                   waiting   idle            192.168.3.30                  Waiting for Kubernetes config.
  containerd/4               active    idle            192.168.3.30                  Container runtime available

Machine  State    Address       Inst id   Base          AZ       Message
0        started  192.168.3.30  stagnum1  ubuntu@22.04  default  Deployed
1        started  192.168.3.31  stagnum2  ubuntu@22.04  default  Deployed
2        started  192.168.3.32  stagnum3  ubuntu@22.04  default  Deployed

I was able to connect to the k8s cluster and I can see two nodes:

rafal@stagnum0:~$ kubectl get nodes
NAME       STATUS   ROLES           AGE   VERSION
stagnum2   Ready    control-plane   15h   v1.28.5
stagnum3   Ready    control-plane   15h   v1.28.5

However the system workloads don’t look healthy:

rafal@stagnum0:~$ kubectl get services
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   15h
rafal@stagnum0:~$ kubectl get pods --all-namespaces
NAMESPACE              NAME                                        READY   STATUS             RESTARTS         AGE
kube-system            calico-kube-controllers-6f698d4f87-z79sz    0/1     CrashLoopBackOff   62 (3m46s ago)   15h
kube-system            calico-node-d762p                           0/1     Running            4 (165m ago)     15h
kube-system            calico-node-tb9sk                           0/1     Unknown            1                15h
kube-system            coredns-59cfb5bf46-p7rz2                    0/1     Pending            0                15h
kube-system            kube-state-metrics-78c475f58b-pdrtt         0/1     Pending            0                15h
kube-system            metrics-server-v0.6.3-69d7fbfdf8-dpcrp      0/2     Pending            0                15h
kubernetes-dashboard   dashboard-metrics-scraper-5dd7cb5fc-wggdv   0/1     Pending            0                15h
kubernetes-dashboard   kubernetes-dashboard-7b899cb9d9-sn5sn       0/1     Pending            0                15h
rafal@stagnum0:~$ kubectl get events -n kube-system
LAST SEEN   TYPE      REASON             OBJECT                                         MESSAGE
10m         Warning   BackOff            pod/calico-kube-controllers-6f698d4f87-z79sz   Back-off restarting failed container calico-kube-controllers in pod calico-kube-controllers-6f698d4f87-z79sz_kube-system(5339f618-0053-45a9-980d-ce9bd80a42b3)
47s         Warning   Unhealthy          pod/calico-node-d762p                          (combined from similar events): Readiness probe failed: 2023-12-26 12:45:03.955 [INFO][55529] confd/health.go 180: Number of node(s) with BGP peering established = 0...
15s         Warning   BackOff            pod/calico-node-tb9sk                          Back-off restarting failed container install-cni in pod calico-node-tb9sk_kube-system(dbe2ac2b-52b2-4c36-a82e-61f3f3ee9fd1)
19s         Warning   FailedScheduling   pod/coredns-59cfb5bf46-p7rz2                   0/2 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
19s         Warning   FailedScheduling   pod/kube-state-metrics-78c475f58b-pdrtt        0/2 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
19s         Warning   FailedScheduling   pod/metrics-server-v0.6.3-69d7fbfdf8-dpcrp     0/2 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
rafal@stagnum0:~$ kubectl logs calico-kube-controllers-6f698d4f87-z79sz -n kube-system --tail=5
2023-12-26 12:46:57.057 [INFO][1] hostendpoints.go 173: successfully synced all hostendpoints
2023-12-26 12:47:01.047 [ERROR][1] main.go 297: Received bad status code from apiserver error=Get "https://10.152.183.1:443/healthz": context deadline exceeded status=0
2023-12-26 12:47:01.047 [INFO][1] main.go 313: Health check is not ready, retrying in 2 seconds with new timeout: 8s
2023-12-26 12:47:11.050 [ERROR][1] main.go 297: Received bad status code from apiserver error=Get "https://10.152.183.1:443/healthz": context deadline exceeded status=0
2023-12-26 12:47:11.050 [INFO][1] main.go 313: Health check is not ready, retrying in 2 seconds with new timeout: 16s

I expected to see three Kubernetes nodes provided by the `kubernetes-worker` units; instead I see two unschedulable nodes that appear to represent the `kubernetes-control-plane` units, and even those don’t appear to work right. I poked around the debug logs and found something that’s probably relevant:

rafal@stagnum0:~$ juju debug-log --include kubernetes-control-plane/0 --no-tail -n 3
unit-kubernetes-control-plane-0: 16:48:48 WARNING unit.unit-kubernetes-control-plane-0.collect-metrics E1226 15:48:48.191384  794230 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": tls: failed to verify certificate: x509: certificate is valid for 192.168.3.31, not 127.0.0.1
unit-kubernetes-control-plane-0: 16:48:48 WARNING unit.unit-kubernetes-control-plane-0.collect-metrics E1226 15:48:48.198407  794230 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": tls: failed to verify certificate: x509: certificate is valid for 192.168.3.31, not 127.0.0.1
unit-kubernetes-control-plane-0: 16:48:48 WARNING unit.unit-kubernetes-control-plane-0.collect-metrics Unable to connect to the server: tls: failed to verify certificate: x509: certificate is valid for 192.168.3.31, not 127.0.0.1

I logged into machines 1 and 2 and found that several kubeconfig files in /root/cdk do indeed specify the API server location as https://127.0.0.1:6443. I think I might have run into a bug, because I tried to follow https://ubuntu.com/kubernetes/docs/install-manual closely, and the only things I tweaked were the number and placement of units and disabling the default ingress, so I’d expect to end up with a functional cluster. If there’s a possible workaround, please let me know.

If there’s anything I can do to help diagnose this issue, please tell me. I see the benefits of managing my k8s cluster with Juju and I’m willing to put in the work.

Cheers, Rafał

First of all, sorry for such a long message! I’m pretty sure nobody has the patience to read all that…

Anyway, I’m pleased to report that I was able to make progress on most of the issues mentioned above. The most important thing I was missing was that I can subdivide the physical machines into LXD containers and deploy application units into them. This works out of the box and is as easy as using placement directives: `juju add-unit <application> --to lxd:0` deploys the unit to a new LXD container launched on machine 0.

Using LXD containers gets around the port conflicts because each container gets its own IP address. Another benefit is that each container gets its own Juju agent, which allows more management operations to run in parallel. When units are co-located directly on a single machine, the Juju agent performs queued operations sequentially.
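In bundle form, the same placement directives look like this (a sketch, abbreviated to two applications):

```yaml
machines:
  "0": {}
  "1": {}
  "2": {}
applications:
  kubeapi-load-balancer:
    num_units: 3
    to: ["lxd:0", "lxd:1", "lxd:2"]   # one fresh LXD container per physical machine
  kubernetes-control-plane:
    num_units: 3
    to: ["lxd:0", "lxd:1", "lxd:2"]   # separate containers, so no port conflicts
```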

I also found out that you cannot co-locate `kubernetes-worker` and `kubernetes-control-plane` directly on the same machine. This might be a naming problem, or maybe some implementation details clash; just confine them in separate LXD containers instead. You might wonder whether you can run containerd and spawn more containers from inside an LXD container, and in fact you can!

Another thing I was confused about were the ports of `kubeapi-load-balancer`. Both 443 and 6443 are forwarded to port 6443 of `kubernetes-control-plane`. The kubeconfig files generated by `kubernetes-control-plane` use port 6443, but you can switch manually to port 443 and it will also work. I think the load balancer should probably listen on a single port, but that’s just a minor wart.
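Concretely, switching a generated kubeconfig from 6443 to 443 is just a matter of editing the cluster’s server URL (fragment sketch; the cluster name is a placeholder and the IP is the load balancer from my setup above):

```yaml
clusters:
- name: my-cluster                     # placeholder name
  cluster:
    certificate-authority-data: <unchanged>
    server: https://192.168.3.30:443   # was https://192.168.3.30:6443
```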