Hi folks,
I have downgraded from Kubernetes 1.16 to 1.15 and used the Juju method for installation.
I deployed the NFS charm and associated it with the Kubernetes worker. Although this worked flawlessly on 1.16, the same setup seems to result in an NFS protocol mismatch on 1.15. I'm not sure if this is a bug in the NFS charm, or if there is an easy way to resolve it. (I imagine I just need to change the nfs provisioner definition somehow.)
From kubectl get events:
LAST SEEN TYPE REASON OBJECT MESSAGE
5m54s Normal Scheduled pod/nfs-client-provisioner-7497897b88-92lm7 Successfully assigned default/nfs-client-provisioner-7497897b88-92lm7 to juju-e52e83-4
5m53s Warning FailedMount pod/nfs-client-provisioner-7497897b88-92lm7 MountVolume.SetUp failed for volume "nfs-client-root" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/68f581eb-2895-44ea-b902-b20ed3bdaef7/volumes/kubernetes.io~nfs/nfs-client-root --scope -- mount -t nfs 192.168.54.78:/srv/data/kubernetes-worker /var/lib/kubelet/pods/68f581eb-2895-44ea-b902-b20ed3bdaef7/volumes/kubernetes.io~nfs/nfs-client-root
Output: Running scope as unit: run-r911500b6fcbe43ed90b0da03b909b2d0.scope
mount.nfs: requested NFS version or transport protocol is not supported
5m53s Warning FailedMount pod/nfs-client-provisioner-7497897b88-92lm7 MountVolume.SetUp failed for volume "nfs-client-root" : mount failed: exit status 32
Actually, I'm wondering if this is a bug in the version of the kubernetes-worker charm included in charmed-kubernetes-270 for Kubernetes 1.15, as this seems to be an issue in the worker's mount relation that is not present in the latest version…
Tried removing NFS and re-adding it to the model with the NFS options tcp,nfsvers=4, but that didn't seem to change the problem on the worker side. Going to try this again by destroying the model and redeploying with these options from the start, in case my changes just aren't taking…
Destroying and recreating the model did not resolve the issue, and adding the mount options didn't help either:
routhinator@andromeda:~$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
12m Warning FailedMount pod/nfs-client-provisioner-7fbbc85766-spwpd Unable to mount volumes for pod "nfs-client-provisioner-7fbbc85766-spwpd_default(6049df72-1131-4a8f-904e-71e01f86e556)": timeout expired waiting for volumes to attach or mount for pod "default"/"nfs-client-provisioner-7fbbc85766-spwpd". list of unmounted volumes=[nfs-client-root]. list of unattached volumes=[nfs-client-root default-token-qh2g5]
3m59s Warning FailedMount pod/nfs-client-provisioner-7fbbc85766-spwpd (combined from similar events): MountVolume.SetUp failed for volume "nfs-client-root" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/6049df72-1131-4a8f-904e-71e01f86e556/volumes/kubernetes.io~nfs/nfs-client-root --scope -- mount -t nfs 192.168.54.189:/srv/data/kubernetes-worker /var/lib/kubelet/pods/6049df72-1131-4a8f-904e-71e01f86e556/volumes/kubernetes.io~nfs/nfs-client-root
Output: Running scope as unit: run-r34da98346d734fb29dc54d4ca55d8b03.scope
mount.nfs: requested NFS version or transport protocol is not supported
Combing through the code on the kubernetes-worker, I see the deployment for the nfs provisioner doesn't pass any mount options, even in the latest version… so I guess I'm barking up the wrong tree with the client mount options. Still really confused about this error. I'll start digging into the code for the NFS charm and see if I can find anything. It certainly seems (based on the error) that the provisioner is requesting NFSv3 from an NFSv4 server…
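One way to confirm that suspicion would be to ask the server which protocol versions it is actually serving. A sketch (the fallback messages are mine; 192.168.54.78 is the server address from the mount error above, and the /proc/fs/nfsd interface is only present once nfsd is set up on the server):

```shell
# On the NFS server unit: show which protocol versions nfsd is serving
# (requires the nfsd filesystem mounted at /proc/fs/nfsd).
cat /proc/fs/nfsd/versions 2>/dev/null || echo "nfsd proc interface not available here"
# From a client, rpcinfo lists the services and versions the server advertises
# (192.168.54.78 is the server address from the mount error above):
command -v rpcinfo >/dev/null 2>&1 && rpcinfo -p 192.168.54.78 || echo "rpcinfo not installed"
```

If the versions file shows only +3 (or rpcinfo shows no nfs v4 entry), the mismatch is on the server side rather than in the provisioner.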
Doh, I've been barking up the wrong tree. It looks like the Canonical docs should mention the gotcha for LXD deployments that's covered on the GitHub page for the NFS charm, but I missed it initially.
Oddly, before I wiped my 1.16 install and downgraded, I did not have to do this last time:
On the LXC host:
apt-get install nfs-common
modprobe nfsd
mount -t nfsd nfsd /proc/fs/nfsd
Edit /etc/apparmor.d/lxc/lxc-default and add the following four lines to it:
mount fstype=nfs,
mount fstype=nfs4,
mount fstype=nfsd,
mount fstype=rpc_pipefs,
after which:
sudo /etc/init.d/apparmor restart
Finally:
juju deploy nfs
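Note that those mount rules have to land inside the profile's braces, so a blind append to the end of the file would fall outside the block. A minimal sketch of the edit (GNU sed assumed; it works on a copy in /tmp so the result can be inspected first, and the printf fallback is only a stand-in profile for machines without the real file):

```shell
# Work on a copy of the LXC apparmor profile; the printf fallback is a
# stand-in used only when the real file isn't present on this machine.
cp /etc/apparmor.d/lxc/lxc-default /tmp/lxc-default.new 2>/dev/null || \
  printf 'profile lxc-container-default flags=(attach_disconnected,mediate_deleted) {\n  #include <abstractions/lxc/container-base>\n}\n' \
    > /tmp/lxc-default.new
# Insert the mount rules just before the closing brace (GNU sed accepts \n
# in the i-command text), not after it.
sed -i '/^}/i mount fstype=nfs,\nmount fstype=nfs4,\nmount fstype=nfsd,\nmount fstype=rpc_pipefs,' /tmp/lxc-default.new
grep 'mount fstype' /tmp/lxc-default.new
```

After copying the result back over /etc/apparmor.d/lxc/lxc-default, reload apparmor as above.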
I'm thinking this is what made it suddenly fail, though I can't imagine what has changed since the last time I deployed.
However, I can definitely see that the NFS server is failing to start in the LXC container:
root@juju-f9675b-5:/var/log/juju# journalctl -xe
--
-- Unit nfs-idmapd.service has failed.
--
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Dependency failed for NFS Mount Daemon.
-- Subject: Unit nfs-mountd.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit nfs-mountd.service has failed.
--
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: nfs-mountd.service: Job nfs-mountd.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: nfs-server.service: Job nfs-server.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 mount[542]: mount: /run/rpc_pipefs: permission denied.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Started Preprocess NFS configuration.
-- Subject: Unit nfs-config.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit nfs-config.service has finished starting up.
--
-- The start-up result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: run-rpc_pipefs.mount: Mount process exited, code=exited status=32
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: run-rpc_pipefs.mount: Failed with result 'exit-code'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Failed to mount RPC Pipe File System.
-- Subject: Unit run-rpc_pipefs.mount has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit run-rpc_pipefs.mount has failed.
--
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Dependency failed for RPC security service for NFS client and server.
-- Subject: Unit rpc-gssd.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit rpc-gssd.service has failed.
--
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: Dependency failed for RPC security service for NFS server.
-- Subject: Unit rpc-svcgssd.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit rpc-svcgssd.service has failed.
--
-- The result is RESULT.
Oct 08 22:14:29 juju-f9675b-5 systemd[1]: rpc-svcgssd.service: Job rpc-svcgssd.service/start failed with result 'dependency'.
I didn't think to look at this at all, as the Juju interface reported that all was well with the NFS unit… I guess it doesn't actually check whether the NFS service is operational.
Oddly, even with all the advice applied, the mount permission added to every variant of the LXD apparmor profiles, and the containers (and even the whole server) restarted, I cannot get past the failure to mount the RPC pipe file system.
OK, this looks like something that was handled for me before but didn't happen this time.
I managed to get the NFS server running and resolve the issue, but I had to do the following config modifications to the NFS container:
lxc config set juju-741e57-5 raw.apparmor="mount fstype=rpc_pipefs, mount fstype=nfsd,"
lxc config set juju-741e57-5 security.privileged true
I don't see this mentioned anywhere, and the first command should have been covered by the apparmor profile. Is this a new issue?
Hey @routhinator, nice detective work (and persistence). It looks like we should get an LXD profile added to the nfs charm. As an example, here's the one we added for kubernetes-worker in the 1.16 release.
If a lxd-profile.yaml exists in the root of the charm, Juju will apply it when deploying the charm to LXD - more info in the docs.
You could test it by cloning the nfs repo, adding a lxd-profile.yaml with the rules you need, running charm build on it, then juju deploy /path/to/local/nfs/charm.
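For reference, here's roughly what such a lxd-profile.yaml could contain, assembled from the workaround earlier in this thread. The exact keys are my guess at mapping that workaround into LXD profile syntax, and the file is written to /tmp here purely for illustration (the real one belongs in the charm's root directory):

```shell
# Sketch of a lxd-profile.yaml built from the workaround in this thread.
# Keys are my guess at the LXD profile syntax; written to /tmp only so it
# can be inspected - the real file goes in the charm's root directory.
cat > /tmp/lxd-profile.yaml <<'EOF'
config:
  linux.kernel_modules: nfsd
  raw.apparmor: |
    mount fstype=nfs,
    mount fstype=nfs4,
    mount fstype=nfsd,
    mount fstype=rpc_pipefs,
  security.privileged: "true"
description: ""
devices: {}
EOF
cat /tmp/lxd-profile.yaml
```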
Oh interesting, would this happen to be related to my new problem, which is not being able to get outgoing connections from pods?
# Gitlab runner:
ERROR: Registering runner... failed runner=a5sszn7s status=couldn't execute POST against https://gitlab.routh.io/api/v4/runners: Post https://gitlab.routh.io/api/v4/runners: dial tcp: i/o timeout
PANIC: Failed to register this runner. Perhaps you are having network problems
# cert-manager issuers:
3m20s Warning ErrVerifyACMEAccount clusterissuer/letsencrypt-prod Failed to verify ACME account: Get https://acme-v02.api.letsencrypt.org/directory: dial tcp: i/o timeout
3m21s Warning ErrVerifyACMEAccount clusterissuer/letsencrypt-staging Failed to verify ACME account: Get https://acme-staging-v02.api.letsencrypt.org/directory: dial tcp: i/o time
This wasn't a problem when I deployed 1.16, but looking at the 1.15 commit for this, the profile file isn't there. I'm guessing all of this is related to what the docs mention about using conjure-up with LXD, as it does extra configuration.
So I guess I need to add these perms to my Kube worker and master to fix my odd issues that remain.
I doubt it. Even though you’re using 1.15, you’re still using the latest versions of the charms, which have the lxd profiles in them.
@tvansteenburgh
Coming back to this thread after a couple of weeks of reinstalling and testing… I'm thinking this new method of installing is still missing some magic that conjure-up does to LXC to make Kubernetes work.
To clarify, I still cannot get this to deploy with juju deploy on a fresh LXD cluster without broken networking or DNS… I'm still not familiar enough with Kubernetes to narrow down which.
This works when I do the following:
- Wipe server and install LXD fresh
- Deploy charmed-kubernetes with conjure up
- Remove conjure-up controller and model/containers
- Run juju deploy after bootstrapping a fresh controller
If I do this, I get a working cluster.
When I:
- Wipe server and install LXD fresh
- Bootstrap a juju controller
- Run juju deploy
The cluster says it’s up, it can pull and deploy workloads, but those workloads cannot connect to the internet or cluster services, and the services do not respond from the internet.
Any requests to the internet from the pods results in: dial tcp: i/o timeout
I was digging and digging on what to do here and found this old GitHub issue - https://github.com/charmed-kubernetes/bundle/issues/286 - so I ran the test that @ktsakalozos asked the OP to run, and this is what I get:
ansible@andromeda:~$ kubectl apply -f https://k8s.io/examples/application/shell-demo.yaml
pod/shell-demo created
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
error: unable to upgrade connection: container not found ("nginx")
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
error: unable to upgrade connection: container not found ("nginx")
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
error: unable to upgrade connection: container not found ("nginx")
ansible@andromeda:~$ kubectl get pod shell-demo
NAME READY STATUS RESTARTS AGE
shell-demo 1/1 Running 0 28s
ansible@andromeda:~$ kubectl exec -it shell-demo -- /bin/bash
root@juju-992433-4:/#
root@juju-992433-4:/# getent hosts default-http-backend
root@juju-992433-4:/#
So for some reason default-http-backend comes up empty when things are deployed with pure Juju, but this problem does not exist when the LXD cluster is prepared by conjure-up first… something is off, but I'm not sure where else to look.
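For anyone following along, the in-pod lookup above can be expanded a little to show which nameserver the pod is actually using (run inside any pod; the fallback message is mine):

```shell
# Inside a pod: look up a cluster service via the libc resolver, then show
# which nameserver the pod is actually configured to use.
getent hosts default-http-backend || echo "no answer from cluster DNS"
cat /etc/resolv.conf
```

If resolv.conf doesn't point at the cluster DNS service IP, the problem is in how kubelet wired up the pod, not in CoreDNS itself.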
A bit more data, these are the logs from the Core DNS pods:
ansible@andromeda:~$ kubectl logs -n kube-system -f pod/coredns-78d9f9956c-97f9c
2019-10-19T01:24:15.219Z [INFO] plugin/ready: Still waiting on: "kubernetes"
.:53
2019-10-19T01:24:17.125Z [INFO] plugin/reload: Running configuration MD5 = 76dd40a8d85f49bc080c15939532be01
2019-10-19T01:24:17.126Z [INFO] CoreDNS-1.5.1
2019-10-19T01:24:17.126Z [INFO] linux/amd64, go1.12.6, 6c33397
CoreDNS-1.5.1
linux/amd64, go1.12.6, 6c33397
E1019 01:30:13.545733 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Namespace: Get https://10.152.183.1:443/api/v1/namespaces?resourceVersion=68050&timeout=5m19s&timeoutSeconds=319&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
E1019 01:30:13.545906 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Endpoints: Get https://10.152.183.1:443/api/v1/endpoints?resourceVersion=69468&timeout=8m26s&timeoutSeconds=506&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
E1019 01:30:13.546043 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Service: Get https://10.152.183.1:443/api/v1/services?resourceVersion=68050&timeout=7m7s&timeoutSeconds=427&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
2019-10-19T17:15:06.047Z [ERROR] plugin/errors: 2 monitoring-influxdb.home.routh.io. A: read udp 10.1.1.39:36492->192.168.52.1:53: i/o timeout
2019-10-19T17:15:12.139Z [ERROR] plugin/errors: 2 monitoring-influxdb. A: read udp 10.1.1.39:47426->192.168.52.1:53: i/o timeout
2019-10-19T17:15:12.139Z [ERROR] plugin/errors: 2 monitoring-influxdb. AAAA: read udp 10.1.1.39:39629->192.168.52.1:53: i/o timeout
2019-10-19T17:15:17.139Z [ERROR] plugin/errors: 2 monitoring-influxdb. AAAA: read udp 10.1.1.39:45793->192.168.52.1:53: i/o timeout
2019-10-19T17:15:37.338Z [ERROR] plugin/errors: 2 monitoring-influxdb.home.routh.io. A: read udp 10.1.1.39:50122->192.168.52.1:53: i/o timeout
2019-10-19T17:15:42.438Z [ERROR] plugin/errors: 2 monitoring-influxdb.home.routh.io. AAAA: read udp 10.1.1.39:59166->192.168.52.1:53: i/o timeout
2019-10-19T17:15:42.438Z [ERROR] plugin/errors: 2 monitoring-influxdb.home.routh.io. A: read udp 10.1.1.39:36316->192.168.52.1:53: i/o timeout
2019-10-19T17:15:47.438Z [ERROR] plugin/errors: 2 monitoring-influxdb. A: read udp 10.1.1.39:50043->192.168.52.1:53: i/o timeout
2019-10-19T17:16:12.151Z [ERROR] plugin/errors: 2 monitoring-influxdb. A: read udp 10.1.1.39:59633->192.168.52.1:53: i/o timeout
2019-10-19T17:16:37.439Z [ERROR] plugin/errors: 2 monitoring-influxdb.home.routh.io. AAAA: read udp 10.1.1.39:41581->192.168.52.1:53: i/o timeout
2019-10-19T17:16:42.538Z [ERROR] plugin/errors: 2 monitoring-influxdb.home.routh.io. AAAA: read udp 10.1.1.39:60089->192.168.52.1:53: i/o timeout
2019-10-19T17:16:52.538Z [ERROR] plugin/errors: 2 monitoring-influxdb. A: read udp 10.1.1.39:55278->192.168.52.1:53: i/o timeout
2019-10-19T17:17:05.005Z [ERROR] plugin/errors: 2 monitoring-influxdb.home.routh.io. A: read udp 10.1.1.39:60724->192.168.52.1:53: read: connection refused
2019-10-19T17:17:12.038Z [ERROR] plugin/errors: 2 monitoring-influxdb. AAAA: read udp 10.1.1.39:50721->192.168.52.1:53: i/o timeout
2019-10-19T17:17:15.081Z [ERROR] plugin/errors: 2 monitoring-influxdb. AAAA: read udp 10.1.1.39:59741->192.168.52.1:53: read: connection refused
E1019 17:34:02.285248 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Endpoints: Get https://10.152.183.1:443/api/v1/endpoints?resourceVersion=238639&timeout=6m41s&timeoutSeconds=401&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
E1019 17:34:02.285310 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Namespace: Get https://10.152.183.1:443/api/v1/namespaces?resourceVersion=69472&timeout=7m15s&timeoutSeconds=435&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
E1019 17:34:02.285349 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Service: Get https://10.152.183.1:443/api/v1/services?resourceVersion=69472&timeout=8m46s&timeoutSeconds=526&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
E1020 00:20:11.732656 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Service: Get https://10.152.183.1:443/api/v1/services?resourceVersion=238640&timeout=5m56s&timeoutSeconds=356&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
E1020 00:20:11.732736 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Namespace: Get https://10.152.183.1:443/api/v1/namespaces?resourceVersion=238640&timeout=7m14s&timeoutSeconds=434&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
E1020 00:20:11.732817 1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.Endpoints: Get https://10.152.183.1:443/api/v1/endpoints?resourceVersion=309899&timeout=5m29s&timeoutSeconds=329&watch=true: dial tcp 10.152.183.1:443: connect: connection refused
AHA! I finally found what was missing from a pure Juju install vs a conjure-up install: the br_netfilter kernel module. Without it, DNS replies inside the pods come back from the wrong source address:
root@my-shell-5dbb49b954-rj8nd:/# dig google.com
;; reply from unexpected source: 10.1.20.10#53, expected 10.152.183.219#53
;; reply from unexpected source: 10.1.20.10#53, expected 10.152.183.219#53
$ lxc profile create netfilter
$ lxc profile set netfilter linux.kernel_modules br_netfilter
$ lxc profile show netfilter
name: netfilter
config:
linux.kernel_modules: br_netfilter
description: ""
devices: {}
$ lxc profile apply juju-91f454-3 default,juju-default,juju-default-kubernetes-master-754,netfilter
$ lxc profile apply juju-91f454-4 default,juju-default,juju-default-kubernetes-worker-590,netfilter
$ lxc restart juju-91f454-4
$ lxc restart juju-91f454-5
root@my-shell-5dbb49b954-mkdj5:/# dig google.com
; <<>> DiG 9.10.3-P4-Ubuntu <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56774
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 30 IN A 172.217.14.238
;; Query time: 18 msec
;; SERVER: 10.152.183.219#53(10.152.183.219)
;; WHEN: Tue Nov 12 23:32:45 UTC 2019
;; MSG SIZE rcvd: 65
And now DNS works. This is what was missing.
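If anyone else hits this, here's a quick check I'd run inside a worker container to confirm the module is present (the echo strings are mine; br_netfilter is what makes bridged pod traffic visible to iptables, and containers can't modprobe it themselves, so it has to come from the LXD host):

```shell
# Check whether br_netfilter is loaded (or built into the kernel). It must
# be loaded from the LXD host, e.g. via linux.kernel_modules as above.
if grep -qsw br_netfilter /proc/modules "/lib/modules/$(uname -r)/modules.builtin"; then
  echo "br_netfilter available"
else
  echo "br_netfilter missing - add it via linux.kernel_modules on the host"
fi
# When the module is loaded, this sysctl should exist and be 1:
sysctl net.bridge.bridge-nf-call-iptables 2>/dev/null || true
```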