Charmed Kubernetes, NVIDIA gpu-operator install issues

I have a new Charmed Kubernetes install that I'm testing, with MAAS as the cloud provider. All systems provision and come up correctly. However, I am attempting to use the NVIDIA gpu-operator for the GPU nodes. Details:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html
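
For reference, the install steps I followed were the standard Helm flow from that page (the namespace and release name below are my choices, nothing special):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace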

The install starts off fine, but then I run into an error in one of the operator pods:

1.6564539553979924e+09  INFO  controllers.ClusterPolicy  DaemonSet not found, creating  {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Name": "nvidia-driver-daemonset"}
1.6564539554024365e+09  INFO  controllers.ClusterPolicy  Couldn't create DaemonSet  {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Name": "nvidia-driver-daemonset", "Error": "DaemonSet.apps \"nvidia-driver-daemonset\" is invalid: [spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by cluster policy, spec.template.spec.initContainers[0].securityContext.privileged: Forbidden: disallowed by cluster policy]"}

1.6564539554024835e+09  ERROR  controller.clusterpolicy-controller  Reconciler error  {"name": "cluster-policy", "namespace": "", "error": "DaemonSet.apps \"nvidia-driver-daemonset\" is invalid: [spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by cluster policy, spec.template.spec.initContainers[0].securityContext.privileged: Forbidden: disallowed by cluster policy]"}
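
The "disallowed by cluster policy" part looks like the cluster-wide privileged-container gate rather than anything in the operator itself. If I'm reading the charm docs right, this is controlled by the control-plane charm's allow-privileged option (the application name below assumes a recent release; older deployments call it kubernetes-master):

# check the current setting; "false" would explain the Forbidden error
$ juju config kubernetes-control-plane allow-privileged
# allow privileged pods cluster-wide so the driver daemonset can be created
$ juju config kubernetes-control-plane allow-privileged=true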

Any suggestions as to how to resolve this?

Hi @schwim,

Did you solve the problem? I'm having the same issue. Here is what the namespace looks like on my side:

csi-rbdplugin-qjvjg                                          2/2     Running                 14 (49s ago)    44h
gpu-feature-discovery-5q4vz                                  0/1     Init:0/1                0               32m
gpu-feature-discovery-875cp                                  0/1     Init:0/1                0               32m
gpu-operator-586cc57c8f-jx9fk                                1/1     Running                 0               32m
gpu-test                                                     0/1     Pending                 0               21h
nvidia-charm-node-feature-discovery-gc-cd4d8cd49-27jlj       1/1     Running                 0               32m
nvidia-charm-node-feature-discovery-master-cd9bcdb94-w59kh   1/1     Running                 0               32m
nvidia-charm-node-feature-discovery-worker-bqrhs             1/1     Running                 0               32m
nvidia-charm-node-feature-discovery-worker-qvwtv             1/1     Running                 0               32m
nvidia-charm-node-feature-discovery-worker-tncvr             1/1     Running                 0               32m
nvidia-charm-node-feature-discovery-worker-v587d             1/1     Running                 2 (6m12s ago)   32m
nvidia-charm-node-feature-discovery-worker-x8ddj             0/1     Unknown                 0               32m
nvidia-container-toolkit-daemonset-6ghz4                     0/1     Init:CrashLoopBackOff   10 (29s ago)    32m
nvidia-container-toolkit-daemonset-fm6jg                     0/1     Init:Unknown            0               32m
nvidia-dcgm-exporter-6qksd                                   0/1     Init:0/1                0               32m
nvidia-dcgm-exporter-kvmcn                                   0/1     Init:0/1                0               32m
nvidia-device-plugin-daemonset-mm5lb                         0/1     Init:0/1                0               32m
nvidia-device-plugin-daemonset-pddxq                         0/1     Init:0/1                0               32m
nvidia-operator-validator-fwbvt                              0/1     Init:0/4                0               32m
nvidia-operator-validator-zqj94                              0/1     Init:0/4                0               32m
nvidia-smi-12-8-gtqfb                                        0/1     Pending                 0               63m
nvidia-smi-9rchc                                             0/1     Pending                 0               66m
:~$

From the daemonset pod:

Warning  FailedCreatePodSandBox  14s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/1439dbd4154812ceec0f714fd71690449050df88695e4e8bc3706b0f657dffcc/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  1s                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/5afee71585855aeff01a33e4332a879e406cb88ba459e2f87388d6ab5d4d7ec4/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
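
The fork/exec failure suggests containerd has been pointed at nvidia-container-runtime, but the binary was never actually installed on that node (the toolkit daemonset, which normally installs it, is in CrashLoopBackOff above). A quick way to confirm on the node itself, assuming the stock paths:

# does the runtime binary exist where containerd expects it?
$ ls -l /usr/bin/nvidia-container-runtime
# is containerd configured to use the nvidia runtime?
$ grep -B2 -A4 nvidia /etc/containerd/config.toml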

From the test pod:

  Warning  FailedScheduling  9m59s (x2 over 13m)    default-scheduler  0/5 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 2 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  9m23s (x2 over 9m49s)  default-scheduler  0/5 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 Insufficient nvidia.com/gpu. preemption: 0/5 nodes are available: 2 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
  Warning  FailedScheduling  3m40s (x2 over 4m20s)  default-scheduler  0/5 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 2 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  2m23s (x2 over 3m29s)  default-scheduler  0/5 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 Insufficient nvidia.com/gpu. preemption: 0/5 nodes are available: 2 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
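
The "Insufficient nvidia.com/gpu" part means no node is advertising the GPU resource yet, which is consistent with the device-plugin pods stuck in Init above; the scheduler can't do anything until those come up. For reference, my gpu-test pod is just a minimal extended-resource request, roughly like this (the image tag is illustrative):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1  # stays Pending until the device plugin advertises GPUs
EOF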

Below is what I see inside the VM.

NB: fast-hippo is the VM

:~$ kubectl describe node fast-hippo | grep -i gpu
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu-driver-upgrade-enabled: true
  default                          gpu-feature-discovery-875cp                         0 (0%)        0 (0%)       0 (0%)           0 (0%)               38m
:~$
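
Note the nvidia.com/gpu.deploy.driver=pre-installed label: the operator has decided not to run its driver daemonset because it believes a driver already exists on the node. That seems worth verifying directly on fast-hippo (assuming SSH access to the VM):

# confirm the kernel module and userspace driver are really there
$ lsmod | grep nvidia
$ nvidia-smi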

