GPU Operator - Charmed Kubernetes

kirzon · 20 June 2024 07:16

I have the following:

juju-dfe073-0 Ready control-plane 7h27m v1.28.11 10.217.162.83 Ubuntu 22.04.4 LTS 6.5.0-1017-aws containerd://1.7.12

juju-dfe073-1 Ready 7h27m v1.28.11 10.217.162.248 Ubuntu 22.04.4 LTS 6.5.0-1017-aws containerd://1.7.12

While installing the GPU Operator charm, I got the following error:

juju deploy nvidia-gpu-operator --channel 1.29/stable --trust

ERROR Charm feature requirements cannot be met:

charm requires feature "k8s-api" but model does not support it

Any advice here?

kwmonroe · 21 June 2024 03:50

Hi @kirzon, thanks for your question!

The issue you’re facing is that the nvidia-gpu-operator charm is meant to be deployed on a kubernetes cloud. It looks like you are attempting to deploy it to a machine cloud (aws, maas, openstack, etc).

The good news is that since you’ve deployed charmed kubernetes, you have everything you need to create a k8s cloud/model for use with juju. You can find our charmed kubernetes + gpu deployment guide with those details here:

https://ubuntu.com/kubernetes/docs/gpu-workers

Specifically see the “Deploying the GPU Operator” section for adding a k8s cloud/model atop charmed kubernetes.
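As a rough sketch of what that section covers, assuming your kubeconfig already points at the Charmed Kubernetes cluster: register the cluster as a k8s cloud on the existing controller, add a model on it, and deploy the charm into that model. The function name and the controller, cloud, and model names are placeholders, not values from the guide.

```shell
# Sketch only: add_gpu_operator and all three names are hypothetical placeholders.
# Assumes KUBECONFIG (or ~/.kube/config) points at the Charmed Kubernetes cluster.
add_gpu_operator() {
  controller="$1"   # existing Juju controller
  cloud="$2"        # name to give the new k8s cloud
  model="$3"        # name to give the new k8s model

  # Register the cluster as a k8s cloud on the controller.
  juju add-k8s "$cloud" --controller "$controller" &&
  # Create a model on that cloud; Juju switches to the new model.
  juju add-model "$model" "$cloud" &&
  # Deploy the charm into the k8s model (same command as in the question).
  juju deploy nvidia-gpu-operator --channel 1.29/stable --trust
}

# Example: add_gpu_operator my-controller k8s-cloud gpu-model
```

The deploy command is the same one that failed earlier; the difference is that it now targets a model that provides the k8s-api feature.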

We can do a better job with the error message you encountered, so I opened the following juju bug to try and clarify that:

Thanks again for the question, and let us know if you run into any more issues.

kirzon · 23 June 2024 13:01

Thanks for your help! I took another approach (it worked). I hope I did not violate the Juju ecosystem :)

I deployed Kubernetes Core, then installed the GPU Operator using the Helm chart from NVIDIA. It worked; everything was successful. I guess I need to learn more about cloud models.

My main goal is more around bare-metal deployments. If there is a production-grade reference architecture for Charmed, that would be great. Air-gapped would be a nice bonus.

Thanks a lot for your help!

afrogrit · 5 February 2025 12:11

Hi,

The above is not working for me. When I run the nvidia-test.yaml file, the output is just blank. This is on a bare-metal MAAS cloud with Charmed K8s.

~$ juju status
Model      Controller         Cloud/Region       Version  SLA          Timestamp
rtx-model  oya-cloud-default  gpu-cloud/default  3.6.2    unsupported  12:10:23Z

App                  Version  Status  Scale  Charm                Channel      Rev  Address         Exposed  Message
nvidia-gpu-operator  v23.9.0  active      1  nvidia-gpu-operator  1.29/stable    4  10.152.183.171  no       Versions: gpu-operator=v23.9.0

Unit                    Workload  Agent  Address        Ports  Message
nvidia-gpu-operator/0*  active    idle   192.168.248.6         Ready
~$
~$ kubectl logs job.batch/nvidia-smi
~$

Thanks for the help. Here is the pod listing:

gpu-operator-699456dcb5-7l8bn                                1/1     Running   0              32m
nvidia-charm-node-feature-discovery-gc-78fd68d66f-kwz84      1/1     Running   0              32m
nvidia-charm-node-feature-discovery-master-69cb7f4db-w5g6l   1/1     Running   0              32m
nvidia-charm-node-feature-discovery-worker-422hm             1/1     Running   0              32m
nvidia-charm-node-feature-discovery-worker-9jw45             1/1     Running   0              32m
nvidia-charm-node-feature-discovery-worker-dzx9t             1/1     Running   0              32m
nvidia-charm-node-feature-discovery-worker-sklgt             1/1     Running   0              32m
nvidia-charm-node-feature-discovery-worker-vlb45             1/1     Running   0              32m
nvidia-smi-9h7fz                                             0/1     Pending   0              18m
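
One likely reading of the output above: `kubectl logs` prints nothing because the nvidia-smi pod is still Pending, so its container has never started and there are no logs yet. Below is a minimal sketch for spotting such pods and finding out why they are unscheduled. The helper name `not_running` is made up for illustration, and it assumes the default `kubectl get pods` column layout.

```shell
# List pods whose STATUS column is neither Running nor Completed.
# Reads `kubectl get pods`-style rows on stdin: NAME READY STATUS RESTARTS AGE
not_running() {
  awk '$1 != "NAME" && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage against a live cluster (pod name taken from the listing above):
#   kubectl get pods | not_running
#   kubectl describe pod nvidia-smi-9h7fz   # the Events section explains why it is Pending
```

If the events mention an unsatisfiable `nvidia.com/gpu` request, the next thing to check would be whether any node advertises that resource, for example with `kubectl describe node <node>`.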