NVIDIA DGX systems are purpose-built hardware for enterprise AI use cases. These platforms feature NVIDIA Tensor Core GPUs, which vastly outperform traditional CPUs for machine learning workloads, alongside advanced networking and storage capabilities.
This guide contains setup instructions for running Charmed Kubeflow on NVIDIA DGX-enabled hardware. It covers both single-node and multi-node environments and includes examples of how to use two components: Jupyter Notebooks and Kubeflow Pipelines.
Requirements:
- NVIDIA DGX-enabled hardware with a correctly configured and updated BIOS, bootloader, OS, drivers, and packages (sample setup instructions are provided below).
- Familiarity with Python, Docker, and Jupyter notebooks.
- Tools: juju and kubectl (a sample installation is shown below).
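Both tools are available as snaps. As a minimal sketch (channels are omitted here; pick the ones matching your Juju and Kubernetes versions):
$ sudo snap install juju --classic
$ sudo snap install kubectl --classic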
Sample Ubuntu and Grub setup
NOTE: The following setup instructions are given only as an example. There is no guarantee that they are sufficient for all environments. Contact your hardware distributor for details on your specific system setup. This document was tested on a vanilla Ubuntu 20.04 installation.
Ensure No Drivers Preinstalled
Make sure you don’t have any NVIDIA drivers preinstalled. You can do that with the following steps:
Check for apt packages:
$ sudo apt list --installed | grep nvidia
If any packages are listed, remove them:
$ sudo apt remove <package-name>
$ sudo apt autoremove
Check for kernel modules (if the output is empty, you are OK):
$ lsmod | grep nvidia
If any modules are listed, remove them:
$ sudo modprobe -r <module-name>
Reboot
$ sudo reboot
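After the reboot, re-run the module check to confirm that nothing NVIDIA-related gets loaded; the command should print nothing:
$ lsmod | grep nvidia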
Grub Setup
Edit /etc/default/grub and add the following options to GRUB_CMDLINE_LINUX_DEFAULT:
modprobe.blacklist=nouveau nouveau.modeset=0
Then update the bootloader configuration and reboot:
$ sudo update-grub
$ sudo reboot
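For illustration, the resulting line might look like this (keep any options that were already present, such as quiet splash):
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nouveau.modeset=0"
After the reboot, the options should show up in the kernel command line:
$ cat /proc/cmdline | grep nouveau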
Contents:
- Install Kubernetes (MicroK8s)
- Enable GPU add-on and configure MIG
- Deploy Charmed Kubeflow
- Try Kubeflow examples
Install Kubernetes (MicroK8s)
Install MicroK8s and enable the required add-ons (adjust the DNS and MetalLB addresses to your network, and replace ubuntu with your user name):
$ sudo snap install microk8s --classic --channel 1.22
$ sudo microk8s enable dns:10.229.32.21 storage ingress registry rbac helm3 metallb:10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111
$ sudo usermod -a -G microk8s ubuntu
$ sudo chown -f -R ubuntu ~/.kube
$ newgrp microk8s
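Before continuing, you can wait until MicroK8s and its add-ons report ready:
$ microk8s status --wait-ready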
Edit /var/snap/microk8s/current/args/containerd-template.toml to add your Docker Hub credentials, which can help avoid anonymous image pull rate limits. Add:
[plugins."io.containerd.grpc.v1.cri".registry.configs]
  [plugins."io.containerd.grpc.v1.cri".registry.configs."registry-1.docker.io".auth]
    username = "<your-docker-hub-username>"
    password = "<your-docker-hub-password>"
Then restart MicroK8s:
$ microk8s.stop; microk8s.start
Enable GPU add-on and configure MIG
Install the GPU operator by enabling the MicroK8s GPU add-on:
$ sudo microk8s.enable gpu
$ mkdir ~/.kube
$ microk8s config > ~/.kube/config
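You can watch the operator pods start up before proceeding (the namespace below is the one the MicroK8s GPU add-on typically uses; it may differ between versions):
$ kubectl get pods -n gpu-operator-resources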
Check the GPU count reported to Kubernetes:
$ kubectl get nodes --show-labels | grep gpu.count
Configure MIG devices by labelling the node (replace <node-name> with your node's name as shown by kubectl get nodes):
$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite
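The MIG reconfiguration takes a moment; NVIDIA's mig-manager tracks its progress in a node label (the label name below is assumed from mig-manager conventions and should read success when done):
$ kubectl get nodes --show-labels | grep mig.config.state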
Recheck the GPU count; it should have increased, since each physical GPU is now partitioned into several MIG devices:
$ kubectl get nodes --show-labels | grep gpu.count
Troubleshooting: If no nodes appear in the output of the get nodes command, uninstall all GPU drivers from the Kubernetes nodes and reinstall MicroK8s.
Deploy Charmed Kubeflow
Deploy Charmed Kubeflow by following the instructions in How to install Charmed Kubeflow.
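For orientation, the flow described there is roughly the following sketch (the bundle channel is omitted and the names are the commonly used defaults; treat the linked guide as authoritative):
$ juju bootstrap microk8s
$ juju add-model kubeflow
$ juju deploy kubeflow --trust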
Try Kubeflow examples
Charmed Kubeflow can run on both single-node and multi-node DGX hardware, each with its own requirements. There are multiple examples that can be tried out in each environment.
Single-node DGX with Charmed Kubeflow examples
There is a GitHub repository that includes all the details about the Single-node DGX with Charmed Kubeflow.
The following examples can be found and tested:
- Jupyter Notebook example on a single-node DGX, in the file gpu-notebook.ipynb from the repository. It also uses a multi-GPU setup.
- Kubeflow Pipeline example on a single-node DGX that uses the same classifier as the Notebook. It is available in the file gpu-pipeline.ipynb.
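Once a notebook server with GPUs attached is running, a quick sanity check from a notebook terminal is to list the visible devices (this assumes the notebook image ships nvidia-smi, which CUDA-enabled images typically do):
$ nvidia-smi -L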
Multi-node DGX with Charmed Kubeflow examples
There is a GitHub repository that includes all the details about the Multi-node DGX with Charmed Kubeflow.
The following examples can be found and tested:
- Training TensorFlow models with multiple GPUs in a Jupyter Notebook using Charmed Kubeflow, in the folder multi-gpu-in-notebook, where the Jupyter Notebook file gpu-notebook.ipynb is available.
- Training TensorFlow models with GPUs in a Kubeflow Pipeline, in the folder multi-gpu-in-pipeline.
- A simulated example of multi-node training in TensorFlow that uses just a single node, in the folder multi-node-gpu-simulated. It contains multiple files describing the workload distribution and how to run it.
- Multi-node training in TensorFlow using the Kubeflow Training Operator's TFJob, in the folder multi-node-gpu-tfjob.