How to set up Charmed Kubeflow on NVIDIA DGX

NVIDIA DGX systems are purpose-built hardware for enterprise AI use cases. These platforms feature NVIDIA Tensor Core GPUs, which vastly outperform traditional CPUs for machine learning workloads, alongside advanced networking and storage capabilities.

This guide contains setup instructions for running Charmed Kubeflow on NVIDIA DGX-enabled hardware. It covers both single-node and multi-node environments, and includes examples of how to use two components: Jupyter Notebooks and Kubeflow Pipelines.

Requirements:

  • NVIDIA DGX-enabled hardware with correctly configured/updated BIOS settings, bootloader, OS, drivers, and packages (sample setup instructions are provided below).
  • Familiarity with Python, Docker, and Jupyter notebooks.
  • Tools: juju, kubectl.
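
If juju and kubectl are not already installed, one common approach is to install them as snaps (note that MicroK8s also ships its own client, available as microk8s kubectl):

$ sudo snap install juju --classic
$ sudo snap install kubectl --classic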

Sample Ubuntu and Grub setup

NOTE: The following setup instructions are given only as an example. There is no guarantee that they will be sufficient for all environments. Contact your hardware distributor for details on your specific system setup. These instructions were tested on vanilla Ubuntu 20.04.

Ensure No Drivers Preinstalled

Make sure you don’t have any NVIDIA drivers preinstalled. You can check with the following steps:

Check for apt packages:

$ sudo apt list --installed | grep nvidia

If any packages are listed, remove them:

$ sudo apt remove <package-name>
$ sudo apt autoremove

Check for loaded kernel modules (if the output is empty, you are OK):

$ lsmod | grep nvidia

If any modules are listed, remove them:

$ sudo modprobe -r <module-name>

Reboot:

$ sudo reboot

Grub Setup

Edit /etc/default/grub and add the following options to GRUB_CMDLINE_LINUX_DEFAULT to blacklist the nouveau driver:

GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau nouveau.modeset=0"

Then update GRUB and reboot:

$ sudo update-grub
$ sudo reboot
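
After the reboot, you can verify that the nouveau driver is no longer loaded; this command should produce no output:

$ lsmod | grep nouveau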

Install Kubernetes (MicroK8s)

Install MicroK8s and enable the required add-ons. Adjust the DNS forwarder and MetalLB address ranges to match your environment:

$ sudo snap install microk8s --classic --channel 1.22
 
$ sudo microk8s enable dns:10.229.32.21 storage ingress registry rbac helm3 metallb:10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111
 
$ sudo usermod -a -G microk8s ubuntu
$ sudo chown -f -R ubuntu ~/.kube
$ newgrp microk8s
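
Optionally, wait for MicroK8s to report that it is ready:

$ microk8s status --wait-ready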

Edit /var/snap/microk8s/current/args/containerd-template.toml and add your Docker Hub credentials (for example, to avoid Docker Hub image pull rate limits). Replace <username> and <password> with your own:

[plugins."io.containerd.grpc.v1.cri".registry.configs]
  [plugins."io.containerd.grpc.v1.cri".registry.configs."registry-1.docker.io".auth]
    username = "<username>"
    password = "<password>"

Then restart MicroK8s:

$ microk8s.stop; microk8s.start

Enable GPU add-on and configure MIG

Install GPU operator:

$ sudo microk8s.enable gpu
$ mkdir -p ~/.kube
$ microk8s config > ~/.kube/config
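
You can watch the GPU operator components come up (namespaces and pod names vary with the MicroK8s and GPU operator versions):

$ kubectl get pods -A | grep -i nvidia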

Check the GPU count reported to Kubernetes:

$ kubectl get nodes --show-labels | grep gpu.count

Configure MIG devices (replace <node-name> with your node’s name, as reported by kubectl get nodes):

$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite
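
You can confirm that the MIG configuration label was applied:

$ kubectl describe node <node-name> | grep nvidia.com/mig.config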

Recheck the GPU count (it should have increased, since each physical GPU is now partitioned into multiple MIG devices):

$ kubectl get nodes --show-labels | grep gpu.count

Troubleshooting: if no nodes appear in the get nodes command, uninstall all GPU drivers from the Kubernetes nodes and reinstall MicroK8s.
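
A minimal recovery sequence might look like the following (an illustrative sketch; adjust the package names to whatever is actually installed on your nodes):

$ sudo snap remove microk8s --purge
$ sudo apt list --installed | grep nvidia
$ sudo apt remove <package-name>
$ sudo apt autoremove
$ sudo snap install microk8s --classic --channel 1.22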

Deploy Charmed Kubeflow

Follow the instructions in How to install Charmed Kubeflow to deploy it.
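
While the deployment settles, you can monitor its progress with juju (assuming the model is named kubeflow, as in the installation guide):

$ juju status --watch 5s -m kubeflow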

Try Kubeflow examples

Charmed Kubeflow can run on both single-node and multi-node DGX hardware. The requirements differ depending on the environment, and there are multiple examples to try out in each case.

Single-node DGX with Charmed Kubeflow examples

There is a GitHub repository that includes all the details about the Single-node DGX with Charmed Kubeflow.
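
To run the examples, clone the repository locally (the <repository-url> placeholder stands for the repository linked above):

$ git clone <repository-url>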

The following examples can be found and tested:

  • A Jupyter Notebook example on a single-node DGX, in the file gpu-notebook.ipynb from the repository. It also uses a multi-GPU setup.
  • A Kubeflow Pipelines example on a single-node DGX that uses the same classifier as the notebook. It is available in the file gpu-pipeline.ipynb.

Multi-node DGX with Charmed Kubeflow examples

There is a GitHub repository that includes all the details about the Multi-node DGX with Charmed Kubeflow.

The following examples can be found and tested:

  • Training TensorFlow models with multiple GPUs in a Jupyter Notebook using Charmed Kubeflow, in the folder multi-gpu-in-notebook, which contains the Jupyter Notebook file gpu-notebook.ipynb.
  • Training TensorFlow models with GPUs in a Kubeflow Pipeline, in the folder multi-gpu-in-pipeline.
  • A simulated example of multi-node training in TensorFlow that uses just a single node, in the folder multi-node-gpu-simulated. The folder contains multiple files describing the workload distribution and how to run it.
  • Multi-node training in TensorFlow using the Kubeflow Training Operator’s TFJob, in the folder multi-node-gpu-tfjob. Once a TFJob is submitted, its status can be checked as shown below.
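
A quick way to check on submitted TFJobs across all namespaces (this assumes the Training Operator CRDs are installed, which Charmed Kubeflow provides):

$ kubectl get tfjobs -A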