NVIDIA DGX systems are purpose-built hardware for enterprise AI use cases. These platforms feature NVIDIA Tensor Core GPUs, which vastly outperform traditional CPUs for machine learning workloads, alongside advanced networking and storage capabilities.
This guide contains setup instructions for running Charmed Kubeflow on NVIDIA DGX-enabled hardware. It covers both single-node and multi-node environments and includes examples of how to use two components: Jupyter Notebooks and Kubeflow Pipelines.
Requirements:
- NVIDIA DGX-enabled hardware with a correctly configured and updated BIOS, bootloader, OS, drivers, and packages (sample setup instructions are provided below).
- Familiarity with Python, Docker, and Jupyter notebooks.
- Tools: juju and kubectl (a sample installation is shown below).
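Both tools are available as snaps. As a minimal sketch (channels are omitted here; pick the ones matching your Juju and Kubernetes versions):
$ sudo snap install juju --classic
$ sudo snap install kubectl --classic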
Sample Ubuntu and Grub setup
NOTE: The following setup instructions are given only as an example. There is no guarantee that they are sufficient for all environments. Contact your hardware distributor for details on your specific system setup. This document was tested on a vanilla Ubuntu 20.04 installation.
Ensure No Drivers Preinstalled
Make sure you don’t have any NVIDIA drivers preinstalled. You can do that with the following steps:
Check for apt packages:
$ sudo apt list --installed | grep nvidia
If any packages are listed, remove them:
$ sudo apt remove <package-name>
$ sudo apt autoremove
Check for kernel modules (if the output is empty, you are OK):
$ lsmod | grep nvidia
If any modules are listed, remove them:
$ sudo modprobe -r <module-name>
Reboot
$ sudo reboot
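After the reboot, re-run the module check to confirm that nothing NVIDIA-related gets loaded; the command should print nothing:
$ lsmod | grep nvidia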
Grub Setup
Edit /etc/default/grub and add the following options to GRUB_CMDLINE_LINUX_DEFAULT:
modprobe.blacklist=nouveau nouveau.modeset=0
Then update the bootloader configuration and reboot:
$ sudo update-grub
$ sudo reboot
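For illustration, the resulting line might look like this (keep any options that were already present, such as quiet splash):
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nouveau.modeset=0"
After the reboot, the options should show up in the kernel command line:
$ cat /proc/cmdline | grep nouveau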
Contents:
- Install Kubernetes (MicroK8s)
- Enable GPU add-on and configure MIG
- Deploy Charmed Kubeflow
- Try Kubeflow examples
Install Kubernetes (MicroK8s)
Install MicroK8s and enable the required add-ons (adjust the DNS and MetalLB addresses to your network, and replace ubuntu with your user name):
$ sudo snap install microk8s --classic --channel 1.22
$ sudo microk8s enable dns:10.229.32.21 storage ingress registry rbac helm3 metallb:10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111
$ sudo usermod -a -G microk8s ubuntu
$ sudo chown -f -R ubuntu ~/.kube
$ newgrp microk8s
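Before continuing, you can wait until MicroK8s and its add-ons report ready:
$ microk8s status --wait-ready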
Edit /var/snap/microk8s/current/args/containerd-template.toml to add your Docker Hub credentials, which can help avoid anonymous image pull rate limits. Add:
[plugins."io.containerd.grpc.v1.cri".registry.configs]
  [plugins."io.containerd.grpc.v1.cri".registry.configs."registry-1.docker.io".auth]
    username = "<your-docker-hub-username>"
    password = "<your-docker-hub-password>"
Then restart MicroK8s:
$ microk8s.stop; microk8s.start
Enable GPU add-on and configure MIG
Install the GPU operator by enabling the MicroK8s GPU add-on:
$ sudo microk8s.enable gpu
$ mkdir ~/.kube
$ microk8s config > ~/.kube/config
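You can watch the operator pods start up before proceeding (the namespace below is the one the MicroK8s GPU add-on typically uses; it may differ between versions):
$ kubectl get pods -n gpu-operator-resources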
Check the GPU count reported to Kubernetes:
$ kubectl get nodes --show-labels | grep gpu.count
Configure MIG devices by labelling the node (replace <node-name> with your node's name as shown by kubectl get nodes):
$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite
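The MIG reconfiguration takes a moment; NVIDIA's mig-manager tracks its progress in a node label (the label name below is assumed from mig-manager conventions and should read success when done):
$ kubectl get nodes --show-labels | grep mig.config.state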
Recheck the GPU count; it should have increased, since each physical GPU is now partitioned into several MIG devices:
$ kubectl get nodes --show-labels | grep gpu.count
Troubleshooting: If no nodes appear in the output of the get nodes command, uninstall all GPU drivers from the Kubernetes nodes and reinstall MicroK8s.
Deploy Charmed Kubeflow
Deploy Charmed Kubeflow by following the instructions in How to install Charmed Kubeflow.
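For orientation, the flow described there is roughly the following sketch (the bundle channel is omitted and the names are the commonly used defaults; treat the linked guide as authoritative):
$ juju bootstrap microk8s
$ juju add-model kubeflow
$ juju deploy kubeflow --trust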
Try Kubeflow examples
Charmed Kubeflow can run on both single-node and multi-node DGX hardware, each with its own requirements. There are multiple examples that can be tried out in each environment.
Single-node DGX with Charmed Kubeflow examples
There is a GitHub repository that includes all the details about the Single-node DGX with Charmed Kubeflow.
The following examples can be found and tested:
- Jupyter Notebook example on a single-node DGX, in the file gpu-notebook.ipynb from the repository. It also uses a multi-GPU setup.
- Kubeflow Pipeline example on a single-node DGX that uses the same classifier as the Notebook. It is available in the file gpu-pipeline.ipynb.
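Once a notebook server with GPUs attached is running, a quick sanity check from a notebook terminal is to list the visible devices (this assumes the notebook image ships nvidia-smi, which CUDA-enabled images typically do):
$ nvidia-smi -L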
Multi-node DGX with Charmed Kubeflow examples
There is a GitHub repository that includes all the details about the Multi-node DGX with Charmed Kubeflow.
The following examples can be found and tested:
- Training TensorFlow models with multiple GPUs in a Jupyter Notebook using Charmed Kubeflow, in the folder multi-gpu-in-notebook, where the Jupyter Notebook file gpu-notebook.ipynb is available.
- Training TensorFlow models with GPUs in a Kubeflow Pipeline, in the folder multi-gpu-in-pipeline.
- A simulated example of multi-node training in TensorFlow that uses just a single node, in the folder multi-node-gpu-simulated. It contains multiple files describing the workload distribution and how to run it.
- Multi-node training in TensorFlow using the Kubeflow Training Operator's TFJob, in the folder multi-node-gpu-tfjob.