Kubeflow can take advantage of GPUs to significantly reduce the time required to complete complex or processor-intensive operations. The GPU hardware must be made accessible to Kubeflow, and your workflow must be designed to take advantage of it.
These instructions assume that:
- You have already installed Kubeflow on your cluster (full or lite version).
- You have logged in to the Kubeflow dashboard.
- You have access to the internet for downloading the required example code (notebooks, pipelines).
- You can run Python 3 code in a local terminal (required for compiling the pipeline).
- Your Kubernetes cluster has an NVIDIA GPU attached to it.
This documentation will go through a typical, basic workflow so you can familiarise yourself with using Kubeflow with a GPU.
Enable GPUs for your cluster
To start, you will need to have access to a Kubernetes cluster that has an NVIDIA GPU available. The Kubernetes cluster will also need to be aware of the GPU attached to it.
The method for enabling the GPU on Kubernetes varies slightly depending on how Kubernetes itself has been deployed. The following links have more information:
- For MicroK8s, this is as easy as running microk8s enable gpu. You can also read the MicroK8s documentation for GPU enablement.
- For Charmed Kubernetes, see this documentation: https://ubuntu.com/kubernetes/docs/gpu-workers.
- For other clusters, see this documentation: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
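Once the GPU is enabled, each GPU node should advertise it as an allocatable resource. The following is a minimal sketch, not part of the original instructions, that lists such nodes from the JSON printed by kubectl get nodes -o json. It assumes the NVIDIA device plugin exposes the GPU under the resource name nvidia.com/gpu; the function name gpu_nodes and the sample data are hypothetical.

```python
import json

# Assumed resource name, as exposed by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(nodes_json):
    """Return {node name: GPU count} for nodes advertising at least one GPU.

    `nodes_json` is the text printed by `kubectl get nodes -o json`.
    """
    result = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Hand-written sample in the shape kubectl returns (not real cluster output).
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-worker"},
         "status": {"allocatable": {"cpu": "8", GPU_RESOURCE: "1"}}},
        {"metadata": {"name": "cpu-worker"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
})
print(gpu_nodes(sample))
```

If the result is empty on a real cluster, the GPU has most likely not been enabled for Kubernetes yet, and the steps in the links above are the place to look.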
Create a GPU pipeline
We will be examining and running an object detection pipeline available directly from the Kubeflow bundle repository. This is a pipeline used as part of the test framework for the Kubeflow bundle, and is based on the excellent “pet detector” created for the TensorFlow Object Detection API.
You can examine the complete example pipeline in the Kubeflow bundle repository:
https://github.com/canonical/bundle-kubeflow/blob/master/tests/pipelines/object_detection.py
This pipeline is an adaptation of the pet detector built on top of the TensorFlow Object Detection API found here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md
The rest of this section will describe the main components of the pipeline.
Dockerfile
To start, here is the Dockerfile that this pipeline uses:
https://github.com/canonical/bundle-kubeflow/blob/master/tests/Dockerfile.object_detection
Notice that it builds on top of the tensorflow/tensorflow:1.15.2-gpu-py3 image. Any GPU-based pipelines you run must include the appropriate GPU libraries within the Docker image. The rest of the Dockerfile is concerned with cloning the TensorFlow models repository and setting up its dependencies.
Loading data
The first step in the pipeline deals with downloading and converting the input images from the JPG format to the TFRecord format. Notice that the records and validation_images arguments are both of type OutputBinaryFile(str), which represent files on disk that will be saved to object storage by Kubeflow Pipelines.
We store 10 validation images in the validation_images output. These validation images will later be used to ensure that the model is being served correctly. The actual conversion of the JPG files is handled by this script available in the TensorFlow models repository.
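To illustrate how a step can hold back validation images and write them to a binary output, here is a minimal sketch using only the standard library. It is not the pipeline's actual code: the function name write_validation_images, the choice of a gzipped tar archive, and the placeholder image bytes are all assumptions. In the real step, the writable handle would be the one supplied for the validation_images argument; an in-memory buffer stands in for it here.

```python
import io
import tarfile


def write_validation_images(images, output, count=10):
    """Write the first `count` images to `output` as a gzipped tar archive.

    `images` maps file names to raw JPG bytes; `output` is any writable
    binary file object. Returns the names that were written.
    """
    selected = sorted(images)[:count]
    with tarfile.open(fileobj=output, mode="w:gz") as tar:
        for name in selected:
            data = images[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return selected


# Placeholder bytes stand in for real JPG data.
fake_images = {f"pet_{i:02d}.jpg": b"\xff\xd8" + bytes([i]) for i in range(25)}
buffer = io.BytesIO()  # stands in for the binary handle passed to the step
kept = write_validation_images(fake_images, buffer)
```

Because the function only depends on a writable binary file object, the same code works whether the handle points at a local file, an in-memory buffer, or a file managed by the pipeline runtime.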
Training the model
The next step is to train the model. We start by downloading a model that was pretrained on the COCO Dataset. We then create a /pipeline.config file that is passed to the model_main.py file, which runs the training sequence. For this example, we have set the default number of training steps to be 1. The pipeline exposes a pipeline_steps parameter that can be used to increase this number for real-world workloads. After training, we export the updated model in a format that can be loaded by TensorFlow Serving. This updated model gets written to the file represented by the exported step argument, which means it will be stored in object storage by Kubeflow Pipelines, and will be available for the next step to use.
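As a rough illustration of how the step count could reach the training script, the sketch below builds a command line for model_main.py. It assumes the step count is passed through the script's --num_train_steps flag and that the script is run from the research directory of the TensorFlow models repository; the actual pipeline may set the count inside /pipeline.config instead. The function name training_command and the /model directory are hypothetical.

```python
def training_command(steps=1, config_path="/pipeline.config", model_dir="/model"):
    """Build the argument list for one training run.

    `steps` defaults to 1 to mirror the example pipeline's default; the
    model directory is a placeholder chosen for this sketch.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        "python",
        "object_detection/model_main.py",
        f"--pipeline_config_path={config_path}",
        f"--model_dir={model_dir}",
        f"--num_train_steps={steps}",
    ]


print(" ".join(training_command()))
print(" ".join(training_command(steps=5000)))
```

Returning a list rather than a single string keeps the arguments safe to hand to subprocess.run without shell quoting.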
Testing the model
Now that we have a trained model, we can serve it and test that it is being served properly. To do this, we’ll have the testing step consist of a multi-container Pod. The main container has the testing code, and the sidecar container will serve the model with TensorFlow Serving. The main container extracts the model to /output/, which the sidecar has access to. The main container then waits for the sidecar to come up, before testing the metadata and predict endpoints to ensure that they respond appropriately.
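The wait-then-test pattern described above can be sketched as a small retry loop. This is an illustration rather than the pipeline's own test code: the names wait_until_ready and serving_urls are hypothetical, and the host, port 8501, and model name are assumptions. The URL shapes follow the TensorFlow Serving REST API for the metadata and predict endpoints.

```python
import time


def serving_urls(model, host="localhost", port=8501):
    """Return the metadata and predict URLs for a served model.

    Host and port are assumptions; adjust them to match the sidecar.
    """
    base = f"http://{host}:{port}/v1/models/{model}"
    return {"metadata": f"{base}/metadata", "predict": f"{base}:predict"}


def wait_until_ready(check, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `check()` until it returns True; return the attempts used.

    Raises TimeoutError if the check never succeeds.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return attempt
        except OSError:
            # Connection refused and similar errors mean the server is not up yet.
            pass
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"not ready after {attempts} attempts")


# A fake check that fails twice before succeeding, so the sketch runs offline.
outcomes = iter([ConnectionRefusedError(), False, True])


def fake_check():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


used = wait_until_ready(fake_check, delay=0, sleep=lambda _: None)
print(used, serving_urls("pets"))
```

In a real test, check would issue an HTTP request to the metadata URL and return True on a successful response; only after that would the predict endpoint be exercised with the validation images.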
Defining the pipeline
Now that each step is defined, we can define the overall pipeline. Each step is created with its inputs, and then step.container.set_gpu_limit(1) is called on it, which requests 1 GPU for the step. This can be changed for multi-GPU workloads by increasing the number appropriately. We also set up the sidecar for the testing step with test.add_sidecar(serve_sidecar().set_gpu_limit(1)):
def object_detection_pipeline(
    images='https://people.canonical.com/~knkski/images.tar.gz',
    annotations='https://github.com/canonical/bundle-kubeflow/blob/test-artifacts/tests/pipelines/artifacts/annotations.tar.gz',
    pretrained='https://people.canonical.com/~knkski/faster_rcnn_resnet101_coco_11_06_2017.tar.gz',
):
    # Download the images and annotations and convert them to TFRecords.
    loaded = load_task(images, annotations)
    loaded.container.set_gpu_limit(1)
    # Train on the converted records, starting from the pretrained model.
    train = train_task(loaded.outputs['records'], pretrained)
    train.container.set_gpu_limit(1)
    # Test the exported model; the sidecar serves it and also gets 1 GPU.
    test = test_task(train.outputs['exported'], loaded.outputs['validation_images'])
    test.add_sidecar(serve_sidecar().set_gpu_limit(1))
    # Apply attach_output_volume to every step in the pipeline.
    dsl.get_pipeline_conf().add_op_transformer(attach_output_volume)
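At the Kubernetes level, a GPU limit ends up as a resource limit on the step's container. The plain-Python sketch below shows roughly what that part of the container spec looks like for a given GPU count. It assumes the default NVIDIA vendor, which gives the resource name nvidia.com/gpu; the helper name gpu_resources and the container name are hypothetical, and the exact layout of the compiled YAML may differ.

```python
def gpu_resources(count, vendor="nvidia"):
    """Return a container `resources` mapping that limits GPUs to `count`.

    Extended resources such as GPUs are specified under `limits`;
    Kubernetes uses the limit as the request when no request is given.
    """
    if count < 1:
        raise ValueError("count must be a positive integer")
    return {"limits": {f"{vendor}.com/gpu": count}}


# Illustrative container fragment; the name is a placeholder and the image
# is the base image mentioned in the Dockerfile section.
container = {
    "name": "train",
    "image": "tensorflow/tensorflow:1.15.2-gpu-py3",
    "resources": gpu_resources(1),
}
print(container["resources"])
```

For a multi-GPU step, passing a larger count changes only the number in the limits mapping; the scheduler then places the Pod on a node with that many free GPUs.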
The complete, compiled YAML version of the GPU test pipeline is included in the kubeflow repository. You can download it directly from this link.