Using GPUs with Kubeflow

evilnick · 2 May 2021 23:30

Kubeflow can take advantage of GPUs to significantly reduce the time required to complete complex or processor-intensive operations. The GPU hardware must be made accessible to Kubeflow, and your workflow must be written to take advantage of it.

These instructions assume that:

You have already installed Kubeflow on your cluster (full or lite version).
You have logged in to the Kubeflow dashboard.
You have access to the internet for downloading the required example code (notebooks, pipelines).
You can run Python 3 code in a local terminal (required for compiling the pipeline).
Your Kubernetes cluster has an NVIDIA GPU attached to it.

This documentation will go through a typical, basic workflow so you can familiarise yourself with using Kubeflow with a GPU.

Enable GPUs for your cluster

To start, you will need to have access to a Kubernetes cluster that has an NVIDIA GPU available. The Kubernetes cluster will also need to be aware of the GPU attached to it.

The method for enabling the GPU on Kubernetes varies slightly depending on how Kubernetes itself has been deployed. The following links have more information:

For MicroK8s, this is as easy as running microk8s enable gpu. You can also read the MicroK8s documentation for GPU enablement.
For Charmed Kubernetes, see this documentation: https://ubuntu.com/kubernetes/docs/gpu-workers.
For other clusters, see this documentation: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
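
Once the GPU is enabled, one way to confirm that Kubernetes is aware of it is to look at each node's allocatable resources. The following is a minimal Python sketch, not part of the original instructions. It assumes the GPU is advertised under the resource name nvidia.com/gpu and that you feed it the parsed output of kubectl get nodes -o json. The helper name gpu_nodes and the sample data are hypothetical.

```python
import json

# Resource name under which NVIDIA GPUs are assumed to be advertised.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list):
    """Return {node name: allocatable GPU count} for nodes with at least one GPU.

    node_list is the parsed JSON produced by `kubectl get nodes -o json`.
    """
    found = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            found[node["metadata"]["name"]] = count
    return found


# Hypothetical, heavily trimmed sample of the node list, for illustration only.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "gpu-worker"},
     "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "cpu-worker"},
     "status": {"allocatable": {"cpu": "4"}}}
  ]
}
""")

if __name__ == "__main__":
    print(gpu_nodes(SAMPLE))
```

If the result is empty for your real cluster, the GPU has probably not been enabled yet, and the links above are the place to start.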

Create a GPU pipeline

We will be examining and running an object detection pipeline available directly from the Kubeflow bundle repository. This is a pipeline used as part of the test framework for the Kubeflow bundle, and is based on the excellent “pet detector” created for the TensorFlow Object Detection API.

You can examine the complete example pipeline in the Kubeflow bundle repository:

https://github.com/canonical/bundle-kubeflow/blob/master/tests/pipelines/object_detection.py

This pipeline is an adaptation of the pet detector built on top of the TensorFlow Object Detection API, found here:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

The rest of this section will describe the main components of the pipeline.

Dockerfile

To start, here is the Dockerfile that this pipeline uses:

https://github.com/canonical/bundle-kubeflow/blob/master/tests/Dockerfile.object_detection

Notice that it builds on top of the tensorflow/tensorflow:1.15.2-gpu-py3 image. Any GPU-based pipelines you run must include the appropriate GPU libraries within the Docker image. The rest of the Dockerfile is concerned with cloning the TensorFlow models repository and setting up its dependencies.

Loading data

The first step in the pipeline deals with downloading and converting the input images from the JPG format to the TFRecord format. Notice that the records and validation_images arguments are both of type OutputBinaryFile(str), which represent files on disk that will be saved to object storage by Kubeflow Pipelines.

We store 10 validation images in the validation_images output. These validation images will later be used to ensure that the model is being served correctly. The actual conversion of the JPG files is handled by a script available in the TensorFlow models repository.

Training the model

The next step is to train the model. We start by downloading a model that was pretrained on the COCO Dataset. We then create a /pipeline.config file that is passed to the model_main.py file, which runs the training sequence. For this example, we have set the default number of training steps to 1. The pipeline exposes a pipeline_steps parameter that can be used to increase this number for real-world workloads. After training, we export the updated model in a format that can be loaded by TensorFlow Serving. This updated model is written to the file represented by the exported step argument, which means it will be stored in object storage by Kubeflow Pipelines and will be available for the next step to use.

Testing the model

Now that we have a trained model, we can serve it and test that it is being served properly. To do this, the testing step consists of a multi-container Pod. The main container has the testing code, and the sidecar container serves the model with TensorFlow Serving. The main container extracts the model to /output/, which the sidecar has access to. The main container then waits for the sidecar to come up, before testing the metadata and predict endpoints to ensure that they respond appropriately.
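
The wait-then-test sequence can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from the repository. It assumes TensorFlow Serving's REST API is reachable on localhost port 8501 and that the model is served under the name "model". Both of those, along with the helper names, are assumptions made for the example.

```python
import json
import time
import urllib.request

# Assumed address of the TensorFlow Serving sidecar and a hypothetical model name.
BASE_URL = "http://localhost:8501/v1/models/model"


def wait_until(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True; return False if attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False


def metadata_ready(url=BASE_URL + "/metadata"):
    """Return True once the metadata endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP errors while the
        # sidecar is still starting up.
        return False


def predict(instances, url=BASE_URL + ":predict"):
    """POST instances to the predict endpoint and return the decoded JSON reply."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())
```

In a testing step shaped like this, wait_until(metadata_ready) would run first, and predict would only be called with the validation images once it returns True.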

Defining the pipeline

Now that each step is defined, we can define the overall pipeline. Each step is created with its inputs, and then step.container.set_gpu_limit(1) is called on it, which requests 1 GPU for the step. This can be changed for multi-GPU workloads by increasing the number appropriately. We also set up the sidecar for the testing step with test.add_sidecar(serve_sidecar().set_gpu_limit(1)).

def object_detection_pipeline(
    images='https://people.canonical.com/~knkski/images.tar.gz',
    annotations='https://github.com/canonical/bundle-kubeflow/blob/test-artifacts/tests/pipelines/artifacts/annotations.tar.gz',
    pretrained='https://people.canonical.com/~knkski/faster_rcnn_resnet101_coco_11_06_2017.tar.gz',
):
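    # Download the images and annotations and convert them to TFRecord format.
    loaded = load_task(images, annotations)
    loaded.container.set_gpu_limit(1)

    # Train on the converted records, starting from the pretrained model.
    train = train_task(loaded.outputs['records'], pretrained)
    train.container.set_gpu_limit(1)

    # Test the exported model; the serving sidecar also requests one GPU.
    test = test_task(train.outputs['exported'], loaded.outputs['validation_images'])
    test.add_sidecar(serve_sidecar().set_gpu_limit(1))

    # Apply attach_output_volume to every step in the pipeline.
    dsl.get_pipeline_conf().add_op_transformer(attach_output_volume)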
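
In Kubernetes terms, a GPU limit ends up as a resource limit on the step's container. The sketch below is purely illustrative and does not use the Kubeflow Pipelines SDK. It assumes the limit is expressed under the nvidia.com/gpu resource name, and the function name with_gpu_limit and the image name are hypothetical.

```python
def with_gpu_limit(container, gpus=1, resource="nvidia.com/gpu"):
    """Return a copy of a container spec dict with a GPU limit added.

    The input dict is left unmodified.
    """
    updated = dict(container)
    resources = dict(updated.get("resources", {}))
    limits = dict(resources.get("limits", {}))
    limits[resource] = gpus
    resources["limits"] = limits
    updated["resources"] = resources
    return updated


# Hypothetical container spec for a training step.
original = {"name": "train", "image": "example/object-detection:latest"}
spec = with_gpu_limit(original, gpus=1)

if __name__ == "__main__":
    print(spec["resources"])
```

A step with such a limit can only be scheduled onto a node that still has that many GPUs free, which is why the cluster must be aware of its GPUs before the pipeline is run.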

The complete, compiled YAML version of the GPU test pipeline is included in the kubeflow repository, where it can be downloaded directly.

nobuto · 4 June 2022 11:54

evilnick:

def object_detection_pipeline(
    images='https://people.canonical.com/~knkski/images.tar.gz',
    annotations='https://people.canonical.com/~knkski/annotations.tar.gz',
    pretrained='https://people.canonical.com/~knkski/faster_rcnn_resnet101_coco_11_06_2017.tar.gz',
):

These links are no longer accessible; they return 404 Not Found.

It looks like this has already been reported on GitHub.