This guide describes how to use Azure Kubernetes Service (AKS) Spot Virtual Machines (VMs) with Charmed Kubeflow (CKF). Spot virtual machines are an easy way to access extra computing on demand that can be leveraged for specific Machine Learning (ML) training.
You should use spot VMs for your workflows that are not time-sensitive and short tests and experiments where stability is less important than cost. For example, data processing, distributed training and hyperparameter tuning, model training and batch inference.
It is not recommended to use spot VMs for the Kubernetes control plane, notebooks and dashboards, databases or datastores like Minio, and model serving for online inference.
Setting up nodes is intended for system admins. Configuring and running workloads is intended for end users.
Requirements
- CKF deployed on an AKS cluster. See Deploy to AKS for more details.
Note that GPU- accelerated VMs are only available in certain regions which may differ from your chosen AKS cluster region.
Add an Azure spot node pool to AKS
-
Go to the Charmed Kubeflow AKS cluster overview page.
-
Click on
Node pools
under theSettings
dropdown on the left-hand sidebar. -
Click on
Add node pool
. -
In the next section, make sure to check the
Enable Azure Spot instances
check box. -
Configure the Azure spot virtual machine by clicking on
Configureā¦
.
Configure your Azure spot virtual machine
Eviction type and policy
You can specify when your VMs are evicted and how. If your workload has a maximum price over which it is not worth running it, you can define it in this section. See Understand eviction for more details.
VMs size and pricing
You can define the size and pricing of your VMs based on your workloads.
Labels
Labels identify the VM(s) that belong to a certain pool. It is recommended to add at least one label to better identify your VMs later on.
- Click on
Add
on the bottom left side of the page once your spot node pool is configured.
Run workflows in spot virtual machines
To schedule workloads, you have to assign their associated Pod(s) to the desired VM(s).
To do so, you have to configure your workloads. This configuration depends entirely on the workload type and the component that runs it. See the following examples.
Please make sure you follow Azureās suggestions for adding tolerations and affinities using the examples below.
Pipelines
Using the kfp
v2 SDK
You can use the kfp
v2 SDK to add a node selector constraint. To do so, use the following configuration:
import kfp
from kfp import dsl
def gpu_p100_op():
return dsl.ContainerOp(
name='check_p100',
image='tensorflow/tensorflow:latest-gpu',
command=['sh', '-c'],
arguments=['nvidia-smi']
).add_node_selector_constraint('cloud.my-cloud.com/gpu-accelerator', āaccelerator-nameā).container.set_gpu_limit(1)
Using the kfp
v1 SDK
You can use the kfp
v1 SDK to add affinity.
For this method, you have to create a V1Affinity
object, using the Kubernetes Python client, that has to be passed to the add_affinity
method of the kfp
SDK. It can be configured as follows:
from kubernetes.client.models import V1Affinity, V1NodeAffinity, V1NodeSelector, V1NodeSelectorTerm, V1NodeSelectorRequirement
spot_affinity = V1Affinity(node_affinity=V1NodeAffinity(
required_during_scheduling_ignored_during_execution=V1NodeSelector(
node_selector_terms=[V1NodeSelectorTerm(
match_expressions=[V1NodeSelectorRequirement(
key='a-custom-key.io/my-key',
operator='In',
values=['a-custom-value'])])]))
)
an_operation_to_be_scheduled_in_a_spot_instance = kfp.components.create_component_from_func(...)
an_operation_to_be_scheduled_in_a_spot_instance.add_affinity(spot_affinity)
Training jobs
When using YAMLs
You can modify training job yaml
representations to have node selectors or affinity. See training jobs for more details.
A training job usually has different processes, such as the āChiefā, āWorkerā, or āParameter Serverā. Each of them has their own spec
, where node affinity or node selectors can be defined, provided that a spot virtual machine is labelled.
For example, you can define a TFJob yaml
representation as follows:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
generateName: tfjob
namespace: your-user-namespace
spec:
tfReplicaSpecs:
PS:
...
spec:
...
Worker:
spec:
...
Using nodeAffinity
To add nodeAffinity
to a training job you can edit the specific process or processesā spec field ensuring to match matchExpressions
with the spot node pool:
Worker:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: a-custom-key.io/my-key
operator: In
values:
- a-custom-value
Using nodeSelector
You can add a nodeSelector
to ensure specific processes are scheduled in the desired node pool as follows:
Worker:
spec:
containers:
- name: my-training-process
image: training:0.1
nodeSelector:
a-custom-key.io/my-key: a-custom-value
When using the SDK
When using the kubeflow.training
SDK in a Notebook, for example, the training job can be edited similar to how the pure yaml
would be. Thatās because the training jobs use the Python Kubernetes Client to define each component.
Creating a TFJob
requires you to do the following:
container = V1Container(...)
worker = V1ReplicaSpec(..., spec=V1PodSpec(containers=container))
tfjob = KubeflowOrgV1TFJob(
api_version="kubeflow.org/v1",
kind="TFJob",
metadata=V1ObjectMeta(name="mnist",namespace=namespace),
spec=KubeflowOrgV1TFJobSpec(
clean_pod_policy="None",
tf_replica_specs={"Worker": worker}
)
)
Based on the above, to schedule the worker
process in a spot virtual machine, the v1PodSpec
object has to have either a nodeAffinity
or nodeSelector
. For example:
worker = V1ReplicaSpec(
spec=V1PodSpec(containers=container),
affinity=V1Affinity(...)
)
Katib hyperparameter tuning
When using YAMLs
You can schedule a Katib hyperparameter tuning experiment to a specific VM by adding nodeAffinity
or nodeSelector
s.
These fields should be added to the yaml
representation of the experiment, at the trialSpec
level.
Using nodeAffinity
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
template:
spec:
containers:
- name: training-container
image: image-name
command:
- "a-command"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: a-custom-key.io/my-key
operator: In
values:
- a-custom-value
restartPolicy: Never
Using nodeSelector
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
containers:
- name: my-training-process
image: training:0.1
nodeSelector:
a-custom-key.io/my-key: a-custom-value
See Trial Templates for more information.
When using the SDK
When using the kubeflow.katib
SDK in a Notebook, for example, the experiment can be edited similar to how the pure yaml
would be. Thatās because the trial worker spec is usually defined as a JSON template of a Kubernetes Job
:
trial_spec={
"apiVersion": "batch/v1",
"kind": "Job",
"spec": {
"template": {
"metadata": {
"annotations": {
"sidecar.istio.io/inject": "false"
}
},
"spec": {
"containers": [...]
"restartPolicy": "Never"
}
}
}
}
This is where nodeAffinity
or a nodeSelector
could be added to ensure this workload is scheduled in the desired spot virtual machine.
Best practices for evicting workloads from a spot virtual machine
These are a few recommendations for handling workloads eviction from spot VMs:
- Set
terminationGracePeriodSeconds
to allow the workloads to terminate gracefully. For instance, if you have access to theyaml
representation of a workload, you can do the following:
spec:
nodeSelector:
my-key.io/key: "value"
terminationGracePeriodSeconds: 25
- Configure the processes to finalise gracefully by modifying the code. For example, the
kfp
SDK offers theset_retry()
method for setting retries.