Enabling GPU acceleration
The Charmed Apache Spark solution offers an OCI image that ships the Apache Spark RAPIDS plugin, which enables GPU acceleration for Spark jobs.
Setup
With the spark-client snap installed and MicroK8s running with the GPU addon enabled, we can look at how to launch Spark jobs that use GPUs on Kubernetes.
First, we need to create a pod template to limit the number of GPUs per container.
Edit the pod manifest file (we'll refer to it as gpu_executor_template.yaml) by adding the following content:
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        limits:
          nvidia.com/gpu: 1
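As a side note, if you generate manifests programmatically, the same template can be sketched as a plain Python dict (JSON is a subset of YAML, so the serialized form should also be usable as a pod template, though we have only stated this as an assumption here):

```python
import json

# Sketch: the executor pod template above expressed as a plain Python dict.
# Key names follow the Kubernetes Pod spec; the GPU resource key is the one
# exposed by the NVIDIA device plugin.
pod_template = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {
        "containers": [
            {
                "name": "executor",
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ]
    },
}

# Serialize to JSON, which YAML parsers also accept.
print(json.dumps(pod_template, indent=2))
```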
Submitting a Spark job with GPU acceleration
Using the spark-client snap, we can submit the desired Spark job with some extra configuration options that enable GPU acceleration:
spark-client.spark-submit \
  ... \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.memory.pinnedPool.size=1G \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.kubernetes.container.image=ghcr.io/canonical/charmed-spark-gpu:3.4-22.04_edge \
  --conf spark.kubernetes.executor.podTemplateFile=gpu_executor_template.yaml
...
The Apache Spark configuration options can also be set at the service account level using the spark-client snap, so that they apply to every job. Please refer to the guide on how to manage options at the service account level. For more information on how the Apache Spark Client manages configuration options, please refer to the explanation section.
The options above are the minimal set needed to enable the Apache Spark RAPIDS plugin.
For more information on available options, see the full list.
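As an illustration only (not part of the guide itself), if you script your submissions, the minimal option set above can be kept in a dict and flattened into `--conf key=value` arguments; the values below are the ones used in this guide:

```python
# The minimal RAPIDS option set from this guide, as a dict.
rapids_confs = {
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "1",
    "spark.rapids.memory.pinnedPool.size": "1G",
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.executor.resource.gpu.discoveryScript": "/opt/getGpusResources.sh",
    "spark.executor.resource.gpu.vendor": "nvidia.com",
    "spark.kubernetes.container.image": "ghcr.io/canonical/charmed-spark-gpu:3.4-22.04_edge",
    "spark.kubernetes.executor.podTemplateFile": "gpu_executor_template.yaml",
}

def conf_args(confs: dict) -> list:
    """Flatten a conf dict into spark-submit '--conf k=v' argument pairs."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args

print(" ".join(conf_args(rapids_confs)))
```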
Thanks for the tutorial!
I am running a simple example, translated from the ubuntu-count.py job in another Charmed Spark tutorial and adapted for RAPIDS:
import time

from pyspark.sql import SparkSession, functions as F

start_time = time.time()

spark = SparkSession \
    .builder \
    .appName("CountUbuntuTweetsGPU") \
    .getOrCreate()

# DataFrame API (GPU-accelerable), no Python UDFs/RDDs
df = spark.read.option("header", True).csv("s3a://project-c/spark/twitter.csv")

count = df.filter(F.lower(F.col("text")).contains("ubuntu")).count()

end_time = time.time()
out = f"Number of tweets containing Ubuntu: {count} (Took {end_time - start_time:.2f}s)"
print(out)

# Write a tiny DF instead of RDD output
spark.createDataFrame([(out,)], ["value"]).coalesce(1).write.mode("overwrite").text("s3a://project-c/spark/twitter.out")

spark.stop()
But I am getting the executor pods OOMKilled, so I updated the template to request more resources:
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        requests:
          cpu: "8"
          memory: "24Gi"
          nvidia.com/gpu: 1
        limits:
          cpu: "8"
          memory: "24Gi"
          nvidia.com/gpu: 1
But they are still dying on me:
# ks get po -w
NAME READY STATUS RESTARTS AGE
count-ubuntu-py-cca56b98a1772240-driver 1/1 Running 0 11s
countubuntutweets-8ee79198a1773d02-exec-1 1/1 Running 0 5s
countubuntutweets-8ee79198a1773d02-exec-2 0/1 Pending 0 5s
countubuntutweets-8ee79198a1773d02-exec-1 0/1 OOMKilled 0 34s
countubuntutweets-8ee79198a1773d02-exec-1 0/1 OOMKilled 0 35s
countubuntutweets-8ee79198a1773d02-exec-2 0/1 Pending 0 36s
countubuntutweets-8ee79198a1773d02-exec-2 0/1 ContainerCreating 0 36s
countubuntutweets-8ee79198a1773d02-exec-1 0/1 OOMKilled 0 36s
countubuntutweets-8ee79198a1773d02-exec-1 0/1 OOMKilled 0 36s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-2 1/1 Running 0 37s
countubuntutweets-8ee79198a1773d02-exec-2 0/1 OOMKilled 0 45s
countubuntutweets-8ee79198a1773d02-exec-2 0/1 OOMKilled 0 46s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 Pending 0 11s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 ContainerCreating 0 11s
countubuntutweets-8ee79198a1773d02-exec-2 0/1 OOMKilled 0 47s
countubuntutweets-8ee79198a1773d02-exec-2 0/1 OOMKilled 0 47s
countubuntutweets-8ee79198a1773d02-exec-3 1/1 Running 0 12s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 OOMKilled 0 20s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 OOMKilled 0 21s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 Pending 0 12s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 ContainerCreating 0 12s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 OOMKilled 0 23s
countubuntutweets-8ee79198a1773d02-exec-3 0/1 OOMKilled 0 23s
countubuntutweets-8ee79198a1773d02-exec-4 1/1 Running 0 13s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 OOMKilled 0 21s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 OOMKilled 0 22s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 Pending 0 12s
countubuntutweets-8ee79198a1773d02-exec-6 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-6 0/1 Pending 0 1s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 ContainerCreating 0 12s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 OOMKilled 0 23s
countubuntutweets-8ee79198a1773d02-exec-4 0/1 OOMKilled 0 23s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 Terminating 0 12s
countubuntutweets-8ee79198a1773d02-exec-6 0/1 Terminating 0 1s
countubuntutweets-8ee79198a1773d02-exec-6 0/1 Terminating 0 1s
countubuntutweets-8ee79198a1773d02-exec-5 1/1 Terminating 0 13s
count-ubuntu-py-cca56b98a1772240-driver 0/1 Error 0 76s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 Completed 0 13s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 Completed 0 14s
countubuntutweets-8ee79198a1773d02-exec-5 0/1 Completed 0 14s
count-ubuntu-py-cca56b98a1772240-driver 0/1 Error 0 78s
Any suggestions on the config options I am passing to the job submit?
spark-client.spark-submit \
  --username $USERNAME --namespace $NS \
  --deploy-mode cluster \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.memory.pinnedPool.size=1G \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.kubernetes.container.image=ghcr.io/canonical/charmed-spark-gpu:3.4-22.04_edge \
  --conf spark.kubernetes.executor.podTemplateFile=gpu_executor_template.yaml \
  s3a://project-c/spark/count-ubuntu.py
Mmmm, looks like I am pointing to the old CPU-based script s3a://project-c/spark/count-ubuntu.py.
Updating to s3a://project-c/spark/count-ubuntu-gpu.py and checking again.
Thanks @paolosottovia
Looks like the pod template ignores CPU and memory.
To specify those resources, use these configs:
--conf spark.executor.cores=2 \
--conf spark.executor.memory=8G
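For context, an illustrative sketch (assumptions, not from the thread): on Kubernetes the executor pod needs roughly spark.executor.memory plus the memory overhead, which commonly defaults to the larger of 384 MiB and a fraction (often 10%) of executor memory, and the RAPIDS pinned pool also needs off-heap headroom. If the total exceeds the pod's memory limit, the container gets OOMKilled. A rough estimate:

```python
# Sketch (assumption-labelled): estimate the memory an executor pod needs.
# overhead = max(384 MiB, overhead_factor * executor_memory) mirrors a common
# Spark default; check your own spark.kubernetes/memoryOverhead settings.
MIB = 1 << 20
GIB = 1 << 30

def estimated_pod_memory(executor_memory_bytes,
                         pinned_pool_bytes=0,
                         overhead_factor=0.10):
    """Executor heap + overhead + RAPIDS pinned-pool headroom, in bytes."""
    overhead = max(384 * MIB, int(overhead_factor * executor_memory_bytes))
    return executor_memory_bytes + overhead + pinned_pool_bytes

# e.g. spark.executor.memory=8G with a 1G pinned pool:
need = estimated_pod_memory(8 * GIB, pinned_pool_bytes=1 * GIB)
print(f"~{need / GIB:.1f} GiB per executor pod")
```

If that estimate is larger than the memory limit in the pod template, the kernel will kill the container regardless of what the template requests.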
Also consider specifying the number of executor instances depending on the number of GPUs available in your cluster:
--conf spark.executor.instances=1 \
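As a sketch of that sizing (illustrative, assuming one GPU per executor as configured above), the executor count is bounded by the GPUs available across the cluster:

```python
# Sketch: cap spark.executor.instances by the GPUs available in the cluster.
def max_executor_instances(total_gpus, gpus_per_executor=1):
    """Each executor claims `gpus_per_executor` GPUs, bounding the instance count."""
    if gpus_per_executor <= 0:
        raise ValueError("gpus_per_executor must be positive")
    return total_gpus // gpus_per_executor

print(max_executor_instances(4))     # 4 executors on a 4-GPU cluster
print(max_executor_instances(4, 2))  # 2 executors if each claims 2 GPUs
```

Requesting more instances than this bound leaves the extra executors Pending, as seen in the pod watch output above.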