Charmed Apache Spark K8s Documentation - Enabling GPU acceleration

Enabling GPU acceleration

The Charmed Apache Spark solution offers an OCI image that ships the Apache Spark RAPIDS plugin, enabling GPU acceleration for Spark jobs.

Setup

After installing the spark-client snap and MicroK8s with the GPU addon enabled, we can look into how to launch GPU-accelerated Spark jobs on Kubernetes.

First, we need to create a pod template to limit the number of GPUs per container.

Create a pod manifest file (we’ll refer to it as gpu_executor_template.yaml) with the following content:

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        limits:
          nvidia.com/gpu: 1
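If you script your job submissions, the template can also be generated programmatically. Below is a minimal sketch; the helper name and hard-coded template are our own for illustration, not part of the snap:

```python
# Sketch: generate the executor pod template with a configurable GPU
# limit. Plain string formatting keeps it free of extra dependencies.
TEMPLATE = """\
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        limits:
          nvidia.com/gpu: {gpus}
"""

def write_gpu_template(path: str, gpus: int = 1) -> None:
    # gpus must be a whole number: Kubernetes does not allow
    # fractional GPU limits.
    with open(path, "w") as f:
        f.write(TEMPLATE.format(gpus=gpus))

write_gpu_template("gpu_executor_template.yaml")
```

The generated file can then be passed to spark-submit via the podTemplateFile option shown below.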

Submitting a Spark job with GPU acceleration

Using the spark-client snap, we can submit the desired Spark job with some extra configuration options that enable GPU acceleration:

spark-client.spark-submit \
    ... \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --conf spark.rapids.memory.pinnedPool.size=1G \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
    --conf spark.executor.resource.gpu.vendor=nvidia.com \
    --conf spark.kubernetes.container.image=ghcr.io/canonical/charmed-spark-gpu:3.4-22.04_edge \
    --conf spark.kubernetes.executor.podTemplateFile=gpu_executor_template.yaml \
    ...

These Apache Spark configuration options can also be set at the service account level with the Spark Client snap, so that they apply to every job. Please refer to the guide on managing options at the service account level. For more information on how the Apache Spark Client manages configuration options, please refer to the explanation section.

The options above are the minimal set needed to enable the Apache Spark RAPIDS plugin. For more information on available options, see the full list.

Thanks for the tutorial

I am running a simple example that I adapted from the ubuntu-count.py script in another Charmed Spark tutorial, so it can be used with RAPIDS:

import time
from pyspark.sql import SparkSession, functions as F

start_time = time.time()

spark = SparkSession\
        .builder\
        .appName("CountUbuntuTweetsGPU")\
        .getOrCreate()

# DataFrame API (GPU-accelerable), no Python UDFs/RDDs
df = spark.read.option("header", True).csv("s3a://project-c/spark/twitter.csv")

count = (
    df.filter(F.lower(F.col("text")).contains("ubuntu")).count()
)

end_time = time.time()
out = f"Number of tweets containing Ubuntu: {count} (Took {end_time - start_time:.2f}s)"
print(out)

# Write a tiny DF instead of RDD output
spark.createDataFrame([(out,)], ["value"]).coalesce(1).write.mode("overwrite").text("s3a://project-c/spark/twitter.out")

spark.stop()
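One way to sanity-check that the DataFrame operations above were actually placed on the GPU is to inspect the physical plan (df.explain()): with the RAPIDS plugin active, operators such as Filter and Project are rewritten to Gpu* variants. A small hypothetical helper to scan the plan text (the operator names are from RAPIDS plan output; the plan fragments below are abbreviated by hand for illustration):

```python
def looks_gpu_accelerated(plan_text: str) -> bool:
    """Return True if a physical plan string contains RAPIDS operators.

    Scanning for well-known Gpu* operator names is a rough heuristic,
    not an official API.
    """
    gpu_operators = ("GpuFilter", "GpuProject", "GpuScan", "GpuHashAggregate")
    return any(op in plan_text for op in gpu_operators)

# Hand-abbreviated example plan fragments:
cpu_plan = "*(1) Filter Contains(lower(text#0), ubuntu)"
gpu_plan = "GpuFilter gpucontains(gpulower(text#0), ubuntu)"
print(looks_gpu_accelerated(cpu_plan))  # False
print(looks_gpu_accelerated(gpu_plan))  # True
```

If the plan only shows CPU operators, the plugin either did not load or fell back to CPU for those expressions.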

But I am getting the executor pods OOMKilled, so I updated the template to request more resources:

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        requests:
          cpu: "8"
          memory: "24Gi"
          nvidia.com/gpu: 1
        limits:
          cpu: "8"
          memory: "24Gi"
          nvidia.com/gpu: 1

But they are still dying on me:

# ks get po -w
NAME                                        READY   STATUS    RESTARTS   AGE
count-ubuntu-py-cca56b98a1772240-driver     1/1     Running   0          11s
countubuntutweets-8ee79198a1773d02-exec-1   1/1     Running   0          5s
countubuntutweets-8ee79198a1773d02-exec-2   0/1     Pending   0          5s
countubuntutweets-8ee79198a1773d02-exec-1   0/1     OOMKilled   0          34s
countubuntutweets-8ee79198a1773d02-exec-1   0/1     OOMKilled   0          35s
countubuntutweets-8ee79198a1773d02-exec-2   0/1     Pending     0          36s
countubuntutweets-8ee79198a1773d02-exec-2   0/1     ContainerCreating   0          36s
countubuntutweets-8ee79198a1773d02-exec-1   0/1     OOMKilled           0          36s
countubuntutweets-8ee79198a1773d02-exec-1   0/1     OOMKilled           0          36s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-2   1/1     Running             0          37s
countubuntutweets-8ee79198a1773d02-exec-2   0/1     OOMKilled           0          45s
countubuntutweets-8ee79198a1773d02-exec-2   0/1     OOMKilled           0          46s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     Pending             0          11s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     ContainerCreating   0          11s
countubuntutweets-8ee79198a1773d02-exec-2   0/1     OOMKilled           0          47s
countubuntutweets-8ee79198a1773d02-exec-2   0/1     OOMKilled           0          47s
countubuntutweets-8ee79198a1773d02-exec-3   1/1     Running             0          12s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     OOMKilled           0          20s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     OOMKilled           0          21s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     Pending             0          12s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     ContainerCreating   0          12s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     OOMKilled           0          23s
countubuntutweets-8ee79198a1773d02-exec-3   0/1     OOMKilled           0          23s
countubuntutweets-8ee79198a1773d02-exec-4   1/1     Running             0          13s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     OOMKilled           0          21s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     OOMKilled           0          22s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     Pending             0          12s
countubuntutweets-8ee79198a1773d02-exec-6   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-6   0/1     Pending             0          1s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     ContainerCreating   0          12s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     OOMKilled           0          23s
countubuntutweets-8ee79198a1773d02-exec-4   0/1     OOMKilled           0          23s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     Terminating         0          12s
countubuntutweets-8ee79198a1773d02-exec-6   0/1     Terminating         0          1s
countubuntutweets-8ee79198a1773d02-exec-6   0/1     Terminating         0          1s
countubuntutweets-8ee79198a1773d02-exec-5   1/1     Terminating         0          13s
count-ubuntu-py-cca56b98a1772240-driver     0/1     Error               0          76s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     Completed           0          13s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     Completed           0          14s
countubuntutweets-8ee79198a1773d02-exec-5   0/1     Completed           0          14s
count-ubuntu-py-cca56b98a1772240-driver     0/1     Error               0          78s

Any suggestions on the config options I am giving the job submit?

spark-client.spark-submit \
    --username $USERNAME --namespace $NS \
    --deploy-mode cluster \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --conf spark.rapids.memory.pinnedPool.size=1G \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
    --conf spark.executor.resource.gpu.vendor=nvidia.com \
    --conf spark.kubernetes.container.image=ghcr.io/canonical/charmed-spark-gpu:3.4-22.04_edge \
    --conf spark.kubernetes.executor.podTemplateFile=gpu_executor_template.yaml \
    s3a://project-c/spark/count-ubuntu.py

Hmm, looks like I am pointing to the old CPU-based script s3a://project-c/spark/count-ubuntu.py :disguised_face:

Updating to s3a://project-c/spark/count-ubuntu-gpu.py and checking again

Thanks @paolosottovia

Looks like the pod template ignores CPU and memory. To specify those resources, use these configs:

    --conf spark.executor.cores=2 \
    --conf spark.executor.memory=8G
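For sizing, note that the executor pod's memory request is roughly the executor heap plus Spark's memory overhead (by default max(10% of heap, 384 MiB) on Kubernetes), and the RAPIDS pinned pool is allocated off-heap, so it should be budgeted on top. A rough back-of-the-envelope sketch (the helper is our own, not part of Spark):

```python
def executor_pod_memory_mib(heap_mib: int,
                            pinned_pool_mib: int = 0,
                            overhead_factor: float = 0.1,
                            min_overhead_mib: int = 384) -> int:
    # Spark on Kubernetes requests heap + memory overhead for the pod;
    # the RAPIDS pinned pool lives off-heap, so budget it as well.
    overhead = max(int(heap_mib * overhead_factor), min_overhead_mib)
    return heap_mib + overhead + pinned_pool_mib

# 8 GiB heap + 1 GiB pinned pool -> about 10 GiB per executor pod
print(executor_pod_memory_mib(8192, 1024))  # 10035
```

If the pod's memory limit is below this total, the kernel will OOM-kill the container even though the JVM heap alone would have fit.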

Also, consider specifying the number of executor instances depending on the number of GPUs available in your cluster:

    --conf spark.executor.instances=1 \
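With one GPU per executor (spark.executor.resource.gpu.amount=1), the number of executors that can actually be scheduled is capped by the GPUs in the cluster; a quick sketch of the arithmetic (our own helper, for illustration):

```python
def max_executor_instances(cluster_gpus: int, gpus_per_executor: int = 1) -> int:
    # Each executor pod requests gpus_per_executor GPUs, so at most this
    # many executors can run concurrently; extra pods stay Pending.
    if gpus_per_executor <= 0:
        raise ValueError("gpus_per_executor must be positive")
    return cluster_gpus // gpus_per_executor

print(max_executor_instances(4))  # 4: a 4-GPU cluster fits 4 executors
```

Requesting more instances than this is harmless but leaves the surplus executors Pending until a GPU frees up.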