Spark Client Snap Tutorial - Spark Submit

Spark Job Submission To Kubernetes Cluster

The spark-client snap ships the Apache Spark spark-submit utility for the Kubernetes distribution of Spark.

Pre-requisites

Before using the spark-submit utility, make sure that you have a service account available. Note that for running applications as outlined in this guide, you DON'T need administrative rights on the Kubernetes cluster. A service account created by an administrator (more details on the functionalities here) already provides the minimal set of permissions to run Spark jobs in the associated namespace on K8s.
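
For reference, such a service account would typically have been created beforehand by an administrator with the service-account-registry tool shipped in the same snap; a minimal sketch, assuming the demouser/demonamespace names used later in this guide:

spark-client.service-account-registry create \
--username demouser \
--namespace demonamespace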

Validating Setup with an Example Spark Job

To validate your setup, you can launch the Pi example bundled with Apache Spark:

SPARK_EXAMPLES_JAR_NAME='spark-examples_2.12-3.4.1.jar'
        
spark-client.spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/$SPARK_EXAMPLES_JAR_NAME 100

Note: If executor pods fail to schedule due to insufficient CPU resources (either locally or in CI/CD pipelines), issue fractional CPU requests, as shown below.
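
For instance, a fractional request of a tenth of a core per executor (the 0.1 value is purely illustrative) can be passed through the standard Spark on Kubernetes property:

spark-client.spark-submit \
--deploy-mode cluster \
--conf spark.kubernetes.executor.request.cores=0.1 \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/$SPARK_EXAMPLES_JAR_NAME 100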

The command above uses the default (spark) user. Following the example from the previous chapter, the command takes two more parameters (--username, --namespace). However, since we have already set the deploy-mode default for that user, the --deploy-mode flag can be skipped:

spark-client.spark-submit \
--username demouser \
--namespace demonamespace \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/$SPARK_EXAMPLES_JAR_NAME 100

To monitor your submission, you can inspect the corresponding Kubernetes pods. Typically:

$ kubectl get pod
org-apache-spark-examples-sparkpi-bd526f87e1deb586-driver   0/1     Completed     0             18h
spark-pi-32f7f187e5c9ea7f-exec-3                            0/1     Terminating   0             2m8s
$ kubectl logs -f org-apache-spark-examples-sparkpi-bd526f87e1deb586-driver
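
If your service account lives in a dedicated namespace, pass the -n flag to kubectl; kubectl describe is also useful when a pod stays in Pending (the pod name below is just a placeholder):

kubectl get pods -n demonamespace --watch
kubectl describe pod <driver-pod-name> -n demonamespace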

Adding Big Data to the mix

It's time to test it out with a real big data workload. Here we assume that:

  • the input data is placed in S3
  • the code, i.e. the Python script, is also placed in S3 and reads from the provided input location
  • the destination directory for the output is also in S3
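
If the input data and the script still need to be uploaded, one way to stage them is the AWS CLI (the bucket name, endpoint, and file names below are placeholders):

aws s3 cp ./my_script.py s3://my-bucket/scripts/my_script.py --endpoint-url <YOUR_S3_ENDPOINT>
aws s3 cp ./input.csv s3://my-bucket/input/input.csv --endpoint-url <YOUR_S3_ENDPOINT>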

To launch the Spark job in this scenario, make sure the S3 access-related information is available to you, then execute the following commands.

APP_NAME='my-pyspark-app'
NUM_INSTANCES=5
NAMESPACE=<namespace for your spark K8s service account>
K8S_SERVICE_ACCOUNT_FOR_SPARK=<your spark K8s service account>

S3_ACCESS_KEY=<your s3 access key>
S3_SECRET_KEY=<your s3 secret key>
S3A_CREDENTIALS_PROVIDER_CLASS=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
S3A_ENDPOINT=<your s3 endpoint>
S3A_SSL_ENABLED=false
S3_PATH_FOR_CODE_PY_FILE=</path/to/your/python_script_in_S3.py>

spark-client.spark-submit --deploy-mode cluster --name $APP_NAME \
--conf spark.executor.instances=$NUM_INSTANCES \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=$K8S_SERVICE_ACCOUNT_FOR_SPARK \
--conf spark.hadoop.fs.s3a.access.key=$S3_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$S3_SECRET_KEY \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=$S3A_CREDENTIALS_PROVIDER_CLASS \
--conf spark.hadoop.fs.s3a.endpoint=$S3A_ENDPOINT \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=$S3A_SSL_ENABLED \
--conf spark.hadoop.fs.s3a.path.style.access=true \
$S3_PATH_FOR_CODE_PY_FILE
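
For illustration, the placeholders above could be filled along these lines (the endpoint, bucket, and file names are hypothetical); note that paths read by the job, including the script itself, use the s3a:// scheme:

S3A_ENDPOINT="https://s3.example.com"
S3_PATH_FOR_CODE_PY_FILE='s3a://my-bucket/scripts/my_script.py'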

Such configuration parameters can be provided via a spark-defaults.conf config file, placed as described here:

  • either set SPARK_HOME and place the config as $SPARK_HOME/conf/spark-defaults.conf, or
  • override SPARK_CONFS and place the config as $SPARK_CONFS/spark-defaults.conf, as sketched below
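
A minimal sketch of the second option (the paths are placeholders):

export SPARK_CONFS=/path/to/my/spark/confs
mkdir -p $SPARK_CONFS
cp my-spark-defaults.conf $SPARK_CONFS/spark-defaults.conf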

For example, with a spark-defaults.conf as provided below for reference, we can make the submit command much simpler.

spark.master=k8s://https://<MY_K8S_CONTROL_PLANE_HOST_IP>:<MY_K8S_CONTROL_PLANE_PORT>
spark.kubernetes.context=<PREFERRED_K8S_CONTEXT>
spark.app.name=<SPARK_APP_NAME>
spark.executor.instances=<NUM_INSTANCES>
spark.kubernetes.container.image=<CONTAINER_IMAGE_PUBLIC_REF>
spark.kubernetes.container.image.pullPolicy=<PULL_POLICY>
spark.kubernetes.namespace=<NAMESPACE_OF_PREFERRED_SERVICEACCOUNT>
spark.kubernetes.authenticate.driver.serviceAccountName=<PREFERRED_SERVICEACCOUNT>
spark.eventLog.enabled=false
spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY>
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.endpoint=<S3_ENDPOINT_URI>
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.path.style.access=true

With a valid configuration file placed appropriately, the submit command becomes straightforward:

spark-client.spark-submit --deploy-mode cluster $S3_PATH_FOR_CODE_PY_FILE

The configuration defaults can be overridden as well in the submit command with --conf arguments as demonstrated previously.
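
For example, to scale out a single run without touching the config file:

spark-client.spark-submit \
--deploy-mode cluster \
--conf spark.executor.instances=10 \
$S3_PATH_FOR_CODE_PY_FILE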


Hi @deusebio

I am trying to run the example and I can see the pod terminated, but I am seeing something odd in the pod logs:

$ kubectl logs demo-spark-app-7210798a1f8a2c11-driver -n demonamespace
# [..]
2023-08-22T23:19:21.954Z [entrypoint] Files  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar from /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar to /opt/spark/./spark-examples_2.12-3.3.2.jar
2023-08-22T23:19:21.957Z [entrypoint] Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar
# [..]

Does this file (/opt/spark/examples/jars/$SPARK_EXAMPLES_JAR_NAME) need to exist on the host where my spark-client snap is installed?

If so, where can I get that file?

Here are the full pod logs and spark-client logs.

Hi @gustavosr98!

When you run the workload with --deploy-mode cluster, the Spark driver runs in Kubernetes. See here for more information.

Therefore, when using the local:... prefix with --deploy-mode cluster, the class needs to be present in the Spark image you are running, whereas with --deploy-mode client it needs to be present locally.
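
For instance, one way to double-check which example jars ship inside the image you are pointing at (replace <SPARK_IMAGE> with whatever spark.kubernetes.container.image resolves to in your setup):

docker run --rm --entrypoint ls <SPARK_IMAGE> /opt/spark/examples/jars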

Having said this, I have realized that the documentation was not up to date with respect to the Spark version: we now support Spark 3.4.1, therefore the name should be SPARK_EXAMPLES_JAR_NAME='spark-examples_2.12-3.4.1.jar'.

I have updated the docs! Thanks for flagging this!

Hi @deusebio,

Would you happen to have an example python script for the Big Data / S3 portion of the tutorial?

I am giving it a try with a simple Python hello world. I am unsure whether it needs anything specific in there.

Most probably unrelated, but I am getting an issue with the service account:

APP_NAME='demo-spark-app'
NUM_INSTANCES=5
NAMESPACE=demonamespace
K8S_SERVICE_ACCOUNT_FOR_SPARK=ubuntu

S3_ACCESS_KEY=minio
S3_SECRET_KEY=miniominio
S3A_CREDENTIALS_PROVIDER_CLASS=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
S3A_ENDPOINT="http://10.152.183.226:9000"
S3A_SSL_ENABLED=false
S3_PATH_FOR_CODE_PY_FILE='./test.py'

spark-client.spark-submit --deploy-mode cluster --username $K8S_SERVICE_ACCOUNT_FOR_SPARK --namespace $NAMESPACE --name $APP_NAME \
--conf spark.executor.instances=$NUM_INSTANCES \
--conf spark.kubernetes.namespace=$NAMESPACE \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=$K8S_SERVICE_ACCOUNT_FOR_SPARK \
--conf spark.hadoop.fs.s3a.access.key=$S3_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$S3_SECRET_KEY \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=$S3A_CREDENTIALS_PROVIDER_CLASS \
--conf spark.hadoop.fs.s3a.endpoint=$S3A_ENDPOINT \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=$S3A_SSL_ENABLED \
--conf spark.hadoop.fs.s3a.path.style.access=true \
$S3_PATH_FOR_CODE_PY_FILE

Output

2023-08-25 17:37:56.000+0000 ERROR [spark8t.cli.spark_submit] (MainThread) (<module>) Account ubuntu not found

Listing service account

$ spark-client.service-account-registry list
2023-08-25 17:38:26.000+0000 INFO [spark8t.cli.service_account_registry] (MainThread) (main) Using K8s context: microk8s
2023-08-25 17:38:26.000+0000 INFO [spark8t.cli.service_account_registry] (MainThread) (main) demonamespace:ubuntu    False

Any hints?

Uhm, strange. Can you try to also open a pyspark or a Scala shell (with spark-client.pyspark or spark-client.spark-shell) using the same Spark service account? Can you also make sure the resources indeed exist, using kubectl get sa -n demonamespace and kubectl get secrets -n demonamespace?

Also, just one tip: maybe you should store all those configurations in the service account using:

spark-client.service-account-registry add-config (see here), so that you don't have to type them every time.
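
A minimal sketch of what that could look like for the S3 settings above (double-check the exact flags against the configuration management guide linked above):

spark-client.service-account-registry add-config \
--username ubuntu --namespace demonamespace \
--conf spark.hadoop.fs.s3a.endpoint=http://10.152.183.226:9000 \
--conf spark.hadoop.fs.s3a.access.key=minio \
--conf spark.hadoop.fs.s3a.secret.key=miniominio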