Running Spark from Kubeflow notebooks (namespace integration issue)

Hi all,

When installing Kubeflow, it prompts me to create an initial namespace. Spark, however, is deployed in a different namespace.

From Kubeflow Jupyter notebooks, I’m unable to call Spark (e.g., start a SparkSession) because the services are isolated. The Kubeflow UI doesn’t seem to allow switching or importing another namespace.

Questions:

  • Is there a way to run Spark jobs from Kubeflow notebooks if Spark is in another namespace?
  • Should Spark be deployed into the same namespace, or is there a recommended cross-namespace setup (RBAC, service exposure, etc.)?

thanks

Hi @afrogrit .

Yes, the issue is that every Kubeflow namespace is registered under Istio. We have tested in the past that you can both run Spark-powered Jupyter notebooks (where the notebook acts as the driver and spawns the executors, on which data is cached and processed) and have pipeline steps that leverage Spark parallelization.

However, in order to do this, you have to instruct the Istio sidecar on the Spark pods to allow traffic on the ports used for Spark driver <> executor communication. You can do this by using PodDefaults.

Here is the PodDefault that we used in the past for Jupyter + pipelines (it assumes you have created a spark service account in the selected namespace, e.g. using spark-client-snap, see here):

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
    name: pyspark
spec:
    annotations:
        traffic.sidecar.istio.io/excludeInboundPorts: 37371,6060
        traffic.sidecar.istio.io/excludeOutboundPorts: 37371,6060
    env:
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: SPARK_SVC_ACCOUNT
      value: spark
    args:
    - --namespace
    - $(POD_NAMESPACE)
    - --username
    - spark
    - --conf
    - spark.driver.port=37371
    - --conf
    - spark.kubernetes.container.image=ghcr.io/canonical/charmed-spark@sha256:1d9949dc7266d814e6483f8d9ffafeff32f66bb9939e0ab29ccfd9d5003a583a
    - --conf
    - spark.blockManager.port=6060
    desc: Configure Canonical PySpark
    selector:
        matchLabels:
            canonical-pyspark: "true"
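To wire this up, a sketch like the following could work. All names here are assumptions: the namespace is your Kubeflow user namespace, the snap sub-command follows the spark-client docs (adjust to your version), and the YAML file is the PodDefault above saved locally.

```shell
# Hypothetical helper: create the 'spark' service account referenced by the
# PodDefault's --username, then apply the PodDefault into your namespace.
setup_spark_poddefault() {
  local namespace="$1"

  # Register a 'spark' service account with the spark-client snap.
  spark-client.service-account-registry create \
    --username spark --namespace "$namespace"

  # Apply the PodDefault above (saved as pyspark-poddefault.yaml).
  kubectl apply -n "$namespace" -f pyspark-poddefault.yaml
}
# Usage: setup_spark_poddefault <your-kubeflow-namespace>
```

When you then create a notebook from the Kubeflow UI, the PodDefault should show up as a "Configure Canonical PySpark" configuration option; selecting it adds the canonical-pyspark: "true" label so the selector above matches.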

Let me know if this clarifies things a bit. Also, a few months ago, @paolosottovia, @theoctober19th, @michalhucko and @nohaihab experimented a bit with the integration and can probably provide more details if needed.

Cheers, Enrico

Thank you @deusebio

This is a good start. However, I worked with the spark-client snap for a few days, and then the

spark.metrics.conf.*.sink.prometheus.class=org.apache.spark.banzaicloud.metrics.sink.PrometheusSink

error appeared and has never gone away, even though Spark is running in cluster mode, i.e. inside K8s.

I wrote in the forums and @theoctober19th mentioned that the spark-client snap is not supported.

If I'm to follow the above, I need to have the spark client running in order to set up the PodDefault, right?

Hi

What @theoctober19th meant here is that the monitoring configuration is supposed to be used only when drivers run as pods in Kubernetes. When you install the spark client locally and use it to run pyspark, spark-shell, or spark-submit with deploy-mode=client, the driver runs locally, not on K8s, and monitoring should not be used (most likely the spark client would not even resolve the Prometheus push gateway endpoint used to send the metrics).

On the other hand, when you run the driver as a pod on Kubernetes (as above), you can use the charmed spark image, which ships with scripts/aliases that mimic the snap entry points, such as spark-client.pyspark, and also bundles all the binaries needed to send logs/metrics to Prometheus. Moreover, since you are now on K8s, monitoring will work, as the Prometheus push gateway endpoints will be reachable.
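For instance, from a terminal in a notebook based on the charmed spark image, an interactive session could be started along these lines. This is a sketch: the alias name comes from the description above, and POD_NAMESPACE plus the spark account are assumed to be injected by the PodDefault shown earlier.

```shell
# Hypothetical wrapper around the snap-style alias shipped in the
# charmed-spark image; POD_NAMESPACE is injected by the PodDefault env.
start_pyspark() {
  spark-client.pyspark \
    --username spark \
    --namespace "${POD_NAMESPACE:?expected from the PodDefault env}"
}
# Usage (inside the notebook pod): start_pyspark
```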

Does this explain a bit more?

error began and never left, although spark is running in cluster mode i.e. inside K8s

This doesn’t sound right. Could you provide your config with spark-client.get-config, the spark submit command and the logs that you get?

Thanks @deusebio

I haven't customised the setup much yet; I'm trying to get the most out of the installation docs. For instance, is there a way to disable metrics on the spark-client snap? Below are the logs:

spark-client.spark-shell

spark-client.spark-submit

So, judging from the logs you shared, spark-submit seems to be working, but spark-shell does not (as expected, given the above).

When using spark-shell, you could either create a dedicated service account just for running shell commands that does not include any monitoring configuration, or reset (and effectively toggle off) the monitoring configuration by supplying an empty value, e.g. --conf spark.metrics.conf.driver.sink.prometheus.class=
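As a sketch, the toggle-off variant could look like this; the empty-value conf is the one discussed above, while the wrapper function itself is hypothetical.

```shell
# Launch spark-shell with the Prometheus driver sink reset to an empty
# value, overriding whatever is stored in the service-account config.
spark_shell_no_metrics() {
  spark-client.spark-shell \
    --username spark \
    --conf spark.metrics.conf.driver.sink.prometheus.class= \
    "$@"
}
# Usage: spark_shell_no_metrics
```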

BTW, just to give you a heads-up: in the coming months we are planning to work on a native integration between Spark and Kubeflow, which should make the process waaaay easier, just by relating a couple of charms. So hopefully in a few months (say, end of January) we should have a much better story around this.