Run Spark Interactively
Apache Spark comes with two interactive shells:
- spark-shell, built on top of the Scala REPL
- pyspark, built on top of the Python interpreter
Interactive Spark Shell
To support interactive usage, the spark-client snap ships with Apache Spark's spark-shell utility.
It is a useful tool for validating your assumptions about Spark in Scala before they surface as failures in an actual long-running job.
Let us test out our spark-shell setup with a simple example.
$ spark-client.spark-shell
....
....
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
....
....
scala> import scala.math.random
scala> val slices = 1000
scala> val n = math.min(100000L * slices, Int.MaxValue).toInt
scala> val squares_sum = spark.sparkContext.parallelize(1 until n, slices).map { i => i.toDouble * i }.reduce(_ + _)
scala> println(s"Sum of squares is ${squares_sum}")
scala> :quit
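You can sanity-check the distributed result against the closed-form identity: the sum of i² for i from 1 to n - 1 equals (n - 1) * n * (2n - 1) / 6 (the session above parallelizes 1 until n, i.e. 1 through n - 1). The following is a minimal sketch in plain Scala, assuming the same slices and n as above; it uses BigInt to keep the arithmetic exact, whereas the distributed sum of Double values may differ by a small floating-point rounding error.
// Closed-form sum of squares for i = 1 until n, i.e. (n - 1) * n * (2n - 1) / 6.
// BigInt keeps the arithmetic exact for n = 100,000,000.
val slices = 1000
val n = math.min(100000L * slices, Int.MaxValue).toInt
val exact = BigInt(n - 1) * n * (2L * n - 1) / 6
println(s"Closed-form sum of squares is ${exact}")
You can paste these lines into the same spark-shell session (before :quit) to compare the two values.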
Interactive PySpark Shell
For interactive Python shell usage, the spark-client snap ships with Apache Spark's pyspark utility.
Make sure that Python is installed on your system. Then, execute the following commands to validate that your pyspark setup is working.
$ spark-client.pyspark
....
....
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
....
....
>>> from operator import add
>>> partitions = 1000
>>> n = 100000 * partitions
>>> def square(x: int) -> int:
...     return x ** 2
...
>>> squares_sum = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(square).reduce(add)
>>> print("Sum of squares is %f" % (squares_sum))
>>> quit()
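The same sanity check applies to the PySpark session: the sum of i² for i from 1 to n equals n * (n + 1) * (2n + 1) / 6 (the example above parallelizes range(1, n + 1), i.e. 1 through n). Below is a minimal sketch in plain Python, assuming the same partitions and n as above; since Python integers are arbitrary precision, it should match the distributed result exactly.
# Closed-form sum of squares for i in range(1, n + 1), i.e. n * (n + 1) * (2n + 1) / 6.
partitions = 1000
n = 100000 * partitions
exact = n * (n + 1) * (2 * n + 1) // 6
print("Closed-form sum of squares is %d" % exact)
These lines can be pasted into the same pyspark session (before quit()) to compare the two values.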