Charmed Apache Spark Documentation - Tutorial Setup Environment

theoctober19th · 21 February 2024 07:12

Set up the environment for the tutorial

This section of the tutorial will guide you through the initial environment setup.

Minimum requirements

Before we start, make sure your machine meets the following requirements:

Ubuntu 22.04 (jammy) or later (the tutorial has been prepared and tested to work on 22.04)
8 GB of RAM
2 CPU threads
At least 20 GB of available storage
Access to the internet for downloading the required snaps and charms
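
The requirements above can be checked from a terminal before installing anything. The following is only a minimal sketch and not part of Charmed Apache Spark: the helper names (total_ram_gib, cpu_threads, free_disk_gib, check_min, preflight) are hypothetical, it assumes a Linux host that exposes /proc/meminfo, and it measures free space on the root filesystem, which may not be where snaps store their data on your machine.

```shell
# Print total RAM in GiB, rounded to the nearest whole number.
total_ram_gib() {
  awk '/^MemTotal:/ { printf "%d\n", ($2 / 1024 / 1024) + 0.5 }' /proc/meminfo
}

# Print the number of CPU threads available to this process.
cpu_threads() {
  nproc
}

# Print free space on the root filesystem in whole GiB (rounded down).
# Assumption: snaps and their data live on the root filesystem.
free_disk_gib() {
  df -Pk / | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# check_min LABEL ACTUAL REQUIRED
# Report the comparison and return non-zero if ACTUAL is below REQUIRED.
check_min() {
  local label="$1" actual="$2" required="$3"
  if [ "$actual" -ge "$required" ] 2>/dev/null; then
    echo "OK:   $label = $actual (minimum $required)"
  else
    echo "FAIL: $label = $actual (minimum $required)"
    return 1
  fi
}

# Run all three checks and return non-zero if any of them failed.
preflight() {
  local status=0
  check_min "RAM (GiB)" "$(total_ram_gib)" 8 || status=1
  check_min "CPU threads" "$(cpu_threads)" 2 || status=1
  check_min "Free disk (GiB)" "$(free_disk_gib)" 20 || status=1
  return "$status"
}

preflight || echo "One or more minimum requirements are not met."
```

RAM is rounded to the nearest GiB because a machine sold with 8 GB usually reports slightly less than that in /proc/meminfo.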

Prepare MicroK8s

Charmed Apache Spark is designed to run on top of a Kubernetes cluster. For this tutorial, we are going to use MicroK8s, a lightweight, production-grade, conformant Kubernetes distribution that can run locally.

Installing MicroK8s is as simple as running the following command:

sudo snap install microk8s --channel=1.28-strict/stable

Let’s configure MicroK8s so that the currently logged-in user has admin rights to the cluster.

# Set an alias 'kubectl' that can be used instead of microk8s.kubectl
sudo snap alias microk8s.kubectl kubectl

# Add the current user to the 'snap_microk8s' group
sudo usermod -a -G snap_microk8s ${USER}

# Create and provide ownership of '~/.kube' directory to current user
mkdir -p ~/.kube
sudo chown -f -R ${USER} ~/.kube

# Put the group membership changes into effect
newgrp snap_microk8s

Once done, the status of MicroK8s can be verified with:

microk8s status --wait-ready

When the MicroK8s cluster is running and ready, you should see an output similar to the following:

microk8s is running
high-availability: no
...
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
  disabled:
    cert-manager         # (core) Cloud native certificate management
...

Let’s generate the Kubernetes configuration file using MicroK8s and write it to ~/.kube/config. This is where Kubernetes looks for the Kubeconfig file by default.

microk8s config | tee ~/.kube/config

Now let’s enable a few add-ons that provide role-based access control, local volumes for storage, and load balancing.

# Enable rbac for role-based access control
sudo microk8s enable rbac

# Enable storage and hostpath-storage
sudo microk8s enable storage hostpath-storage

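# Enable metallb for load balancing
# (jq is needed to parse the JSON output of the 'ip' command)
sudo apt install -y jq
# Use the host's preferred source IPv4 address as a single-address MetalLB pool
IPADDR=$(ip -4 -j route get 2.2.2.2 | jq -r '.[] | .prefsrc')
sudo microk8s enable metallb:$IPADDR-$IPADDR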
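
The route lookup in the metallb step sends no traffic to 2.2.2.2; it only asks the kernel which source address it would use to reach an external host. That address is then passed to MetalLB as a pool containing a single IP. The sketch below shows the same idea without jq, by parsing the plain-text output of ip instead. The helper names (extract_src, metallb_arg) and the sample line are hypothetical, and the exact fields printed by ip can differ between systems.

```shell
# extract_src: read 'ip route get' text output on stdin and print the
# address that follows the "src" keyword (the preferred source IP).
extract_src() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# metallb_arg IP: print the add-on argument for a pool that contains only IP.
metallb_arg() {
  printf 'metallb:%s-%s\n' "$1" "$1"
}

# Hypothetical sample of what 'ip -4 route get 2.2.2.2' may print.
sample='2.2.2.2 via 192.168.1.1 dev eth0 src 192.168.1.50 uid 1000'

ip_addr=$(printf '%s\n' "$sample" | extract_src)
echo "sudo microk8s enable $(metallb_arg "$ip_addr")"
# prints: sudo microk8s enable metallb:192.168.1.50-192.168.1.50
```

On a real machine, the sample line would be replaced with the live output, for example ip -4 route get 2.2.2.2 | extract_src.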

Once done, the list of enabled add-ons can be seen with the microk8s status --wait-ready command. The output of the command should look similar to the following:

microk8s is running
...
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    storage              # (core) Alias to hostpath-storage add-on, deprecated
...

Set up MinIO

Apache Spark can be configured to use S3 for object storage. For simplicity, instead of AWS S3 we’re going to use MinIO, an S3-compatible object storage server that is available as a MicroK8s add-on. With MinIO, we can create an S3-compatible bucket locally, which is more convenient than AWS S3 for experimentation.

Let’s enable the minio addon for MicroK8s.

sudo microk8s enable minio

Authentication with MinIO is managed with an access key and a secret key. These credentials are generated and stored as a Kubernetes secret when the minio add-on is enabled.

Let’s fetch these credentials and export them as environment variables in order to use them later.

export ACCESS_KEY=$(kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_ACCESS_KEY}' | base64 -d)
export SECRET_KEY=$(kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_SECRET_KEY}' | base64 -d)
export S3_ENDPOINT=$(kubectl get service minio -n minio-operator -o jsonpath='{.spec.clusterIP}')
export S3_BUCKET="spark-tutorial"
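
Kubernetes stores secret values base64-encoded, which is why the first two commands pipe the result through base64 -d. If the secret or the service does not exist yet (for example, because the add-on is still starting), the variables end up empty and later steps fail in less obvious ways. The following is a small sketch of a guard against that; require_vars is a hypothetical helper, not something shipped with MicroK8s or Charmed Apache Spark.

```shell
# require_vars NAME...
# Print each variable that is unset or empty and return non-zero if any
# were missing. The names are trusted input, so eval is acceptable here.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if require_vars ACCESS_KEY SECRET_KEY S3_ENDPOINT S3_BUCKET; then
  echo "All MinIO variables are set."
else
  echo "Re-run the export commands once the minio add-on is ready."
fi
```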

Later in the tutorial, we will need to create an S3 bucket and upload some sample files into it. The MinIO add-on offers a built-in web UI which can be used to interact with the local S3 object storage. Alternatively, we can use the AWS CLI if we prefer CLI commands over a graphical user interface.

To set up the AWS CLI, let’s run the following commands:

sudo snap install aws-cli --classic

aws configure set aws_access_key_id "$ACCESS_KEY"
aws configure set aws_secret_access_key "$SECRET_KEY"
aws configure set region "us-west-2"
aws configure set endpoint_url "http://$S3_ENDPOINT"

To open the MinIO web UI in the browser, we will need the IP address and port at which it is exposed.

Let’s fetch the MinIO web interface URL as follows:

MINIO_UI_IP=$(kubectl get service microk8s-console -n minio-operator -o jsonpath='{.spec.clusterIP}')
MINIO_UI_PORT=$(kubectl get service microk8s-console -n minio-operator -o jsonpath='{.spec.ports[0].port}')
export MINIO_UI_URL=$MINIO_UI_IP:$MINIO_UI_PORT

The MinIO web UI URL is a combination of an IP address and port. Print it by running:

echo $MINIO_UI_URL

Let’s open this URL in a web browser. On the login page, the username is the access key and the password is the secret key we fetched earlier.

These credentials can now be viewed simply by echoing the variables ACCESS_KEY and SECRET_KEY:

echo $ACCESS_KEY
echo $SECRET_KEY

Once you’re logged in, you’ll see the MinIO console.

The list of the buckets currently in our S3 storage is empty. That’s because we have not created any buckets yet! Let’s proceed to create a new bucket now.

Click the “Create Bucket +” button on the top right. On the next screen, let’s choose “spark-tutorial” for the name of the bucket and click “Create Bucket”.

Alternatively, if you prefer to use the AWS CLI, the bucket can be created with the following command:

aws s3 mb s3://spark-tutorial

That’s it. We now have an S3 bucket available locally on our system! This can be verified by listing the S3 buckets using the following command:

aws s3 ls
# 
# 2024-02-07 07:47:05 spark-tutorial

With the access key, secret key and the endpoint properly configured, you should see the spark-tutorial bucket listed in the output.
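
If you expect to repeat this setup, it can help to create the bucket only when it is missing. The following is a sketch under a few assumptions: ensure_bucket is a hypothetical helper, it relies on the credentials and endpoint configured with aws configure earlier, and it treats any failure of the head-bucket call (including a permissions error) as "bucket does not exist".

```shell
# ensure_bucket NAME
# Create s3://NAME only if a head-bucket request for it does not succeed.
ensure_bucket() {
  local bucket="$1"
  if aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1; then
    echo "Bucket $bucket already exists."
  else
    aws s3 mb "s3://$bucket"
  fi
}
```

With the variables exported earlier, it would be called as ensure_bucket "$S3_BUCKET".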

Set up Juju

Juju is an Operator Lifecycle Manager (OLM) for clouds, bare metal, LXD or Kubernetes. We’ll use Juju to deploy and manage the Spark History Server and a number of other applications that will later be integrated with Apache Spark. Let’s install and configure a Juju client using a snap.

sudo snap install juju --channel 3.1/stable

# The juju snap is strictly confined, so this directory must be created manually
mkdir -p ~/.local/share

Juju can automatically detect all available clouds on our local machine without any additional setup or configuration. You can verify this by running the juju clouds command. You should see an output similar to the following:

Only clouds with registered credentials are shown.
There are more clouds, use --all to see them.
You can bootstrap a new controller using one of these clouds...

Clouds available on the client:
Cloud      Regions  Default    Type  Credentials  Source    Description
localhost  1        localhost  lxd   0            built-in  LXD Container Hypervisor
microk8s   1        localhost  k8s   0            built-in  A Kubernetes Cluster

As you can see, Juju has detected both the LXD and the MicroK8s installations on the system. To be able to deploy Kubernetes charms, let’s bootstrap a Juju controller in the microk8s cloud:

juju bootstrap microk8s spark-tutorial

The creation of the new controller can be verified with the juju controllers command. The output of the command should be similar to:

Use --refresh option with this command to see the latest information.

Controller       Model  User   Access     Cloud/Region        Models  Nodes  HA  Version
spark-tutorial*  -      admin  superuser  microk8s/localhost       1      1   -  3.1.7

Set up spark-client snap and service accounts

When Spark jobs are run on top of Kubernetes, a set of resources such as a service account, associated roles and role bindings need to be created and configured. To simplify this task, the Charmed Apache Spark solution offers the spark-client snap.

Let’s first install the spark-client snap:

sudo snap install spark-client --channel 3.4/edge

Let’s create a Kubernetes namespace for us to use as a playground in this tutorial.

kubectl create namespace spark

We will now create a Kubernetes service account that will be used to run the Spark jobs. This can be done with the spark-client snap, which creates the service account together with the necessary roles, role bindings and other configuration.

spark-client.service-account-registry create \
  --username spark --namespace spark

This command does a number of things in the background. First, it creates a service account named spark in the spark namespace. Then it creates a role named spark-role with all the required RBAC permissions and binds that role to the service account by creating a role binding.

These resources can be viewed with kubectl get commands as follows:

kubectl get serviceaccounts -n spark
# NAME      SECRETS   AGE
# default   0         50s
# spark     0         15s

kubectl get roles -n spark
# NAME         CREATED AT
# spark-role   2024-02-16T12:08:55Z

kubectl get rolebindings -n spark
# NAME                 ROLE              AGE
# spark-role-binding   Role/spark-role   69s
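
Listing the resources shows that they exist, but not what the service account is actually allowed to do. One way to probe that is kubectl auth can-i with impersonation. The following is a sketch only: sa_identity and sa_can are hypothetical helpers, the current user must be allowed to impersonate service accounts (a cluster admin normally is), and the verbs and resources you query are your own choice, since the exact rules in spark-role are not listed here.

```shell
# sa_identity NAMESPACE NAME
# Print the user name Kubernetes uses for a service account.
sa_identity() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# sa_can NAMESPACE NAME VERB RESOURCE
# Ask the API server whether the service account may perform VERB on
# RESOURCE inside NAMESPACE. kubectl prints "yes" or "no".
sa_can() {
  kubectl auth can-i "$3" "$4" --namespace "$1" --as "$(sa_identity "$1" "$2")"
}
```

For example, sa_can spark spark create pods asks whether the spark service account may create pods in the spark namespace, which Spark needs in order to launch executors.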

For Apache Spark to be able to access and use our local S3 bucket, we need to provide a few configurations including the bucket endpoint, access key and secret key. In the Charmed Apache Spark solution, we bind these configurations to a Kubernetes service account so that when Spark jobs are executed with that service account, all the configurations bound to it are supplied to Apache Spark automatically.

The S3 configurations can be added to the spark service account we just created with the following command:

spark-client.service-account-registry add-config \
  --username spark --namespace spark \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=$ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.endpoint=$S3_ENDPOINT \
  --conf spark.hadoop.fs.s3a.secret.key=$SECRET_KEY

The list of configurations bound to the service account spark can be verified with the command:

spark-client.service-account-registry get-config \
  --username spark --namespace spark

You should see the following list of configurations in the output:

spark.hadoop.fs.s3a.access.key=<access_key> 
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider 
spark.hadoop.fs.s3a.connection.ssl.enabled=false 
spark.hadoop.fs.s3a.endpoint=<s3_endpoint>
spark.hadoop.fs.s3a.path.style.access=true 
spark.hadoop.fs.s3a.secret.key=<secret_key>
spark.kubernetes.authenticate.driver.serviceAccountName=spark
spark.kubernetes.namespace=spark
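
Note that this output contains the S3 access key and secret key in plain text. If you want to share it or paste it into a bug report, the keys can be masked first. The following is a minimal sketch; redact_s3_keys is a hypothetical helper, and it assumes the key=value line format shown above. The sample values are made up.

```shell
# redact_s3_keys: read key=value lines on stdin and mask the values of
# the S3 access key and secret key; all other lines pass through unchanged.
redact_s3_keys() {
  sed -E 's/^(spark\.hadoop\.fs\.s3a\.(access|secret)\.key)=.*/\1=<redacted>/'
}

# Hypothetical sample input in the same format as the output above.
printf '%s\n' \
  'spark.hadoop.fs.s3a.access.key=EXAMPLEACCESS' \
  'spark.hadoop.fs.s3a.endpoint=10.152.183.10' \
  'spark.hadoop.fs.s3a.secret.key=EXAMPLESECRET' | redact_s3_keys
# prints:
# spark.hadoop.fs.s3a.access.key=<redacted>
# spark.hadoop.fs.s3a.endpoint=10.152.183.10
# spark.hadoop.fs.s3a.secret.key=<redacted>
```

In practice, the output of the get-config command would be piped into redact_s3_keys in place of the sample lines.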

That’s it. We’re now ready to dive head-first into Apache Spark!

In the next section, we’ll start submitting commands to Apache Spark using the built-in interactive shell.