New BigData Initiative
We (@tyler and I) have started cutting new versions of the ASF charms so that we can run more recent releases of the software than are available in the Bigtop releases/upstream charms. We quickly set our sights on a few additional goals for these charms to accommodate our environment's needs for multi-homed networking and managed storage in some of the ASF components.
The primary goal of these charms is to let a user deploy Spark workloads in a multi-homed network environment, leverage Juju storage for each software component's individual storage needs, and deploy whatever version of the upstream software they please by supplying the software tarball as a resource.
Some key features:
- New big data development workflow component: Conda charm.
- Usability enhancement: Spark configuration for use with radosgw or AWS S3 out of the box (a rough sketch of what this amounts to follows this list).
- Network space support: Zookeeper, Jupyter-Notebook, Spark.
- Juju storage support: Zookeeper.
- S3 support (rough): Jupyter-Notebook, Spark.
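To give a feel for what the S3/radosgw support amounts to, the object-storage-gateway, aws-access-key, and aws-secret-key options in the bundle further down presumably boil down to the standard Hadoop s3a properties in the Spark configuration. Here is a minimal hand-rolled sketch of the same idea from PySpark; the fs.s3a.* names are stock Hadoop S3A settings rather than anything specific to these charms, the endpoint/credential values are placeholders, and it assumes the hadoop-aws and AWS SDK jars are already on the classpath (which the spark charm is meant to take care of for you).

from pyspark.sql import SparkSession

# Rough, hand-rolled equivalent of the charm's out-of-the-box S3/radosgw setup.
# fs.s3a.* are standard Hadoop S3A properties; all values below are placeholders.
spark = (
    SparkSession.builder
    .appName('s3a-config-sketch')
    .config('spark.hadoop.fs.s3a.endpoint', '<object-storage-endpoint-url>')
    .config('spark.hadoop.fs.s3a.access.key', '<s3-access-key>')
    .config('spark.hadoop.fs.s3a.secret.key', '<s3-secret-key>')
    # Path-style access is the usual choice for a radosgw endpoint.
    .config('spark.hadoop.fs.s3a.path.style.access', 'true')
    .getOrCreate()
)

# Once the properties are in place, s3a:// paths behave like any other filesystem.
spark.read.text('s3a://<bucket>/<path>/datafile.txt').show(1)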
Note: We had to cut these charms in a time box, so they have quite a bit of room for improvement across the board. I basically wanted to get them out there and working, and then start iterating on them to make them more generally useful. They no doubt have some rough edges that we intend to smooth out quickly.
Below are the first cuts. More to come in the near future.
Zookeeper
Charmstore: https://jujucharms.com/u/omnivector/zookeeper/
Github:
* layer-zookeeper - https://github.com/omnivector-solutions/layer-zookeeper
* interface-zookeeper - https://github.com/omnivector-solutions/interface-zookeeper
Spark
Charmstore: https://jujucharms.com/u/omnivector/spark/
Github:
* layer-spark - https://github.com/omnivector-solutions/layer-spark
* layer-spark-base - https://github.com/omnivector-solutions/layer-spark-base
* layer-hadoop-base - https://github.com/omnivector-solutions/layer-hadoop-base
* interface-spark - https://github.com/omnivector-solutions/interface-spark
Jupyter-Notebook + Spark
Charmstore: https://jujucharms.com/u/omnivector/jupyter-notebook/
Github:
* layer-jupyter-notebook - https://github.com/omnivector-solutions/layer-jupyter-notebook
Conda
Charmstore: https://jujucharms.com/u/omnivector/conda/
Github:
* layer-conda - https://github.com/omnivector-solutions/layer-conda
* layer-conda-api - https://github.com/omnivector-solutions/layer-conda-api
Jupyter-Notebook + Conda + Spark
Just an example of how these come together. The object storage gateway could be an AWS S3 endpoint or a Ceph object storage gateway (radosgw) endpoint. This stack is primarily aimed at the Spark standalone cluster use case, but because the jupyter-notebook charm is built with layer-spark, it is also a great way, on its own, to deploy Spark 2.4.x workloads to k8s from a Jupyter notebook. Here is the bundle I've been beating on.
series: bionic
applications:
  spark:
    charm: cs:~omnivector/spark
    constraints: "tags=bdx-test spaces=mgmt,access"
    num_units: 3
    options:
      object-storage-gateway: "<object-storage-endpoint-url>"
      aws-access-key: "<s3-access-key>"
      aws-secret-key: "<s3-secret-key>"
    bindings:
      "": mgmt
      spark: access
  jupyter-notebook:
    charm: cs:~omnivector/jupyter-notebook
    constraints: "tags=bdx-test spaces=mgmt,access"
    num_units: 1
    options:
      object-storage-gateway: "<object-storage-endpoint-url>"
      aws-access-key: "<s3-access-key>"
      aws-secret-key: "<s3-secret-key>"
    bindings:
      "": mgmt
      http: access
  conda:
    charm: cs:~omnivector/conda
    num_units: 0
    options:
      conda-extra-packages: "pyspark=2.4.0 numpy ipykernel pandas pip"
      conda-extra-pip-packages: "psycopg2 Cython git+https://<oauthkey>:x-oauth-basic@github.com/<my-private-org>/<my-private-repo>@master"
relations:
- - spark:juju-info
  - conda:juju-info
- - jupyter-notebook:juju-info
  - conda:juju-info
Model Controller Cloud/Region Version SLA Timestamp
spark01 pdl-maas pdl-maas 2.5.4 unsupported 03:07:31Z
App Version Status Scale Charm Store Rev OS Notes
conda-pdlda active 6 conda jujucharms 13 ubuntu
jupyter-notebook active 1 jupyter-notebook jujucharms 19 ubuntu
pdl-bdx-conda00 active 6 conda jujucharms 13 ubuntu
spark 2.4.1 active 5 spark jujucharms 14 ubuntu
Unit Workload Agent Machine Public address Ports Message
jupyter-notebook/57* active idle 127 10.10.11.29 8888/tcp http://10.100.211.10:8888
conda-pdlda/11 active idle 10.10.11.29 Conda Env Installed: conda-pdlda
pdl-bdx-conda00/5 active idle 10.10.11.29 Conda Env Installed: pdl-bdx-conda00
spark/123 active idle 128 10.10.11.35 7078/tcp,8081/tcp Services: worker
conda-pdlda/6* active idle 10.10.11.35 Conda Env Installed: conda-pdlda
pdl-bdx-conda00/3* active idle 10.10.11.35 Conda Env Installed: pdl-bdx-conda00
spark/124* active idle 129 10.10.11.31 7077/tcp,7078/tcp,8080/tcp,8081/tcp,18080/tcp Running: master,worker,history
conda-pdlda/10 active idle 10.10.11.31 Conda Env Installed: conda-pdlda
pdl-bdx-conda00/2 active idle 10.10.11.31 Conda Env Installed: pdl-bdx-conda00
spark/125 active idle 130 10.10.11.37 7078/tcp,8081/tcp Services: worker
conda-pdlda/9 active idle 10.10.11.37 Conda Env Installed: conda-pdlda
pdl-bdx-conda00/1 active idle 10.10.11.37 Conda Env Installed: pdl-bdx-conda00
spark/126 active idle 131 10.10.11.17 7078/tcp,8081/tcp Services: worker
conda-pdlda/7 active idle 10.10.11.17 Conda Env Installed: conda-pdlda
pdl-bdx-conda00/4 active idle 10.10.11.17 Conda Env Installed: pdl-bdx-conda00
spark/127 active idle 132 10.10.11.40 7078/tcp,8081/tcp Services: worker
conda-pdlda/8 active idle 10.10.11.40 Conda Env Installed: conda-pdlda
pdl-bdx-conda00/0 active idle 10.10.11.40 Conda Env Installed: pdl-bdx-conda00
Machine State DNS Inst id Series AZ Message
127 started 10.10.11.29 d3-util-03 bionic d3 Deployed
128 started 10.10.11.35 d3-util-04 bionic d3 Deployed
129 started 10.10.11.31 d4-util-05 bionic d4 Deployed
130 started 10.10.11.37 d3-util-01 bionic d3 Deployed
131 started 10.10.11.17 d4-util-06 bionic d4 Deployed
132 started 10.10.11.40 d4-util-03 bionic d4 Deployed
Following deployment of the bundle above, you should be able to log in to the Jupyter notebook and start running jobs that access your object storage via s3a. In this way you can run distributed Spark/PySpark workloads in Spark standalone mode using Ceph object storage as a backend, eliminating the need for YARN, Hadoop, and/or HDFS.
A simple example:
import os

# Use the Python interpreter from the conda environment on the workers.
os.environ['PYSPARK_PYTHON'] = '/opt/anaconda/envs/conda/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf

# Point the driver at the standalone master (port 7077 on the spark charm's master unit).
conf = SparkConf()\
    .setAppName('spark_playground')\
    .setMaster('spark://<master-ip-address>:7077')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# Read the first line of a file stored behind the object storage gateway.
sc.textFile("s3a://path/to/your/datafile.txt").take(1)
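Writing results back out through the same gateway works the same way; the paths below are placeholders, and the snippet simply continues the session created above.

# Continue the session from above: read a file and write the result back via s3a.
df = spark.read.text("s3a://path/to/your/datafile.txt")
df.write.mode("overwrite").parquet("s3a://path/to/your/output/")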
Now that we have a working Zookeeper charm, our next step is to circle back and put more cycles into the Spark charm: decouple the node types, and add a relation to Zookeeper to get Spark HA master functionality and the shuffle service + shuffle storage working.
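For context on what the Zookeeper relation should eventually buy us: with spark.deploy.recoveryMode=ZOOKEEPER configured on the masters, a driver can list every candidate master in its master URL and Spark follows whichever one is currently elected. None of this is wired up in the charms yet; the snippet below is just standard Spark standalone HA usage with placeholder hostnames.

from pyspark import SparkConf

# Standard Spark standalone HA: list all candidate masters and the driver
# will register with whichever master ZooKeeper has elected as leader.
conf = SparkConf()\
    .setAppName('ha_example')\
    .setMaster('spark://<master-1>:7077,<master-2>:7077')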
Insights, comments, pull requests welcome!