Making Data Fabric compete in the market

Happy 2024 folks, hopefully you’ve all made it through the new year.

As some of the Data Fabric folks know, I’ve been playing around with it since it was relaunched a little while ago.

I like the Spark implementation. As I said on my little vlog thing the other week, it feels like there should be an easy way for developers to hook into it within the Kubernetes cluster (I know there are some docs on this). Of course, folks who might be evaluating this as a more open alternative to Databricks would want a notebook-type environment. I looked at the Jupyter charms from the AI/ML side of the house, but the bundle at least hasn’t been updated in a while. I assume they’re still supported?

The other thing I’d be interested in is plugging Minio and Nessie into the backend of the Spark stack for storage management, along with the Iceberg jars. I’m happy to work on this; I just wanted to double-check that I wouldn’t be redoing work that’s already going on somewhere or roadmapped, etc.?
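
For the record, here’s roughly the kind of Spark session config I have in mind for that Minio + Nessie + Iceberg combination. It’s only a sketch: the Nessie/Minio endpoints, credentials, bucket and table names are placeholders for whatever happens to be running in the cluster, and it assumes the Iceberg and Nessie runtime jars are already on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Endpoints, credentials and bucket names below are placeholders for a local setup;
# the Iceberg and Nessie jars are assumed to already be on the Spark classpath.
spark = (
    SparkSession.builder.appName("iceberg-nessie-minio-sketch")
    # Iceberg SQL extensions plus the Nessie session extensions
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    # Register an Iceberg catalog backed by Nessie
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie.nessie.svc:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    # Point the s3a filesystem at the in-cluster Minio service
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.minio-operator.svc:9000")
    .config("spark.hadoop.fs.s3a.access.key", "<minio-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<minio-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Quick smoke test: create an Iceberg table in the Nessie catalog and write a row
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.demo.events (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO nessie.demo.events VALUES (1, 'hello-lakehouse')")
spark.sql("SELECT * FROM nessie.demo.events").show()
```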

Also, I’m curious about the likelihood of a Flink bundle for Data Fabric? It feels like we need to implement some facets of the Data Lakehouse setup and streaming data platforms. Obviously there’s already Kafka, but being able to stream that into a Spark Lakehouse using Spark Structured Streaming and Iceberg (or Hudi, etc.) for storage will be a requirement for businesses.
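
To make that streaming requirement concrete, something like the following Structured Streaming sketch is what I have in mind. Again it’s only illustrative: the Kafka bootstrap address, topic and table names are placeholders, and it assumes a session already configured with an Iceberg catalog (as in the earlier sketch) plus the spark-sql-kafka connector on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Placeholders throughout: Kafka bootstrap servers, topic, checkpoint bucket and the
# target Iceberg table (assumed to already exist in a catalog named "nessie").
spark = SparkSession.builder.appName("kafka-to-iceberg-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.kafka.svc:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
    # Kafka delivers key/value as binary; cast to strings for this simple example
    .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
)

query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/checkpoints/events")
    .toTable("nessie.demo.raw_events")
)

query.awaitTermination()
```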

LMK!

I have another query about how all this is supposed to hang together just from an initial testing standpoint.

So, I have a MicroK8s cluster with Minio deployed in it, and I’d quite like to store my data there from Spark. Clearly, when launching my Spark session I could dig up those details and plug them in manually, write to Minio, and everything would work (see the sketch below). But we’re in the world of automation. We seem to be in a strange paradox: we install the Spark tooling as a snap with Kubernetes kicking around behind it, but then to deploy Minio or whatever on the backend we’ve got Juju, and the two don’t talk to each other.
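
For illustration, the manual “plug the details in” approach looks roughly like this; the endpoint, credentials and bucket name are placeholders for whatever Minio exposes in the cluster:

```python
from pyspark.sql import SparkSession

# Everything hard-coded here is exactly the stuff I'd rather have wired up for me:
# Minio endpoint, credentials and the target bucket are all placeholders.
spark = (
    SparkSession.builder.appName("manual-minio-sketch")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.minio-operator.svc:9000")
    .config("spark.hadoop.fs.s3a.access.key", "<minio-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<minio-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Write a trivial DataFrame straight to a Minio bucket to prove the wiring works
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.write.mode("overwrite").parquet("s3a://test-bucket/smoke-test/")
```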

Hypothetically, I want to be able to connect Minio to Spark and have Spark just know it exists so that I can write to it; otherwise I’m off to Databricks, where I can chuck it in a cloud and hook it up to my buckets. How do we make this easier?

Hi Tom!

First of all, happy 2024 to you as well :wink: ! And many thanks for your feedback and comment!

Indeed, many of the things you mention are extremely spot on and are things we are working on or planning to work on. Let me address them below:

  1. Jupyter charm. Right now we have a first Jupyter integration with Charmed Spark, provided by a Spark image that can be run locally. We are currently in the process of separating this feature into a dedicated image (see PR here) in order to have more minimal and segregated images. This also lays the groundwork for charming it, which is something that will be addressed in the mid-term (within the next few months). However, even with the current image, one can spin up a Jupyter notebook powered by Spark locally. See here for an example of how to set this up easily if you have a MicroK8s cluster running locally. The UX may change slightly (e.g. the image to be used) once the PR lands, but hopefully you get the gist.

  2. Integration with Iceberg jars. We are integrating the Iceberg jars into the Charmed Spark image right now. We have a task in the current sprint backlog; you can see the Jira epic here.

  3. Object storage integration (and integrations in general). We are currently designing a charm, which should land in the next few months, to centralize integrations. The idea is indeed (as you also suggest) to NOT have the developer set up the S3-compatible backend bindings manually in the configuration (e.g. as suggested here), but rather to encode these via Juju relations. We envision a “Configuration Hub Charm” that relates to other charms (e.g. s3-integrator) and builds the low-level configuration for the user. The spark-client snap then uses these configurations when running a Spark job. Note that this can also work for other sets of configurations, e.g. monitoring, and for integrations in general.

  4. Flink. Unfortunately, Flink is not yet on our roadmap in the short term (say, until April-May this year), but it is certainly on our radar, and I have discussed it with @robgibbon quite a few times.

Once again, many thanks for your feedback. It is really very much appreciated, and it is indeed nice to see that the need you are outlining here is aligned (mostly :wink: ) with the roadmap that we have envisioned.

Best, Enrico

Thanks for the update @deusebio !

Glad to see we’re all on the rightish page. I started on a Nessie charm the other day that does the basics and works alongside the hand-cranked stuff currently, so I’ll run with that for now.

I’ll probably take a look at charming Flink in the not-too-distant future; hopefully we can align somewhere down the road.

If you need folks to test, or to help out with chunks of the Spark integration work, feel free to reach out. I’ve got a few internal use cases for all of this that hopefully we can … spark (sorry) … some interest elsewhere with.
