[workshop] Building HPC clusters with LXD, Slurm & GPUs

Join us for another exciting community workshop on Juju and related projects!

This time @mmrezaie will take us through an interesting deployment of the Slurm charms to build an HPC cluster with Juju and LXD.

Meeting link: meet.google.com/ceh-zber-jnf

Date: Fri 2021-12-17 09:00 – 10:00 (UTC)

Abstract

In this workshop we will deploy the Slurm charms on an LXD cloud whose hardware is a mix of CPU and GPU nodes. We will use different LXC profiles and collocate a few LXC containers per physical node to increase the cluster's hardware utilization. Then we will launch CPU and GPU jobs using Slurm on those containers and see how they run.

Slurm is used by more than 50% of all TOP500 compute clusters in the world and is the de facto workload scheduler for large compute clusters in both industry and academia.
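
As a rough illustration of what launching CPU and GPU jobs with Slurm can look like, here is a minimal sketch that generates two batch scripts. The partition names `cpu` and `gpu`, the job names, and the helper `sbatch_script` are hypothetical and not taken from the workshop. The `--gres=gpu:N` request only works if GPUs are configured as generic resources on the cluster.

```python
def sbatch_script(job_name, command, partition, gpus=0, cpus_per_task=1):
    """Return the text of a Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",  # partition names are assumptions
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
    ]
    if gpus > 0:
        # Request GPUs as a generic resource; requires GRES to be configured.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


# One CPU-only job and one job that asks for a single GPU.
cpu_job = sbatch_script("cpu-test", "hostname", partition="cpu", cpus_per_task=4)
gpu_job = sbatch_script("gpu-test", "nvidia-smi", partition="gpu", gpus=1)

print(cpu_job)
print(gpu_job)
```

Each string can be saved to a file and submitted with `sbatch <file>`; `squeue` then shows which node (here, which container) the job landed on.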

The main challenges:

  • Deploying the Slurm charms on an HPC cluster with heterogeneous hardware
  • Addressing under-utilization of the HPC cluster (especially with heterogeneous hardware, e.g., mixed GPU and CPU nodes)
  • Using different LXC profiles within one Juju model
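
To make the profile challenge concrete, here is a minimal sketch of two profiles, one for CPU-only containers and one that passes a GPU through. The profile names, the resource limits, and the helper `lxd_profile` are hypothetical and not the ones used in the workshop. The keys `limits.cpu`, `limits.memory`, `nvidia.runtime` and the `gpu` device type are standard LXD options. The output is JSON, which is also valid YAML.

```python
import json


def lxd_profile(description, cpus, memory, gpu=False):
    """Build an LXD profile as a plain dict (hypothetical helper)."""
    profile = {
        "description": description,
        "config": {
            "limits.cpu": str(cpus),   # number of CPUs exposed to the container
            "limits.memory": memory,   # e.g. "16GiB"
        },
        "devices": {},
    }
    if gpu:
        # Pass host GPUs through and enable the NVIDIA runtime libraries.
        profile["config"]["nvidia.runtime"] = "true"
        profile["devices"]["gpu0"] = {"type": "gpu"}
    return profile


# The limits below are illustrative assumptions, not recommendations.
profiles = {
    "slurm-cpu": lxd_profile("CPU-only compute container", cpus=8, memory="16GiB"),
    "slurm-gpu": lxd_profile("GPU compute container", cpus=8, memory="32GiB", gpu=True),
}

for name, profile in profiles.items():
    print(f"# profile: {name}")
    print(json.dumps(profile, indent=2))
```

A profile printed this way could be created with `lxc profile create <name>` and loaded with `lxc profile edit <name>` reading from standard input. How to attach different profiles to different units inside one Juju model is exactly the open question for the workshop.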

Discussion

We can discuss how to improve this solution further and make it scalable. For example: how can we mix different LXC profiles in one Juju model? Can LXC profiles have conditions in them?

Help out

Please use your network and connections to invite people, or post invitations in communities related to HPC, Slurm or LXD. Help us grow the Juju community!

About @mmrezaie

@mmrezaie is a technical expert and PhD-level researcher focusing on diverse SLOs and heterogeneous hardware on HPC clusters.

I’ll be there!

So will I!

Very interesting and definitely a use case for us. Count me in!

Is there a recording of this workshop or any DIY material from it?

Hi @lhcwur, I am not aware of any, but it’s worth double-checking: @mmrezaie?

I would speak to @jamesbeedy @Heitor @mmrezaie about this as they have tons of material for you.

You can build your own Slurm cluster using Juju. We have online documentation on how to do that: https://omnivector-solutions.github.io/osd-documentation/master/

Thanks, that is very helpful.

Cheers
