Join us for another exciting community workshop on Juju and related projects!
This time @mmrezaie will take us through an interesting deployment of the Slurm charms to build an HPC cluster with Juju and LXD.
Meeting link: meet.google.com/ceh-zber-jnf
Date: Fri 2021-12-17 09:00 – 10:00 (UTC)
Abstract
In this workshop we will deploy the Slurm charms on an LXD cloud whose hardware is a mix of CPU and GPU nodes. We will use different LXC profiles and co-locate a few LXC containers on each physical node to increase the cluster's hardware utilization. Then we will launch CPU and GPU jobs with Slurm on those containers and see how they run.
Slurm is used by more than 50% of the TOP500 compute clusters in the world and is the de facto workload scheduler for large compute clusters in both industry and academia.
The main challenges:
- Deploying Slurm Charms on an HPC Cluster with Heterogeneous Hardware
- Addressing Under-utilization of the HPC Cluster (especially with heterogeneous hardware, e.g., mixed GPU and CPU nodes)
- Using Different LXC Profiles within One Juju Model
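To make the co-location idea concrete, here is a minimal Python sketch of how two profiles and a per-node container plan might look. It is not the setup used in the workshop: the profile names (`slurm-cpu`, `slurm-gpu`), the resource sizes, and the `plan_containers` helper are all assumptions made for illustration. Only the `limits.cpu` and `limits.memory` config keys and the `gpu` device type come from LXD itself.

```python
import json

# Hypothetical profiles. The names and sizes are assumptions; the keys
# limits.cpu / limits.memory and the "gpu" device type are LXD options.
CPU_PROFILE = {
    "name": "slurm-cpu",
    "config": {"limits.cpu": "8", "limits.memory": "32GiB"},
    "devices": {},
}
GPU_PROFILE = {
    "name": "slurm-gpu",
    "config": {"limits.cpu": "4", "limits.memory": "16GiB"},
    # Pinning a specific GPU to a container is left out of this sketch.
    "devices": {"gpu0": {"type": "gpu"}},
}


def plan_containers(node_cores, node_gpus):
    """Return the profile names of containers to co-locate on one node.

    Assumed policy: one GPU container per physical GPU (as far as cores
    allow), then fill the remaining cores with CPU containers.
    """
    gpu_cores = int(GPU_PROFILE["config"]["limits.cpu"])
    cpu_cores = int(CPU_PROFILE["config"]["limits.cpu"])
    n_gpu = min(node_gpus, node_cores // gpu_cores)
    remaining_cores = node_cores - n_gpu * gpu_cores
    n_cpu = remaining_cores // cpu_cores
    return [GPU_PROFILE["name"]] * n_gpu + [CPU_PROFILE["name"]] * n_cpu


if __name__ == "__main__":
    # Show one profile definition and the plan for a 32-core, 2-GPU node.
    print(json.dumps(GPU_PROFILE, indent=2))
    print(plan_containers(node_cores=32, node_gpus=2))
```

With these assumed sizes, a 32-core node with two GPUs would host two GPU containers and three CPU containers, while a CPU-only node would be filled with CPU containers alone.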
Discussion
We can discuss how to further improve this solution and make it scalable. For example: how can we mix different LXC profiles in one Juju model? Can LXC profiles contain conditions?
Help out
Please use your network and connections to invite people, or post invitations in communities related to HPC, Slurm, or LXD. Help us grow the Juju community!
About @mmrezaie
@mmrezaie is a technical expert and PhD-level researcher focusing on diverse SLOs and heterogeneous hardware in HPC clusters.