GTC 2020: Multi-Node GPU Workloads with Unprivileged Containers on Slurm

GTC 2020 S21392
Presenters: Felix Abecassis, NVIDIA; Jonathan Calmels, NVIDIA
We’ll present the challenges of running distributed deep-learning training at scale on shared heterogeneous infrastructure. At NVIDIA, we use containers extensively in our GPU clusters for both HPC and deep-learning applications. We love how containers simplify software packaging and enable reproducibility without sacrificing performance. While it’s possible to enable container workflows by granting users access to the Docker daemon, the security impact needs to be carefully considered. Relying on Docker as the container runtime also requires a large amount of complicated boilerplate code to start multi-node jobs that use the Message Passing Interface (MPI) for communication. We’ll introduce a new lightweight container runtime inspired by LXC (Linux Containers) and an associated Slurm plugin. Together, these two open-source projects enable a more secure architecture for our clusters and a smoother user experience with containers on multi-node clusters.

