Mesos or Slurm or.. for job scheduling

Beco · January 12, 2016, 12:41pm

At my workplace we have just built a DevBox with 4 Titan X GPUs. Several of us will be using this machine, and we are wondering what the best way to share access to the GPUs and schedule jobs would be.

Since we are running Mesos+Marathon on the cluster where we will deploy the machine, I guess we could also share access to the DevBox’s GPUs via Mesos. From [1] and [2] I understand that the newest version of Mesos now has built-in support for NVIDIA GPUs. However, I cannot find anything concrete about this in the Mesos documentation, and according to [2] it seems this is only for Tesla GPUs. So… has anyone successfully used Mesos as a job scheduler with a DevBox? Would you recommend it?

We were also thinking of running our jobs in Docker containers. I have seen that NVIDIA provides utilities to build and run NVIDIA Docker images in [3]. Would I need to change anything in Marathon to launch these Docker images?

I have also seen that NVIDIA recommends several cluster management tools. Of these, Slurm looks quite good and it is open source, so I wonder if it would be easier and/or better to use Slurm instead of Mesos+Marathon (e.g., in terms of scheduling options). Any experience?

Any other options that would enable a small team to share the DevBox’s GPUs in an effective and painless way?

Cheers,
Humberto

[1] We're working with NVIDIA to bring GPUs and deep learning to the DCOS | D2iQ
[2] http://www.nvidia.com/object/apache-mesos
[3] GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs

Kerzmann · January 14, 2016, 12:23pm

We have pretty much the same setup, featuring two DevBoxes, and have literally the exact same questions and problems. We’ve also been trying to decide between Slurm and Mesos, but Slurm has no Docker integration (for now), while Mesos’ documentation of GPU support is nowhere to be found.

Really looking forward to some insightful replies!

Best,
Robert

Amit_Kumar1 · February 3, 2016, 7:01pm

Hi Folks, I am keen on getting an answer to this question as well.

Looking forward to expert advice…

Regards,
Amit

stream_Y · June 13, 2016, 9:48am

Hi Folks,

I’m also searching for a scheduler + container solution like this.
I found that Docker has no official Slurm integration yet. (If there is one, please ignore this.)
Is there any third-party implementation that combines Slurm and Docker right now?
Or does Marathon + Mesos support MPI?

Looking forward to any advice!
Thx!

Regards,
Chace
