TensorFlow under sbatch + srun is much slower than under srun or sbatch alone

Hi, I’m using a setup like this:

#!/bin/bash
#SBATCH -J testrun
#SBATCH --gres=gpu:1
#SBATCH --gpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=10000

srun --ntasks=1 -J shell --mem-per-cpu=10000 --time=08:00:00 --gres=gpu:1 python3 train_gpu.py &

in a .sh file that I submit with Slurm on our HPC cluster.
When I run srun … python3 train_gpu.py directly, it works fine. But when I run it inside an sbatch job (which I need for several reasons; one is that I need many of these jobs running in parallel at the same time, all connecting to a database that is started automatically by the sbatch script), the results are far worse: the GPU is barely used.
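
(To be clear about what I mean by many of them in parallel: I submit the same batch script a number of times, roughly like below; run_job.sh and the count are just placeholders for my actual setup.)

# Submit several independent copies of the batch job
for i in $(seq 1 10); do
    sbatch run_job.sh
done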

Here you can see the GPU utilization when running inside an sbatch job:

(Since I’m a new user here, I cannot upload two images. But the results of running it directly via srun … without sbatch are much better, with GPU usage always at least 20%.)

Why does this happen? And how could I debug this?

What I do now is repeatedly run nvidia-smi (which is where the graph data comes from) and check whether a process has an open file handle to /dev/nvidia-… in /proc/$$/fd, but this is a very limited way of debugging.
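
Concretely, the polling is roughly a loop like the one below (the 5-second interval, the log file name and the pgrep pattern are just my choices; utilization.gpu and memory.used are standard nvidia-smi query fields):

# Log GPU utilization/memory every 5 seconds and check whether the
# training process has any /dev/nvidia* device node open.
while true; do
    nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
               --format=csv,noheader >> gpu_usage.log
    for pid in $(pgrep -f train_gpu.py); do
        echo "PID $pid open nvidia fds: $(ls -l /proc/$pid/fd 2>/dev/null | grep -c nvidia)"
    done
    sleep 5
done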

I have no root access on the machines, and I cannot install packages except in my home directory.

We use Slurm 20.02.2 on Linux kernel 3.10 (a Red Hat derivative), and this appears to happen on both x86_64 and powerpc64le. The runtime also increases a lot, from about 30 minutes to about 4 hours.

Is there anything I can do to find out why this happens?

You need to use --exclusive for sbatch.
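
For example, that would mean adding the flag to the batch script's directives, along these lines (a simplified sketch: the database startup and anything else in the real script are omitted, and the wait at the end is only there to keep the batch job alive while the backgrounded step runs):

#!/bin/bash
#SBATCH -J testrun
#SBATCH --exclusive              # do not share the allocated node(s) with other jobs
#SBATCH --gres=gpu:1
#SBATCH --gpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=10000

# (database startup and the rest of the real script omitted)
srun --ntasks=1 -J shell --mem-per-cpu=10000 --time=08:00:00 --gres=gpu:1 python3 train_gpu.py &
wait    # keep the job alive until the backgrounded training step finishes

Presumably the point is that without --exclusive other jobs can be scheduled onto the same node and compete for CPU cores and host memory, which would starve the input pipeline feeding the GPU and match the symptoms described (low GPU utilization, much longer runtime).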