TensorFlow run via srun inside an sbatch script is much slower than via srun alone

Hi, I’m using a setup like this:

#SBATCH -J testrun
#SBATCH --gres=gpu:1
#SBATCH --gpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=10000
srun --ntasks=1 -J "shell" --mem-per-cpu='10000' --time="08:00:00" --gres=gpu:1 python3 train_gpu.py &

in a .sh file that I submit with Slurm on our HPC cluster.
Running srun … python3 train_gpu.py directly works fine. But when I wrap it in an sbatch script (which I need for several reasons; one is that I need many of these jobs running in parallel, each connecting to a database that the sbatch script starts automatically), performance is far worse: the GPU is barely used.

Here you can see the GPU usage from inside an sbatch job:

(Since I'm a new user here I cannot upload two images, but the results of running directly via srun …, without sbatch, are much better: GPU usage stays consistently above 20%.)

Why does this happen? And how could I debug this?

What I do now is repeatedly run nvidia-smi (which is where the graph data comes from) and check /proc/$$/fd to see whether a process has an open file handle on /dev/nvidia-…, but this is a very limited way of debugging this.
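Concretely, the polling looks roughly like this (a sketch; the field list is just what I happen to log, and nvidia-smi must be on the PATH of the compute node):

```shell
# Sample GPU utilization and memory every 5 seconds; -l repeats the
# query at the given interval. This is where the graph data comes from.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv -l 5 >> gpu_usage.csv

# Separately, check whether a process holds an open handle on a GPU
# device node ($PID here is the training process's PID, filled in by hand):
ls -l /proc/$PID/fd 2>/dev/null | grep /dev/nvidia
```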

I have no root rights on the machines and I cannot install packages except in my home dir.

We use Slurm 20.02.2 on Linux kernel 3.10 (a Red Hat derivative), and this appears to happen on both x86_64 and powerpc64le. The total runtime also grows a lot, from about 30 minutes to about 4 hours.

Is there anything I can do to find out why this happens?