Slurm not working for MPS and TensorRT MovieLens tutorial

Description

I can run the sampleMovieLensMPS tutorial perfectly when I execute it directly on a compute node; see the results below. Now I’m trying to get it to work through Slurm with MPS, submitting from the head node (which does not have a GPU).

[root@node001 bin]# ./sample_movielens_mps_debug -b 2 -p 2
&&&& RUNNING TensorRT.sample_movielens_mps # ./sample_movielens_mps_debug -b 2 -p 2
[03/14/2020-16:20:04] [I] ../../../data/movielens/movielens_ratings.txt

[03/14/2020-16:20:05] [I] Begin parsing model...
[03/14/2020-16:20:05] [I] End parsing model...

[03/14/2020-16:20:06] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[03/14/2020-16:20:06] [I] End building engine...
[03/14/2020-16:20:08] [I] Done execution in process: 234912 . Duration : 363.04 microseconds.
[03/14/2020-16:20:08] [I] Num of users : 2
[03/14/2020-16:20:08] [I] Num of Movies : 100
[03/14/2020-16:20:08] [I] | PID : 234912 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |
[03/14/2020-16:20:08] [I] | PID : 234912 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |
[03/14/2020-16:20:08] [I] Done execution in process: 234913 . Duration : 285.216 microseconds.
[03/14/2020-16:20:08] [I] Num of users : 2
[03/14/2020-16:20:08] [I] Num of Movies : 100
[03/14/2020-16:20:08] [I] | PID : 234913 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |
[03/14/2020-16:20:08] [I] | PID : 234913 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |
[03/14/2020-16:20:08] [I] Number of processes executed : 2. Total MPS Run Duration : 2023.86 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps # ./sample_movielens_mps_debug -b 2 -p 2

Environment

TensorRT Version: 6.0.1.5
GPU Type: NVIDIA Tesla V100 32GB
NVIDIA Driver Version: 440.33.01
CUDA Version: 10.2
cuDNN Version: V10.1.243, cuDNN 7.5.1
Operating System + Version: CentOS 7.7 on Bright Cluster
Python Version (if applicable): 3.6

Relevant Files

Here are the contents of an sbatch file I have:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
# Pin the job to GPU 0 and show its state before the run
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
# Point MPS at its pipe/log directories and start the control daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm  openmpi/cuda/64 cm-ml-python3deps/3.2.3  cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc

/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
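
For reference, a quick sanity check could be added right after the nvidia-cuda-mps-control line to confirm that the control daemon is actually reachable over that pipe directory; get_server_list just asks the daemon to list any running MPS servers. This is only a debugging aid, not part of the tutorial:

# Uses the same pipe directory exported above.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
# Ask the MPS control daemon to list its servers; any reply (even an empty
# one) means the daemon is up and answering on this pipe directory.
echo get_server_list | nvidia-cuda-mps-control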

Steps To Reproduce

I’m trying to use srun to test this, but it always fails because it appears to be trying other nodes. We only have three compute nodes; as I’m writing this, node002 and node003 are in use by other users, so I only want to use node001.

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest  --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   67C    P0   241W / 250W |  32167MiB / 32510MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    428996      C   python3.6                                  32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
    keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check your CUDA installation:  http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my srun syntax wrong? MPS is running:

$ ps -auwx|grep mps
root     108581  0.0  0.0  12780   812 ?        Ssl  Mar23   0:54 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d
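
One thing I notice re-reading that command: srun only interprets options placed before the executable, and everything after the program name is passed as arguments to the program itself, so --nodelist=node001 (and the other flags) probably never reached srun at all, which would explain why the task landed on node002. Something along these lines is presumably what I meant (same script path as above; I dropped -Z since I’m not sure it is needed):

srun --nodes=1 --nodelist=node001 --gres=gpu:1 \
     --job-name=MPSMovieTest --output=mpstest.out \
     /home/mydir/mpsmovietest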

This isn’t really a TRT issue.
Moving it to the HPC forum so that the HPC team can take a look.

I see, yes, you are correct. When I run srun while no GPU processes are running on node002, the script works. There is just a warning about the MPS log directory, so this is clearly Slurm-related:

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest  --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    420596      C   nvidia-cuda-mps-server                        29MiB |
+-----------------------------------------------------------------------------+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
    keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-10:08:53] [I] Begin parsing model...
[03/16/2020-10:08:53] [I] End parsing model...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99395 . Duration : 315.744 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99395 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |
[03/16/2020-10:09:01] [I] | PID : 99395 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99396 . Duration : 306.944 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99396 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |
[03/16/2020-10:09:01] [I] | PID : 99396 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |
[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run Duration : 4361.73 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
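
The two warnings near the top (“Failed writing log files…” and “An instance of this daemon is already running”) are presumably because the sbatch script starts nvidia-cuda-mps-control again while the root-owned daemon started back in March is still running and owns /tmp/nvidia-log. A possible guard for the sbatch script, assuming the same directories as above, would be to only start the daemon when none is already running:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# Start the MPS control daemon only if one is not already running.
pgrep -f nvidia-cuda-mps-control > /dev/null || nvidia-cuda-mps-control -d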

Bright Cluster’s version of Slurm does not include NVML, so Slurm needs to be compiled with NVML support for this to work.
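
In case it helps anyone else hitting this: a quick way to check whether a given Slurm build has NVML support is to look for the gpu_nvml plugin and its libnvidia-ml linkage. The plugin directory below is an assumption (it depends on how Slurm was packaged; adjust it for wherever Bright installs Slurm):

# Plugin path is an assumption; it is often <slurm-prefix>/lib/slurm
# or /usr/lib64/slurm depending on the packaging.
ls /usr/lib64/slurm/gpu_nvml.so
ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml
# With an NVML-enabled build (Slurm 19.05+), GPUs can then be autodetected
# in gres.conf with:  AutoDetect=nvml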