Description
The sampleMovieLensMPS sample runs correctly when I launch it directly on a compute node (results below). I’m trying to get it to work through Slurm with MPS, submitted from the head node, which does not have a GPU.
[root@node001 bin]# ./sample_movielens_mps_debug -b 2 -p 2
&&&& RUNNING TensorRT.sample_movielens_mps # ./sample_movielens_mps_debug -b 2 -p 2
[03/14/2020-16:20:04] [I] ../../../data/movielens/movielens_ratings.txt
[03/14/2020-16:20:05] [I] Begin parsing model...
[03/14/2020-16:20:05] [I] End parsing model...
[03/14/2020-16:20:06] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[03/14/2020-16:20:06] [I] End building engine...
[03/14/2020-16:20:08] [I] Done execution in process: 234912 . Duration : 363.04 microseconds.
[03/14/2020-16:20:08] [I] Num of users : 2
[03/14/2020-16:20:08] [I] Num of Movies : 100
[03/14/2020-16:20:08] [I] | PID : 234912 | User : 0 | Expected Item : 128 | Predicted Item : 128 |
[03/14/2020-16:20:08] [I] | PID : 234912 | User : 1 | Expected Item : 133 | Predicted Item : 133 |
[03/14/2020-16:20:08] [I] Done execution in process: 234913 . Duration : 285.216 microseconds.
[03/14/2020-16:20:08] [I] Num of users : 2
[03/14/2020-16:20:08] [I] Num of Movies : 100
[03/14/2020-16:20:08] [I] | PID : 234913 | User : 0 | Expected Item : 128 | Predicted Item : 128 |
[03/14/2020-16:20:08] [I] | PID : 234913 | User : 1 | Expected Item : 133 | Predicted Item : 133 |
[03/14/2020-16:20:08] [I] Number of processes executed : 2. Total MPS Run Duration : 2023.86 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps # ./sample_movielens_mps_debug -b 2 -p 2
Environment
TensorRT Version: 6.0.1.5
GPU Type: Nvidia Tesla V100 32GB
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: V10.1.243, CUDNN 7.5.1
Operating System + Version: CentOS 7.7 on Bright Cluster
Python Version (if applicable): 3.6
Relevant Files
Here are the contents of an sbatch file I have:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3 cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
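Since the job is meant for node001 only, one optional safeguard is to have the script report which node it actually landed on before it touches the GPU. The sketch below is not part of the original script: `check_node`, `EXPECTED_NODE`, and `current_node` are hypothetical names, and it assumes Slurm exports `SLURMD_NODENAME` inside the job, falling back to `hostname -s` otherwise.

```shell
#!/bin/sh
# Hypothetical guard, not part of the original batch script.
EXPECTED_NODE=node001

# check_node ACTUAL EXPECTED: succeeds only when the two node names match.
check_node() {
    if [ "$1" = "$2" ]; then
        echo "running on $1"
    else
        echo "expected $2 but running on $1" >&2
        return 1
    fi
}

# SLURMD_NODENAME is normally set by Slurm inside a job; otherwise use the short hostname.
current_node=${SLURMD_NODENAME:-$(hostname -s)}

# In the real batch script the right-hand side would more likely be "exit 1".
check_node "$current_node" "$EXPECTED_NODE" || echo "skipping GPU work on $current_node" >&2
```

Placed near the top of the script, a check like this would make a job that starts on the wrong node visible in the output file right away.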
Steps To Reproduce
I’m trying to use srun to test this, but it always fails as it appears to be trying all nodes. We only have 3 compute nodes. As I’m writing this, node002 and node003 are in use by other users, so I just want to use node001.
srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   67C    P0   241W / 250W |  32167MiB / 32510MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    428996      C   python3.6                                  32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1
So is my syntax wrong with srun? MPS is running:
$ ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:54 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d
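One thing worth checking in the srun command above is option order. srun reads its own options only up to the executable name, and everything after that is passed to the executable as arguments. In the command as written, `--nodelist=node001` and the other flags follow the script path, which would be consistent with the task ending up on node002. Below is a minimal sketch with the options moved in front, using the script path and flag values from the post. The variable names are illustrative only, and `-Z` is left out as an assumption, since it asks srun to run without creating an allocation.

```shell
#!/bin/sh
# Script path and option values are taken from the post; variable names are illustrative.
script=/home/mydir/mpsmovietest
srun_opts="--nodes=1 --nodelist=node001 --gres=gpu:1 --job-name=MPSMovieTest --output=mpstest.out"

# srun options come first, the executable comes last.
srun_cmd="srun $srun_opts $script"

# Print the command for review instead of running it.
echo "$srun_cmd"
```

The `#SBATCH` lines in the script are read by sbatch and not by srun, so submitting with `sbatch /home/mydir/mpsmovietest` may be another way to have those directives applied.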