It appears that on some of my Nvidia Jetson nodes, CUDA doesn't detect a CUDA-capable device.
I first noticed it when I tried to run this MPI/CUDA code:
https://www.pdc.kth.se/resources/software/installed-software/mpi-libraries/cuda-and-mpi
When I run it on a node that doesn't have the issue, I get this:
mpinode@tegra102:~/HelloMPI$ mpirun -n 2 ./cuda-mpi
tegra102 0 1 0:GK20A
tegra102 1 1 0:GK20A
Running it on a node with the issue, I get this:
mpinode@tegra120:~/HelloMPI$ mpirun -n 2 cuda-mpi
tegra120 0 91 0:?S 1:?S 2:?S 3:?S 4:?S 5:?S 6:?S 7:?S 8:?S 9:?S 10:?S 11:?S 12:?S 13:?S 14:?S 15:?S 16:?S 17:?S 18:?S 19:?S 20:?S 21:?S 22:?S 23:?S 24:?S 25:?S 26:?S 27:?S 28:?S 29:?S 3tegra120 1 91 0:?S 1:?S 2:?S 3:?S 4:?S 5:?S 6:?S 7:?S 8:?S 9:?S 10:?S 11:?S 12:?S 13:?S 14:?S 15:?S 16:?S 17:?S 18:?S 19:?S 20:?S 21:?S 22:?S 23:?S 24:?S 25:?S 26:?S 27:?S 28:?S 29:?S 3
tegra120 1 91 0:?S 1:?S 2:?S 3:?S 4:?S 5:?S 6:?S 7:?S 8:?S 9:?S 10:?S 11:?S 12:?S 13:?S 14:?S 15:?S 16:?S 17:?S 18:?S 19:?S 20:?S 21:?S 22:?S 23:?S 24:?S 25:?S 26:?S 27:?S 28:?S 29:?S 3
[tegra120:07163] *** Process received signal ***
[tegra120:07164] *** Process received signal ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7164 on node tegra120 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Also, when I run the CUDA samples on the same node, they report that no CUDA-capable device is detected:
mpinode@tegra120:~/NVIDIA_CUDA-6.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
mpinode@tegra120:~/NVIDIA_CUDA-6.0_Samples/0_Simple/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
All nodes were configured the same way. On each node (which came with the OS pre-installed) I installed CUDA and MPI and added the right exports to .bashrc.
Does anyone know how I can fix this failure to detect a CUDA-capable device? About 7 of my 25 devices behave like this.
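In case it helps to reproduce the survey across nodes, here is a rough sketch of how the good and bad nodes could be told apart by running deviceQuery on each one and parsing its output. It assumes passwordless ssh to every node and that deviceQuery sits at the same path as in the session above; the `survey` helper and the host list passed to it are hypothetical. Only the two output lines shown above ("cudaGetDeviceCount returned N" and "Result = ...") are parsed.

```python
import re
import subprocess

# Assumption: deviceQuery was built at this path on every node
# (the path is taken from the session shown in this post).
DEVICE_QUERY = "~/NVIDIA_CUDA-6.0_Samples/1_Utilities/deviceQuery/deviceQuery"


def parse_device_query(output):
    """Return (passed, error_code) from deviceQuery's text output.

    error_code is the number after "cudaGetDeviceCount returned",
    or None if that line is absent (as it should be on a working node).
    """
    match = re.search(r"cudaGetDeviceCount returned (\d+)", output)
    code = int(match.group(1)) if match else None
    passed = "Result = PASS" in output
    return passed, code


def query_node(host):
    """Run deviceQuery on one node over ssh and parse what it prints."""
    proc = subprocess.run(
        ["ssh", host, DEVICE_QUERY],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_device_query(proc.stdout)


def survey(hosts):
    """Print one PASS/FAIL line per host (hypothetical helper)."""
    for host in hosts:
        passed, code = query_node(host)
        status = "PASS" if passed else "FAIL (code %s)" % code
        print(host, status)
```

Calling `survey(["tegra102", "tegra120"])` would print one line per node; the rest of the 25 hostnames would need to be filled in by hand.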