[SOLVED] Issues detecting CUDA-capable device on some nodes...

It appears that on some of my Nvidia Jetson nodes, CUDA does not detect a CUDA-capable device.

I first saw it when I tried to run this MPI/CUDA code:
https://www.pdc.kth.se/resources/software/installed-software/mpi-libraries/cuda-and-mpi

When I run it on a node that doesn’t have issues, I get this:

mpinode@tegra102:~/HelloMPI$ mpirun -n 2 ./cuda-mpi
tegra102          0   1  0:GK20A
tegra102          1   1  0:GK20A

Running it on a node with issues, I get this:

mpinode@tegra120:~/HelloMPI$ mpirun -n 2 cuda-mpi
tegra120          0  91  0:?S  1:?S  2:?S  3:?S  4:?S  5:?S  6:?S  7:?S  8:?S  9:?S  10:?S  11:?S  12:?S  13:?S  14:?S  15:?S  16:?S  17:?S  18:?S  19:?S  20:?S  21:?S  22:?S  23:?S  24:?S  25:?S  26:?S  27:?S  28:?S  29:?S  3tegra120          1  91  0:?S  1:?S  2:?S  3:?S  4:?S  5:?S  6:?S  7:?S  8:?S  9:?S  10:?S  11:?S  12:?S  13:?S  14:?S  15:?S  16:?S  17:?S  18:?S  19:?S  20:?S  21:?S  22:?S  23:?S  24:?S  25:?S  26:?S  27:?S  28:?S  29:?S  3
tegra120          1  91  0:?S  1:?S  2:?S  3:?S  4:?S  5:?S  6:?S  7:?S  8:?S  9:?S  10:?S  11:?S  12:?S  13:?S  14:?S  15:?S  16:?S  17:?S  18:?S  19:?S  20:?S  21:?S  22:?S  23:?S  24:?S  25:?S  26:?S  27:?S  28:?S  29:?S  3
[tegra120:07163] *** Process received signal ***
[tegra120:07164] *** Process received signal ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7164 on node tegra120 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
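
For reference, the linked example essentially has each MPI rank query the CUDA runtime and print the devices it can see. Here is a rough sketch of that kind of per-rank device query (my own approximation, not the actual PDC source):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Each rank prints: hostname, rank, device count, then "index:name" for
 * every visible device. Build with nvcc plus your MPI include/library
 * paths (e.g. linking against -lmpi); exact flags depend on your MPI install. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess)
        count = 0;                      /* no usable devices on this node */

    printf("%-16s %3d %3d", host, rank, count);
    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("  %d:%s", i, prop.name);
    }
    if (err != cudaSuccess)
        printf("  (cudaGetDeviceCount: %s)", cudaGetErrorString(err));
    printf("\n");

    MPI_Finalize();
    return 0;
}

On the good node, the "0:GK20A" is simply device 0 reporting its name (GK20A is the Tegra K1 GPU).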

Also, when running the CUDA samples, a message is returned about not detecting a CUDA-capable device:

mpinode@tegra120:~/NVIDIA_CUDA-6.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
mpinode@tegra120:~/NVIDIA_CUDA-6.0_Samples/0_Simple/vectorAdd$ ./vectorAdd 
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
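
For what it’s worth, error code 38 from cudaGetDeviceCount is cudaErrorNoDevice, i.e. the same "no CUDA-capable device is detected" message. A minimal standalone check (my own sketch, not one of the CUDA samples) is:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        /* On the problem nodes this prints 38 ->
           "no CUDA-capable device is detected". */
        printf("cudaGetDeviceCount returned %d -> %s\n",
               (int)err, cudaGetErrorString(err));
        return 1;
    }
    printf("Detected %d CUDA device(s)\n", count);
    return 0;
}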

All nodes were configured the same way. On each node (which came with a pre-installed OS) I installed CUDA and MPI and added the right exports to .bashrc.

Does anyone know how I can fix this issue of failing to detect a CUDA-capable device? About 7 of my 25 devices behave like this.

Did you check that all the nodes had the same OS version installed? I recommend flashing the latest OS to all of them to be sure.

They do; it’s mostly what came pre-installed.
Here is a working node:

Linux tegra104 3.10.24-g6a2d13a #1 SMP PREEMPT Fri Apr 18 15:56:45 PDT 2014 armv7l armv7l armv7l GNU/Linux

and a node with issues:

Linux tegra101 3.10.24-g6a2d13a #1 SMP PREEMPT Fri Apr 18 15:56:45 PDT 2014 armv7l armv7l armv7l GNU/Linux

I also made sure that the users on each of those nodes are in the video group, so that is not the cause of the problem.

Yes, I guess if I don’t find any other way to solve this, I will have to flash them.

That is an R19.x kernel version, which implies the CUDA version must be 6.0. If version 6.5 is mixed in, you will get a failure.
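
If you want to double-check which runtime and driver a node actually reports, here is a minimal sketch (assuming the CUDA 6.0 toolkit is installed on the node; versions are encoded as 1000*major + 10*minor, so 6.0 shows as 6000 and 6.5 as 6050):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int driver = 0, runtime = 0;
    /* cudaRuntimeGetVersion reports the runtime linked into the binary;
       cudaDriverGetVersion reports the installed driver (0 if none is found). */
    cudaDriverGetVersion(&driver);
    cudaRuntimeGetVersion(&runtime);
    printf("CUDA driver version:  %d\n", driver);
    printf("CUDA runtime version: %d\n", runtime);
    return 0;
}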

I doubt this is the issue, but it probably warrants a comment:
There are cases where remote access via something like "ssh -Y" causes any OpenGL (including ES) display code to run on the display machine rather than on the remote machine; the idea is to run applications remotely but display them on your local machine. CUDA can become confused in this case and inadvertently attempt to offload CUDA code to the display machine, under the false impression that all GPU work is display work. When this happens, CUDA will actually offload to the display machine if it has the right CUDA version installed, but will fail in the same manner as a failed graphics offload if the display machine does not have what is needed for the work. So in the case of remote graphics display, look closely at the logs and check whether it is trying to offload GPU work instead of OpenGL/ES.

All nodes have the R19.x kernel and CUDA 6.0.

What I also noticed is that the nodes with the issue do not have the file nv_tegra_release in the /etc directory.

Should I try to reinstall CUDA, or would CUDA not have anything to do with this file?

I believe I just solved my problem: I hadn’t run NVIDIA-INSTALLER on those nodes.