CUDA-aware MPI fails

I have put together a cluster for running CUDA-aware MPI programs. After installing the CUDA 6.0 toolkit, I compiled and installed OpenMPI with CUDA-aware support enabled.

However, when I try to run my MPI matrix multiplication code, I get this error:

mpinode@tegra100:~/HelloMPI$ mpiexec -np 16 --hostfile myhostsmall --map-by core --mca mpi_cuda_support 1 mpi_mm 1024
--------------------------------------------------------------------------
The library attempted to open the following supporting CUDA libraries, 
but each of them failed.  CUDA-aware support is disabled.
libcuda.so.1: cannot open shared object file: No such file or directory
libcuda.so.1: cannot open shared object file: No such file or directory
If you are not interested in CUDA-aware support, then run with 
--mca mpi_cuda_support 0 to suppress this message.  If you are interested
in CUDA-aware support, then try setting LD_LIBRARY_PATH to the location
of libcuda.so.1 to get passed this issue.
--------------------------------------------------------------------------
[tegra100:07000] 3 more processes have sent help message help-mpi-common-cuda.txt / dlopen failed
[tegra100:07000] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Time is 11.286517

I can’t seem to fix this issue. I added an export to my .bashrc for the directory where libcuda.so.1 is located.

Here is my .bashrc

# TimeFormat Settings
TIMEFORMAT='%3R'

# CUDA 6.0
export PATH=/usr/local/cuda-6.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/arm-linux-gnueabihf/tegra:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.0/armv7-linux-gnueabihf/lib:$LD_LIBRARY_PATH
# CUPTI
export PATH=/usr/local/cuda-6.0/extras/CUPTI:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.0/extras/CUPTI/lib:$LD_LIBRARY_PATH

# OpenMPI
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/openmpi:$LD_LIBRARY_PATH

My LD_LIBRARY_PATH:

mpinode@tegra100:~/HelloMPI$ echo $LD_LIBRARY_PATH
/usr/local/lib/openmpi:/usr/local/lib:/usr/local/cuda-6.0/extras/CUPTI/lib:/usr/local/cuda-6.0/armv7-linux-gnueabihf/lib:/usr/lib/arm-linux-gnueabihf/tegra:/usr/local/cuda-6.0/lib:

  • Where is your actual libcuda.so.1 file located?
  • Are libcuda.so.1 and its directory set with permissions that the executable you run can access?
  • If you cd to the libcuda.so.1 directory, what is the output of "ldd libcuda.so.1"? Similarly, run ldd on your executable and show what it says.
  • Is this CUDA 6.0 running on the R19.x version of L4T, or on the newer R21.1?
  • Was the application compiled natively on the Jetson?
  • If you cd to /etc/ld.so.conf.d/ and cat the files (cat *.conf), are libcuda.so.1 and everything that "ldd libcuda.so.1" shows located in one of those directories? (A quick sketch gathering these checks follows this list.)
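
A rough sketch of the commands behind these questions (bash; the mpi_mm name and HelloMPI path are taken from your mpiexec line and prompt, so adjust them as needed):

# collect the basics in one pass -- adjust paths and names for your setup
ldd /path/to/libcuda.so.1          # the library side: plug in the location from the first question
ldd ~/HelloMPI/mpi_mm              # the application side: this is the part not shown anywhere yet
cat /etc/ld.so.conf.d/*.conf       # directories the dynamic linker searches by default
head -n 1 /etc/nv_tegra_release    # L4T release string, if that file exists on your image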

Thank you for your reply.

  1. Location of libcuda.so.1:
/usr/lib/arm-linux-gnueabihf/tegra/libcuda.so.1
  2. Permissions:
mpinode@tegra100:/usr/lib/arm-linux-gnueabihf$ ls -la tegra
drwxr-xr-x  2 root root     4096 Nov  1 02:44 .
drwxr-xr-x 96 root root    69632 Nov  1 04:23 ..
-rw-r--r--  1 root root       35 Nov  1 02:44 ld.so.conf
lrwxrwxrwx  1 root root       14 Aug 29 00:36 libcuda.so -> libcuda.so.1.1
lrwxrwxrwx  1 root root       14 Nov  1 02:44 libcuda.so.1 -> libcuda.so.1.1
  3. ldd of libcuda.so.1:
mpinode@tegra100:/usr/lib/arm-linux-gnueabihf/tegra$ ldd libcuda.so.1
	libdl.so.2 => /lib/arm-linux-gnueabihf/libdl.so.2 (0xb5bdd000)
	libm.so.6 => /lib/arm-linux-gnueabihf/libm.so.6 (0xb5b71000)
	libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0xb5b56000)
	librt.so.1 => /lib/arm-linux-gnueabihf/librt.so.1 (0xb5b48000)
	libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xb5a61000)
	/lib/ld-linux-armhf.so.3 (0xb6703000)
  4. I am running CUDA 6.0 on L4T version R19.x. I haven’t tried R21.1 just yet.
  5. I compiled both OpenMPI and the MPI matrix multiplication program on L4T version R19.x.
  6. I guess the directory where libcuda.so.1 is located is already listed in one of the ld config files, so the extra exports I added to my .bashrc probably didn’t do anything useful. (A quick way to confirm this is sketched after the output below.)
mpinode@tegra100:/etc/ld.so.conf.d$ cat *.conf
# Multiarch support
/lib/arm-linux-gnueabihf
/usr/lib/arm-linux-gnueabihf
/usr/lib/arm-linux-gnueabihf/tegra-egl
/usr/lib/arm-linux-gnueabihf/tegra
/usr/lib/arm-linux-gnueabihf/libfakeroot
# libc default configuration
/usr/local/lib
/usr/lib/arm-linux-gnueabihf/tegra
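
One quick way to confirm that guess (assuming ldconfig is on the PATH) is to ask the linker cache directly:

ldconfig -p | grep libcuda

If the tegra directory shows up in that output, the extra LD_LIBRARY_PATH exports in .bashrc really are redundant for this library.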

Exports of PATH only affect where executables are found and do not change library linking. The ldd of the library itself shows its dependencies are resolved correctly. I don’t see ldd of your OpenMPI application though…which could be important.

For each exported LD_LIBRARY_PATH entry, I would check whether it is already listed in /etc/ld.so.conf.d/* and, if so, remove the explicit export…the search order might get in the way if other library versions are mixed in. Then, with those extra paths removed, find out what ldd shows on the actual executable…see which libraries it is resolving. The libraries themselves resolve their own dependencies, which looks ok.
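
A rough sketch of that check, assuming bash (a starting point rather than a polished script):

# For each LD_LIBRARY_PATH entry, report whether /etc/ld.so.conf.d/ already covers it.
IFS=':' read -ra dirs <<< "$LD_LIBRARY_PATH"
for d in "${dirs[@]}"; do
    [ -z "$d" ] && continue                          # skip the empty entry left by a trailing ':'
    if grep -qxF "$d" /etc/ld.so.conf.d/*.conf; then
        echo "already covered by ld.so.conf.d (export can likely go): $d"
    else
        echo "only reachable via LD_LIBRARY_PATH:                     $d"
    fi
done

The entries in the first group are the ones I would remove from .bashrc before re-running ldd on the executable.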

One other possible issue…are you doing remote display, or is the display native on the Jetson? There is a bug related to remote display which might be a problem.