AmgX Compilation Issue

Hi everyone,

I ran into a compilation problem while building the AmgX library on a DLX HPC server, and I would really like to solve it.

The warning message is as follows:
"ld: warning: libcuda.so.1, needed by …/lib/libamgxsh.so, not found."

I then tried to build this library on a local Debian-based GPU server, and no problem was reported.
I then asked the administrator of the HPC server to confirm the symlink.
The result is below:

[root@gnode028 lib]# ls -ld libcuda*
lrwxrwxrwx 1 root root 12 Jun 13 20:02 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Jun 13 20:02 libcuda.so.1 -> libcuda.so.346.46
-rwxr-xr-x 1 root root 13121100 Jun 13 20:02 libcuda.so.346.46

Other information:
1. The server OS: RHEL 6.2
2. CUDA versions 5.5, 6.5, and 7.0 are installed (there are multiple versions because I am only one of many users; I pointed everything to 6.5)
3. The AmgX package I downloaded is the RHEL 6.2 / CUDA 6.5 build
4. The -L switch, environment variables, and PATH are correctly set (checked a million times)
5. I even tried ln -s /usr/local/cuda-6.5/lib64/stubs/libcuda.so libcuda.so.1 (compiling then produces no warning, but at run time I hit driver problems, such as MPI libs not found and an insufficient CUDA driver version)

Since I don't have root access, this has caused me a lot of inconvenience.

Any help would be greatly appreciated.

Thanks very much.

The driver (346.46) should have installed this on RHEL 6.2:

$ ls /usr/lib/libcuda*
/usr/lib/libcuda.so /usr/lib/libcuda.so.1 /usr/lib/libcuda.so.346.46
$ ls /usr/lib64/libcuda*
/usr/lib64/libcuda.so /usr/lib64/libcuda.so.1 /usr/lib64/libcuda.so.346.46
$

The libcuda.so and libcuda.so.1 above are actually symlinks.
Does your system look the same? (You appear to have looked in the lib directory, but the AmgX package is 64-bit and needs the one in the lib64 directory.)
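
As a quick sanity check (just a suggestion; substitute the actual path to your copy of the library), the file utility will report whether it is a 64-bit object:

$ file /path/to/amgx/lib/libamgxsh.so    # placeholder path; use wherever your libamgxsh.so actually lives

It should report something like "ELF 64-bit LSB shared object, x86-64, ...".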

If so, what is the result of

echo $LD_LIBRARY_PATH

on your system (gnode028)?

I think this is the problem:
ls: cannot access /usr/lib64/libcuda*: No such file or directory
Maybe they put them somewhere else that I don't have access to, since it is driver related? That doesn't make much sense, though.

The result of echo $LD_LIBRARY_PATH is:
/share/cluster/RHEL6.2/x86_64/apps/openmpi/1.6.2/lib:/share/cluster/RHEL6.2/x86_64/apps/intel/ict/composer_xe_2013.0.079/compiler/lib/intel64:/share/cluster/RHEL6.2/x86_64/apps/intel/ict/composer_xe_2013.0.079/mkl/lib/intel64:/usr/local/cuda-6.5/lib64:/usr/local/cuda-6.5/lib64/stubs/

The only libcuda.so I could find is /usr/local/cuda-6.5/lib64/stubs/libcuda.so; I cannot locate libcuda.so.1 at all.

I also searched using
find /share/cluster/RHEL6.2/x86_64/app -iname "libcuda.*"
but got lots of "permission denied" messages and no useful results.

so…

Is your RHEL 6.2 OS a 32-bit OS or a 64-bit OS?

If it is a 64-bit OS (and it seems to be), then I would agree that the inability to find /usr/lib64/libcuda.so is curious and I can't explain it. What is the result of running

nvidia-smi

on that server?

Are you building this on a cluster login or build node that doesn’t have GPUs installed?

It is 64-bit.
-bash: nvidia-smi: command not found

I think…yes, I am building this on a login node that doesn’t have GPUs installed.
They use MOAB and SLURM.

I have a feeling that there is a key piece of knowledge I'm missing. Do you mean this library can only be installed with root access on all of the GPU nodes before I can use it?

The login node most likely doesn’t have a GPU installed.
Without a GPU installed, your sysadmins may feel there is no obvious reason to install the GPU driver (the CUDA toolkit, for CUDA runtime API applications, does not depend on any driver component.)
libcuda.so (not the stubs you have been playing with) installed in /usr/lib and /usr/lib64 is put there by the driver installer, not the CUDA toolkit installer.
The response to nvidia-smi also suggests that the driver was not installed.
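
If you want to double-check whether the driver's user-space library is visible at all on a given node, a couple of read-only commands (no root needed) make it obvious:

$ /sbin/ldconfig -p | grep libcuda    # prints the dynamic linker cache entry for libcuda.so.1, if any
$ ls -l /usr/lib64/libcuda*           # the 64-bit location the driver installer normally uses

If both come back empty on the login node, that is consistent with no driver being installed there.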

Apparently libamgxsh.so depends on libcuda:

$ ldd libamgxsh.so
ldd: warning: you do not have execution permission for `./libamgxsh.so'
        linux-vdso.so.1 =>  (0x00007fff929fe000)
        libcudart.so.6.5 => /usr/local/cuda-6.5/lib64/libcudart.so.6.5 (0x00007f77bf828000)
        libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f77be97a000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f77be776000)
        libcublas.so.6.5 => /usr/local/cuda-6.5/lib64/libcublas.so.6.5 (0x00007f77bccd4000)
        libcusparse.so.6.5 => /usr/local/cuda-6.5/lib64/libcusparse.so.6.5 (0x00007f77ba3c7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f77ba1aa000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f77b9ea2000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f77b9b9a000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f77b9984000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f77b95c5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f77b93bc000)
        /lib64/ld-linux-x86-64.so.2 (0x000000319aa00000)
$

Having said all this, it is perhaps not an issue if your compute nodes have a proper GPU driver installed on them. Note that the initial inquiry was around a warning, not an error:

ld: warning: libcuda.so.1, needed by ../lib/libamgxsh.so, not found.

That does not necessarily indicate a problem. It means that in the current environment, libcuda.so cannot be located. However, if the environment you will run in is a compute node with a GPU and a GPU driver, then libcuda.so will most likely be in the proper place and will be found on that machine at run time.

In other words, see what happens in the actual run environment (i.e., just ignore this warning in the build environment).
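
For example (treat this as a sketch; the exact SLURM options depend on how your site is configured), you could run ldd on a GPU compute node instead of the login node:

$ srun -N1 --gres=gpu:1 ldd /path/to/amgx/lib/libamgxsh.so | grep libcuda    # placeholder path

On a node with the driver installed, libcuda.so.1 should resolve to something like /usr/lib64/libcuda.so.1 instead of showing "not found".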

Thank you for the detailed reply txbob!

[gnode025.local:80140] mca: base: component_find: unable to open /share/cluster/RHEL6.2/x86_64/apps/openmpi/1.6.2/lib/openmpi/mca_mtl_mxm: libmxm.so.0: cannot open shared object file: No such file or directory (ignored)
[gnode025.local:80141] mca: base: component_find: unable to open /share/cluster/RHEL6.2/x86_64/apps/openmpi/1.6.2/lib/openmpi/mca_mtl_mxm: libmxm.so.0: cannot open shared object file: No such file or directory (ignored)
Process 0 selecting device 0
Process 1 selecting device 1
License acquired, proceeding
AMGX version 1.2.0-build108
Built on Dec 22 2014, 10:33:38
Compiled with CUDA Runtime 6.5, using CUDA driver 7.0
License acquired, proceeding

Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ;

Reading matrix dimensions in file: ./matrix.mtx
Reading data…
RHS vector was not found. Using RHS b=[1,…,1]^T
Solution vector was not found. Setting initial solution to x=[0,…,0]^T
Finished reading
AMG Grid:
Number of Levels: 1
LVL ROWS NNZ SPRSTY Mem (GB)
--------------------------------------------------------------
0(D) 12 61 0.424 1.1e-06
--------------------------------------------------------------
Grid Complexity: 1
Operator Complexity: 1
Total Memory Usage: 1.09896e-06 GB
--------------------------------------------------------------
iter Mem Usage (GB) residual rate
--------------------------------------------------------------
Ini 0.346046 3.464102e+00
0 0.346046 3.166381e+00 0.9141
1 0.3460 3.046277e+00 0.9621
2 0.3460 2.804132e+00 0.9205
3 0.3460 2.596292e+00 0.9259
4 0.3460 2.593806e+00 0.9990
5 0.3460 3.124839e-01 0.1205
6 0.3460 5.373423e-02 0.1720
7 0.3460 9.795357e-04 0.0182
8 0.3460 3.679172e-13 0.0000
--------------------------------------------------------------
Total Iterations: 9
Avg Convergence Rate: 0.0362
Final Residual: 3.679172e-13
Total Reduction in Residual: 1.062085e-13
Maximum Memory Usage: 0.346 GB
--------------------------------------------------------------
Total Time: 0.0103681
setup: 0.00306624 s
solve: 0.00730186 s
solve(per iteration): 0.000811317 s

This is the result I got. I am only worried about the first two lines; that's not a problem, I hope? Everything is the same as the example shown in readme.txt except for the libmxm.so.0 error. I don't know whether this will cause more trouble in later tests, which is why I am trying to eliminate all potential dangers.

I don't see any CUDA or AMGX issues. The first two lines have to do with your Open MPI installation, not with AMGX itself. There may be something up with your Open MPI installation, but I'm not really sure about that.
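
If you want to confirm that the missing libmxm.so.0 only affects an optional Open MPI transport component, ompi_info (which ships with Open MPI) can show whether that component loaded at all; this is just a suggestion:

$ ompi_info | grep -i mxm    # lists MXM-related components, if any were built and could be loaded

If nothing MXM-related shows up, Open MPI simply falls back to another transport, which is why the warning says "(ignored)".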

Thank you again, you are great!