I wrote a program on Ubuntu 16.04 with CUDA 9.1 and driver R390.30, built with static linkage (-Xcompiler -static). It runs correctly and produces the expected results.
Compiling on a late-2012 MBP (GT650M) with Xcode 9.4 and CUDA 10 also works without any trouble, and the program runs just fine.
However, I need to run this on a machine at work where I have no admin access and am not allowed to install anything, so I have to use the statically linked executable built on Ubuntu. The C++ part runs correctly, but the program segfaults at the kernel function.
The machine runs RHEL 6.10 with a GRID P40-8Q and driver 390.75. Unfortunately I can’t compile on it.
These are the last lines of strace on the process just before it segfaults:
...
open("/global/lib//libnvidia-fatbinaryloader.so.390.75", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/global/lib/libnvidia-fatbinaryloader.so.390.75", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libnvidia-fatbinaryloader.so.390.75", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000Y\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=291496, ...}) = 0
mmap(NULL, 2407944, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1d720fc000
mprotect(0x7f1d72139000, 2093056, PROT_NONE) = 0
mmap(0x7f1d72338000, 45056, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3c000) = 0x7f1d72338000
mmap(0x7f1d72343000, 19976, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1d72343000
close(3) = 0
open("/global/distlib//ld-linux-x86-64.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/global/distlib/ld-linux-x86-64.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/global/lib//ld-linux-x86-64.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/global/lib/ld-linux-x86-64.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib64/ld-linux-x86-64.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\v@\3450\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=161776, ...}) = 0
mmap(0x30e5400000, 2236816, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x30e5400000
mprotect(0x30e5420000, 2097152, PROT_NONE) = 0
mmap(0x30e5620000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000) = 0x30e5620000
mmap(0x30e5622000, 400, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x30e5622000
close(3) = 0
mprotect(0x30e5620000, 4096, PROT_READ) = 0
mprotect(0x30e5f8a000, 16384, PROT_READ) = 0
mprotect(0x30e5a02000, 4096, PROT_READ) = 0
mprotect(0x30e6217000, 4096, PROT_READ) = 0
mprotect(0x30e6e06000, 4096, PROT_READ) = 0
mprotect(0x30e6682000, 4096, PROT_READ) = 0
set_tid_address(0x202bb90) = 11361
set_robust_list(0x202bba0, 24) = 0
futex(0x7ffcda6549ac, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ffcda6549ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 202b8c0) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x30e6005cb0, [], SA_RESTORER|SA_SIGINFO, 0x30e600f7e0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x30e6005d40, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x30e600f7e0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM64_INFINITY}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++
There are a few “No such file or directory” lines in there. There must be something missing from this executable. What would you suggest I inspect or try?
=========== EDIT ===========
I’ve added the paths of these missing libs to my LD_LIBRARY_PATH and tried again. strace shows that the process now looks for the files there, but the lookups still fail with “No such file or directory”.
It fails around the launch of a kernel function, which is preceded by a few thrust::device_vector declarations.
The sections that use thrust::host_vector run correctly, but my next printf comes right after the kernel finishes, so I don’t know whether it crashed while allocating a device_vector or while running the kernel.
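One way to narrow this down is to print a checkpoint to stderr after every CUDA step, since stderr is unbuffered and the message gets out even if the process segfaults right afterwards. The following is only a minimal sketch under assumptions: the `scale` kernel and the `check` helper are hypothetical stand-ins for the real code, and the vector size is arbitrary. The runtime calls used (cudaDriverGetVersion, cudaRuntimeGetVersion, cudaFree, cudaGetLastError, cudaDeviceSynchronize) are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <exception>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Hypothetical stand-in for the real kernel.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: report the stage and its status on stderr
// (unbuffered), so the line is written before any later crash.
static bool check(const char* stage, cudaError_t err) {
    std::fprintf(stderr, "[%s] %s\n", stage, cudaGetErrorString(err));
    return err == cudaSuccess;
}

int main() {
    int driver = 0, runtime = 0;
    if (!check("cudaDriverGetVersion", cudaDriverGetVersion(&driver))) return 1;
    if (!check("cudaRuntimeGetVersion", cudaRuntimeGetVersion(&runtime))) return 1;
    std::fprintf(stderr, "driver supports CUDA %d, runtime is %d\n", driver, runtime);

    // cudaFree(0) forces context creation, which is where the driver
    // libraries get loaded. A crash here happens before any kernel runs.
    if (!check("context creation", cudaFree(0))) return 1;

    const int n = 1 << 20;  // arbitrary size for the sketch
    try {
        std::fprintf(stderr, "[device_vector] allocating\n");
        thrust::device_vector<float> d(n, 1.0f);
        std::fprintf(stderr, "[device_vector] ok\n");

        scale<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(d.data()), n, 2.0f);
        // Launch errors (bad configuration, no matching device code) show up here.
        if (!check("kernel launch", cudaGetLastError())) return 1;
        // Errors raised while the kernel executes show up here.
        if (!check("kernel execution", cudaDeviceSynchronize())) return 1;
    } catch (const std::exception& e) {
        // Thrust reports allocation and launch failures as exceptions.
        std::fprintf(stderr, "[thrust] exception: %s\n", e.what());
        return 1;
    }
    return 0;
}
```

If the last line printed is "[device_vector] allocating", the crash is in the allocation or in the fill that the device_vector constructor performs on the device, not in the kernel itself. If nothing is printed after the version lines, the failure is most likely in context creation, which would point at library loading rather than at the kernel code.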
=========== EDIT 2 ============
I added /lib64 so it could find ld-linux-x86-64.so.2. All of these files are found now.
Maybe I need to use the ld-linux-x86-64.so.2 from the glibc I compiled against?
If any of you have successfully compiled and statically linked a CUDA program on one distro and run it on another, let me know if you have any advice.