CUDA Dynamic Parallelism undefined reference to __fatbinwrap

(This is a cross-post of a Stack Overflow question.)

I have a program containing separately-compiled CUDA and Thrust code (thrust_search.cu), built as follows:

nvcc -c -I/path/to/thrust/ ./src/thrust_search.cu

pgcpp -acc -Minfo -I/path/to/thrust/ -I./ -lrt -I/opt/pgi/linux86-64/2014/cuda/6.5/include/ -L/opt/pgi/linux86-64/2014/cuda/6.5/lib64/ -lcurand -lcudart -o main main.cpp thrust_search.o

The program builds and runs fine, but I’d like to activate Dynamic Parallelism. This requires relocatable device code, sm_35 and the cudadevrt library. Furthermore, the use of relocatable device code requires that the device code be compiled and linked in two separate steps. I therefore changed to the following build commands:

nvcc --gpu-architecture=sm_35 --device-c -I/path/to/thrust/ ./src/thrust_search.cu
nvcc --gpu-architecture=sm_35 --device-link thrust_search.o --output-file link.o -lcudadevrt 

pgcpp -acc -Minfo -I/path/to/thrust/ -I./ -lrt -I/opt/pgi/linux86-64/2014/cuda/6.5/include/ -L/opt/pgi/linux86-64/2014/cuda/6.5/lib64/ -lcurand -lcudart -lcudadevrt -o main main.cpp thrust_search.o link.o

I’m now getting the following errors on compilation:

nvlink warning : SM Arch ('sm_20') not found in 'thrust_search.o'
nvlink warning : SM Arch ('sm_30') not found in 'thrust_search.o'
link.o: In function `__cudaRegisterLinkedBinary_66_tmpxft_00007dce_00000000_12_cuda_device_runtime_compute_50_cpp1_ii_5f6993ef':
link.stub:(.text+0x98): undefined reference to `__fatbinwrap_66_tmpxft_00007dce_00000000_12_cuda_device_runtime_compute_50_cpp1_ii_5f6993ef'
pgacclnk: child process exit status 1: /usr/bin/ld

Similar problems I was able to find elsewhere (1, 2, 3, 4, 5) all seem to have been fixed by linking the cudadevrt or cudart library, specifying the sm_35 architecture and compiling and linking the device code in two steps as I’m already doing.

My LD_LIBRARY_PATH contains the path to the libcudadevrt.a file, /usr/local/cuda/lib64, so I do believe that the library is being found. It’s like the library isn’t actually getting linked in. By the way, the error arises only at the pgcpp command stage, not during nvcc compilation or linkage. I’m thinking the problem might have something to do with confusion between PGI CUDA libraries in /opt/pgi/linux86-64/2014/cuda/6.5/lib64/ and the NVIDIA CUDA libraries in /usr/local/cuda/lib64/ which both contain the libcudadevrt.a file.
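One way to check the suspicion about duplicate libraries is to list every copy of libcudadevrt.a in the directories involved. The sketch below assumes the two directories named above; find_lib is a hypothetical helper name, not part of either toolchain. As far as I know, the linker resolves static archives such as libcudadevrt.a through -L paths and its default search directories rather than through LD_LIBRARY_PATH, so the copy that gets used is most likely decided by the -L flags on the pgcpp line.

```shell
#!/bin/sh
# List every copy of a static library found in the given directories,
# so duplicate installs (e.g. PGI-bundled vs. NVIDIA toolkit) are easy to spot.
# find_lib is a hypothetical helper name used only for this sketch.
find_lib() {
    lib="$1"; shift
    for dir in "$@"; do
        if [ -f "$dir/$lib" ]; then
            printf '%s\n' "$dir/$lib"
        fi
    done
}

# Directories taken from the post; adjust them to the local installation.
find_lib libcudadevrt.a \
    /opt/pgi/linux86-64/2014/cuda/6.5/lib64 \
    /usr/local/cuda/lib64
```

If both paths are printed, two copies exist, and the one in the first matching -L directory is the one a plain -lcudadevrt would normally pick up.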

Hi LO_UZH,

PGI uses RDC by default, so linking shouldn’t be a problem. However, also by default, we generate binaries for several compute capabilities. To specifically target compute capability 3.5, add the flag “-ta=tesla:cc35”. This is similar to specifying “--gpu-architecture=sm_35”.

pgcpp -acc -ta=tesla:cc35 -Minfo -I/path/to/thrust/ -I./ -lrt -I/opt/pgi/linux86-64/2014/cuda/6.5/include/ -L/opt/pgi/linux86-64/2014/cuda/6.5/lib64/ -lcurand -lcudart -o main main.cpp thrust_search.o

Please let us know if this works for you.

Best Regards,
Mat

Hi Mat,

That didn’t fix it unfortunately. The problem doesn’t seem to be with the specification of the sm_35 architecture, but actually the nvcc linking stage. If I remove the linking stage and do not specify device relocatable code, compilation works fine:

nvcc --gpu-architecture=sm_35 -c -I/home/cef13_pp/thrust-v1.8/ ./src/thrust_search.cu -lcudadevrt

pgcpp -acc -ta=tesla:cc35 -Minfo -I/path/to/thrust/ -I./ -lrt -I/opt/pgi/linux86-64/2014/cuda/6.5/include/ -L/opt/pgi/linux86-64/2014/cuda/6.5/lib64/ -lcurand -lcudart -lcudadevrt -o main main.cpp thrust_search.o

Hi LO_UZH,

Could you please send an example to PGI Customer Service (trs@pgroup.com) and ask them to send it to me?

I’d like to try to reproduce the problem here and see exactly what’s going on.

Thanks,
Mat

Hi Mat,

I just sent it in. Hope we can figure out what’s wrong!

Thanks!

Hi Laurent,

It took me a bit, but I was able to recreate the error. At first I was using PGI with CUDA 6.5 and it linked fine (though the executable got a runtime error). Finally, I moved to CUDA 7.0 and replicated the error. The fix was to just add “-Mcuda -pgf90libs”.

Note that we just released CUDA 7.0 support in the 15.4 compilers. I wasn’t sure which compiler and CUDA version you are using but I could only reproduce the link error in 15.4 with CUDA 7.0.
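Because the link error here depended on the compiler release and CUDA version, it can help to record which tools a build actually resolves to. The following is a minimal sketch; report_tool is a hypothetical helper name, and the tool names are simply the ones that appear in this thread.

```shell
#!/bin/sh
# Print where each build tool resolves on PATH, or note that it is missing.
# report_tool is a hypothetical helper name used only for this sketch.
report_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        printf '%s: %s\n' "$1" "$tool_path"
    else
        printf '%s: not found\n' "$1"
    fi
}

# Tool names taken from the commands in this thread.
for tool in nvcc pgcpp pgc++; do
    report_tool "$tool"
done

# nvcc reports its CUDA release with --version; run it only if nvcc is present.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
fi
```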

Here’s my output:

% make -f makefile_error
nvcc --gpu-architecture=sm_35 --device-c -I/proj/qa/support/LO_UZH/thrust ./src/thrust_search.cu
nvcc --gpu-architecture=sm_35 --device-link thrust_search.o --output-file link.o -lcudadevrt
pgc++ -w -V15.4 -acc -ta=tesla:cc35,cuda7.0 -L/proj/pgi/linux86-64/2015/cuda/7.0/lib64 -I./ -I/proj/pgi/linux86-64/2015/cuda/7.0/include  -lrt -lcurand -lcudart -lcudadevrt -o wrapper wrapper.cpp ./src/demand.cpp ./src/excessdemand.cpp ./src/marketclearing.cpp ./src/raberto01.cpp ./src/standard_deviation.cpp ./src/supply.cpp thrust_search.o link.o
wrapper.cpp:
./src/demand.cpp:
./src/excessdemand.cpp:
./src/marketclearing.cpp:
./src/raberto01.cpp:
./src/standard_deviation.cpp:
./src/supply.cpp:
link.o: In function `__cudaRegisterLinkedBinary_66_tmpxft_000073f4_00000000_16_cuda_device_runtime_compute_52_cpp1_ii_8b1a5d37':
link.stub:(.text+0x98): undefined reference to `__fatbinwrap_66_tmpxft_000073f4_00000000_16_cuda_device_runtime_compute_52_cpp1_ii_8b1a5d37'
pgacclnk: child process exit status 1: /usr/bin/ld
make: *** [wrapper] Error 2

% make -f makefile_error
nvcc --gpu-architecture=sm_35 --device-c -I/proj/qa/support/LO_UZH/thrust ./src/thrust_search.cu
nvcc --gpu-architecture=sm_35 --device-link thrust_search.o --output-file link.o -lcudadevrt
pgc++ -w -V15.4 -acc -ta=tesla:cc35,cuda7.0 -Mcuda -pgf90libs -L/proj/pgi/linux86-64/2015/cuda/7.0/lib64 -I./ -I/proj/pgi/linux86-64/2015/cuda/7.0/include  -lrt -lcurand -lcudart -lcudadevrt -o wrapper wrapper.cpp ./src/demand.cpp ./src/excessdemand.cpp ./src/marketclearing.cpp ./src/raberto01.cpp ./src/standard_deviation.cpp ./src/supply.cpp thrust_search.o link.o
wrapper.cpp:
./src/demand.cpp:
./src/excessdemand.cpp:
./src/marketclearing.cpp:
./src/raberto01.cpp:
./src/standard_deviation.cpp:
./src/supply.cpp:
% wrapper
25.4666
- Mat