Link-time optimization with CUDA on Linux (-flto)

GCC supports link-time optimization through the use of the flag -flto when building all compilation units and at the link stage. This can work like so:

g++ -flto -c -o main.o main.cpp
g++ -flto -c -o lib.o lib.cpp
g++ -flto -o program main.o lib.o

This works fine (using both GCC 6.4.0 and GCC 7.2.0) for regular C++ projects. However, trying this with CUDA:

nvcc -std=c++11 -Xcompiler -flto -c -o main.o main.cu
nvcc -std=c++11 -Xcompiler -flto -c -o lib.o lib.cu
nvcc -std=c++11 -Xcompiler -flto -o program main.o lib.o

gives an error at the linking stage:

/tmp/ccH4zaS9.s: Assembler messages:
/tmp/ccH4zaS9.s:184: Error: symbol `fatbinData' is already defined
/tmp/ccH4zaS9.s:388: Error: symbol `fatbinData' is already defined
lto-wrapper: fatal error: /usr/x86_64-pc-linux-gnu/gcc-bin/6.4.0/g++ returned 1 exit status
compilation terminated.
/usr/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: lto-wrapper failed

It seems to give as many "symbol `fatbinData' is already defined" errors as there are compilation units being linked. I get the same problem both with a simple empty test program with just a main() building to an executable and in a real project building to a shared library. I also see the same thing with both GCC 5.4.0 + CUDA 8 and GCC 6.4.0 + CUDA 9.

Has anyone managed to get around this issue, or know of any other way to get LTO to work with CUDA?

Dear callum.burns,
Did you get anywhere with this?

I am compiling C code with gcc and my CUDA C++ code with nvcc
and linking them with nvcc.

When the linker command line includes -Xcompiler -flto it fails saying:

/tmp/tmpxft_…fatbin.c:660:33: warning: type of ‘fatbinData’ does not match original declaration [enabled by default]
/tmp/tmpxft_…_cuda_dlink.fatbin.c:34:33: note: previously declared here
extern const unsigned long long fatbinData[85];

/tmp/ccHXDZl1.s: Assembler messages:
/tmp/ccHXDZl1.s:981: Error: symbol `fatbinData’ is already defined
lto-wrapper: g++ returned 1 exit status
/usr/bin/ld: lto-wrapper failed
collect2: error: ld returned 1 exit status

The workaround at present is to do without link-time optimisation.
Did you try -Xcompiler -fno-lto?

Bill
ps: this is with
gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
Cuda compilation tools, release 7.0, V7.0.27

Hi Bill,

I can compile CUDA programs just fine without LTO (simply by omitting -Xcompiler -flto). LTO isn’t exactly critical for any application, but it would be nice if this optimization feature of GCC / Clang could be exploited when using CUDA as it can speed up (the host code of) certain applications without the need to inline in header files all functions needed by hot code paths.
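
For illustration, the header-inlining alternative mentioned above can be sketched roughly as follows. This is only a minimal sketch: the names scale and sum_scaled, and the split into lib.h and main.cpp, are hypothetical and not taken from any real project. Both parts are shown in one block for brevity.

```cpp
// --- would live in a header such as lib.h (hypothetical name) ---
#include <cstddef>
#include <vector>

// Defined inline in the header so that every translation unit including it
// sees the function body and can inline the call without relying on LTO.
inline double scale(double x, double factor) {
    return x * factor;
}

// --- would live in the host code of main.cpp or main.cu (hypothetical) ---
// Hot loop that calls the small helper. Because the definition of scale()
// is visible in this translation unit, the compiler is free to inline it.
double sum_scaled(const std::vector<double>& values, double factor) {
    double total = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += scale(values[i], factor);
    }
    return total;
}
```

With LTO the definition of scale could instead stay in lib.cpp and still be inlined at link time, which is the convenience that is lost here.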

I haven’t yet figured out any workaround for this with nvcc.

I would suggest filing an enhancement request with NVIDIA. Use the online bug reporting form and prefix the synopsis with "RFE: " to mark it as an enhancement request rather than a functional bug.

If you can add data on the (estimated) positive performance impact on your CUDA-accelerated application, that might be helpful, as there is likely a cost/benefit analysis when NVIDIA decides which enhancement requests they will consider for a future CUDA version.

One possibility is to compile parts of the program independently, putting them into DLLs, and then optionally link the DLLs back into the exe with special programs. Of course, this way you will have multiple copies of the runtime, separate heaps and the other DLL shortcomings.

Dear callum.burns and njuffa

I have submitted an RFE bug report as suggested, see
https://developer.nvidia.com/nvidia_bug/2081246

Bill

The error you’re getting is caused by the compiler treating an LTO object (fatbinData) as a regular part of the code. It seems NVCC is not capable of handling LTO at this moment.

Tried to +1 your feature request but ran into some sort of “membership required.” Having to put implementations in header files so nvcc can inline is a bit annoying but workable at least.