Link-time optimization with CUDA on Linux (-flto)

callum.burns · November 24, 2017, 5:36pm

GCC supports link-time optimization through the use of the flag -flto when building all compilation units and at the link stage. This can work like so:

g++ -flto -c -o main.o main.cpp
g++ -flto -c -o lib.o lib.cpp
g++ -flto -o program main.o lib.o

This works fine (using both GCC 6.4.0 and GCC 7.2.0) for regular C++ projects. However, trying this with CUDA:

nvcc -std=c++11 -Xcompiler -flto -c -o main.o main.cu
nvcc -std=c++11 -Xcompiler -flto -c -o lib.o lib.cu
nvcc -std=c++11 -Xcompiler -flto -o program main.o lib.o

gives an error at the linking stage:

/tmp/ccH4zaS9.s: Assembler messages:
/tmp/ccH4zaS9.s:184: Error: symbol `fatbinData' is already defined
/tmp/ccH4zaS9.s:388: Error: symbol `fatbinData' is already defined
lto-wrapper: fatal error: /usr/x86_64-pc-linux-gnu/gcc-bin/6.4.0/g++ returned 1 exit status
compilation terminated.
/usr/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: lto-wrapper failed

It seems to give as many "symbol `fatbinData' is already defined" errors as there are compilation units being linked. I get the same problem both with a simple empty test program with just a main() building to an executable and in a real project building to a shared library. I also see the same thing with both GCC 5.4.0 + CUDA 8 and GCC 6.4.0 + CUDA 9.

Has anyone managed to get around this issue, or does anyone know of another way to get LTO to work with CUDA?

wlangdon · January 11, 2018, 1:33pm

Dear callum.burns,
Did you get anywhere with this?

I am compiling C code with gcc and my CUDA C++ code with nvcc,
and linking them with nvcc.

When the linker command line includes -Xcompiler -flto it fails saying:

/tmp/tmpxft_…fatbin.c:660:33: warning: type of ‘fatbinData’ does not match original declaration [enabled by default]
/tmp/tmpxft_…_cuda_dlink.fatbin.c:34:33: note: previously declared here
extern const unsigned long long fatbinData[85];

/tmp/ccHXDZl1.s: Assembler messages:
/tmp/ccHXDZl1.s:981: Error: symbol `fatbinData' is already defined
lto-wrapper: g++ returned 1 exit status
/usr/bin/ld: lto-wrapper failed
collect2: error: ld returned 1 exit status

The workaround at present is to do without link-time optimisation.
Did you try -Xcompiler -fno-lto?

Bill
ps: this is with
gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
Cuda compilation tools, release 7.0, V7.0.27

callum.burns · January 11, 2018, 1:48pm

Hi Bill,

I can compile CUDA programs just fine without LTO (simply by omitting -Xcompiler -flto). LTO isn’t exactly critical for any application, but it would be nice if this optimization feature of GCC / Clang could be exploited when using CUDA as it can speed up (the host code of) certain applications without the need to inline in header files all functions needed by hot code paths.

I haven’t yet figured out any workaround for this with nvcc.

njuffa · January 11, 2018, 6:42pm

I would suggest filing an enhancement request with NVIDIA. Use the online bug reporting form and prefix the synopsis with "RFE: " to mark it as an enhancement request rather than a functional bug.

If you can add data on the (estimated) positive performance impact on your CUDA-accelerated application, that might be helpful, as there is likely a cost/benefit analysis when NVIDIA decides which enhancement requests to consider for a future CUDA version.

BulatZiganshin · January 12, 2018, 9:33am

One possibility is to compile parts of the program independently, putting them into DLLs, and then optionally link the DLLs back into the exe with special programs. Of course, this way you will have multiple copies of the runtime, separate heaps, and the other usual DLL shortcomings.

wlangdon · March 10, 2018, 11:16am

Dear callum.burns and njuffa

I have submitted an RFE bug report as suggested, see
https://developer.nvidia.com/nvidia_bug/2081246

Bill

kiroma · April 14, 2018, 5:49pm

The error you’re getting is caused by the compiler treating an LTO object (fatbinData) as a regular part of the code. It seems NVCC is not capable of handling LTO at this moment.

ragerdl · May 31, 2019, 11:28pm

Tried to +1 your feature request but ran into some sort of “membership required.” Having to put implementations in header files so nvcc can inline them is a bit annoying, but workable at least.
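
For anyone landing here later, here is a minimal sketch of one arrangement that may sidestep the clash, under stated assumptions. The logs above suggest the error appears when -flto reaches the host code that nvcc generates (the fatbinData stubs), so the idea is to keep -flto away from every .cu file and away from the nvcc link step. Hot host code lives in plain .cpp files built by g++ with -flto, the .cu file is built by nvcc without it, and g++ does the final link against the CUDA runtime. The file names (main.cpp, hot.cpp, kernels.cu), the CUDA_HOME default of /usr/local/cuda, and the run/build helpers are all hypothetical. It also assumes no relocatable device code (-rdc), which would need a separate device-link step. The script prints the commands by default and only executes them with DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: host-side LTO next to CUDA, with -flto kept away from nvcc.
# File names and the CUDA_HOME default are assumptions, not from the thread.

CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"   # assumed toolkit location
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print only, 0 = also execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

build() {
    # Plain C++ units: compiled by g++ with LTO enabled.
    run g++ -O2 -flto -c -o main.o main.cpp &&
    run g++ -O2 -flto -c -o hot.o hot.cpp &&
    # CUDA unit: no -flto, so the generated fatbinData stub is an ordinary object.
    run nvcc -std=c++11 -O2 -c -o kernels.o kernels.cu &&
    # Link with g++ instead of nvcc; LTO only spans main.o and hot.o.
    run g++ -O2 -flto -o program main.o hot.o kernels.o \
        -L"$CUDA_HOME/lib64" -lcudart
}

build
```

This only optimizes across the .cpp units. Host functions defined inside kernels.cu are still compiled as a regular object, so anything hot that needs cross-unit inlining would have to live in a .cpp file.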
