Using dlink-time-opt together with gencode in CMAKE

_CUDACC_x · January 5, 2021, 5:10pm

I am trying to use the new device link-time optimization flag -dlto, which was added with CUDA 11 (NVCC :: CUDA Toolkit Documentation), within a CMake project.

Setting the following works for me, and the performance is the same as when all of the critical algorithms are defined in the header:

set(CMAKE_CUDA_FLAGS "-use_fast_math -dlto -arch=sm_70")

However, I need to be able to compile the code for several architectures, which is why I need to use gencode. What is the correct syntax for doing so?

I tried several things:
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=sm_75 -dlto")
This results in an error when processing the CMake file:
nvcc fatal : '-dlto' conflicts with '-gencode' to control what is generated; use 'code=lto_<arch>' with '-gencode' instead of '-dlto' to request lto intermediate

Accordingly, I tried to use
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_75,code=lto_75")

which results in the error
nvlink fatal : Link target of 'lto_75' is virtual target that is not JIT-able; use 'sm_' target instead

I then also tried
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=[lto_75,sm_75]")
which compiles, but runs with the same poor performance as when omitting the -dlto flag.

mark.takatz · March 5, 2021, 7:26pm

How I got the compile/link options to work in a bash script:
#compile
… -gencode -arch=compute_70 -code=lto_70 --device-c <.cu files>
#link
… -dc -arch=compute_70 -code=sm_70 --device-link <.o files> --cubin --output-file <.cubin file>

I imagine this works for a shared library instead of a cubin, but I haven’t verified that yet. I’m also using -std=c++17 to resolve some inlining issues, if that matters.

soeren_bs · March 26, 2021, 2:43pm

I am also really interested in how the LTO can be used cleanly together with cmake. I also added
set(CMAKE_CUDA_FLAGS "-dlto -arch=sm_70")
to my CMakeLists.txt, and this works quite smoothly as well.

But actually, I would like to use the new CMake variable CMAKE_CUDA_ARCHITECTURES, which probably cannot yet work together with the LTO flag. So if anyone has a good idea, I would be very happy to hear it.
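
For readers on a newer toolchain: as far as I know, CMake 3.25 and later can request device LTO for nvcc through the INTERPROCEDURAL_OPTIMIZATION target property, which should let CMAKE_CUDA_ARCHITECTURES drive the architecture list instead of hand-written -gencode flags. A minimal sketch under those assumptions (CMake 3.25+, CUDA 11.2+); the project, target, and file names are placeholders, and it has not been tested against the code discussed in this thread:

```cmake
# Sketch only: assumes CMake >= 3.25 and nvcc from CUDA 11.2 or newer.
cmake_minimum_required(VERSION 3.25)
project(dlto_example LANGUAGES CUDA)   # hypothetical project name

# Architectures to build for; CMake derives the per-architecture flags from this list.
set(CMAKE_CUDA_ARCHITECTURES 70 75)

# main.cu and kernels.cu are placeholder file names.
add_executable(app main.cu kernels.cu)

set_target_properties(app PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON      # relocatable device code, needed for a device link step
  INTERPROCEDURAL_OPTIMIZATION ON)   # asks CMake to enable device LTO for the CUDA sources
```

Running the build with verbose output (for example cmake --build . --verbose) shows which flags CMake actually passes to nvcc, which is worth checking before relying on this.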

phw89 · May 4, 2021, 3:47pm

If it helps, I was able to get CUDA link-time optimization working in MSVC when compiling for multiple architectures (in this case, a fatbinary containing PTX for sm_52, and SASS for sm_52 and sm_61).

I used the following compile/link options:

Compile
-gencode=arch=compute_52,code=[compute_52,lto_52,lto_61]

Link
-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto

Note the -dlto option is required (but only at link time).

By looking at the generated SASS of the resulting binary, I was able to verify that link-time optimization was indeed occurring by checking that __device__ functions defined in .cu source files (rather than header files) were inlined.

phw89 · May 10, 2021, 8:27am

After some testing, it appears that when using DLTO you actually need to specify multiple -gencode options (i.e. one for each virtual arch / LTO intermediate arch pair); otherwise I got odd runtime errors.

Since I can’t edit my existing post, the correct options (at least by my testing) are as follows:

Compile

-gencode=arch=compute_52,code=[compute_52,lto_52]
-gencode=arch=compute_61,code=lto_61

Link

-gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_61,code=sm_61
-dlto

The following also worked at link time, but I am not sure which is the “correct” way:

-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto
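
In CMake, the per-architecture compile and link options above could be attached to a target roughly as follows. This is an untested sketch, assuming CMake 3.18 or newer (for the DEVICE_LINK generator expression and for switching off automatic architecture flags with CUDA_ARCHITECTURES OFF); the project name, the target name app, and the source file names are hypothetical.

```cmake
# Untested sketch; assumes CMake >= 3.18. Project, target, and file names are placeholders.
cmake_minimum_required(VERSION 3.18)
project(dlto_manual LANGUAGES CUDA)

add_executable(app main.cu kernels.cu)

set_target_properties(app PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON   # compile with relocatable device code
  CUDA_ARCHITECTURES OFF)         # keep CMake from adding its own arch flags

# Compile step: one -gencode per virtual arch / LTO intermediate pair.
target_compile_options(app PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_52,code=[compute_52,lto_52]>"
  "$<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_61,code=lto_61>")

# Device link step: real sm_ targets plus -dlto.
# $<COMMA> keeps each -gencode option in one piece inside the generator expression.
target_link_options(app PRIVATE
  "$<DEVICE_LINK:-gencode=arch=compute_52$<COMMA>code=sm_52>"
  "$<DEVICE_LINK:-gencode=arch=compute_61$<COMMA>code=sm_61>"
  "$<DEVICE_LINK:-dlto>")
```

Using target-level options instead of CMAKE_CUDA_FLAGS keeps the compile-only and link-only flags separate, which matters here because -dlto is only wanted at the device link step.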
