Using --dlink-time-opt (-dlto) together with -gencode in CMake

I am trying to use the new link-time optimization flag -dlto, which was added with CUDA 11 (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#optimization-of-separate-compilation), within a CMake project.

Setting the following works for me, and the performance is the same as when all of the critical algorithms are defined in the header:

set(CMAKE_CUDA_FLAGS "-use_fast_math -dlto -arch=sm_70")

However, I need to be able to compile the code for several architectures, which is why I need to use gencode. What is the correct syntax for doing so?

I tried several things:
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=sm_75 -dlto")
results in an error when processing the CMake file:
nvcc fatal : '-dlto' conflicts with '-gencode' to control what is generated; use 'code=lto_<arch>' with '-gencode' instead of '-dlto' to request lto intermediate

Accordingly, I tried to use
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_75,code=lto_75")

which results in the error
nvlink fatal : Link target of 'lto_75' is virtual target that is not JIT-able; use 'sm_' target instead

I then also tried
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=[lto_75,sm_75]")
which compiles, but runs with the same bad performance as when omitting the -dlto flag.


How I got the compile/link options to work in a bash script:
#compile
… -gencode -arch=compute_70 -code=lto_70 --device-c <.cu files>
#link
… -dc -arch=compute_70 -code=sm_70 --device-link <.o files> --cubin --output-file <.cubin file>

I imagine this works for a shared library instead of a cubin, but I haven’t verified that yet. I’m also using -std=c++17 to resolve some inlining issues, if that matters.

I am also very interested in how LTO can be used cleanly together with CMake. I added
set(CMAKE_CUDA_FLAGS "-dlto -arch=sm_70")
to my CMakeLists.txt, and this works quite smoothly as well.

What I would actually like, though, is to use the new CMake variable CMAKE_CUDA_ARCHITECTURES, which probably does not yet work together with the LTO flag. So if someone has a good idea, I would be very happy to hear it.
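
For what it is worth, newer CMake releases (3.25 and later, as far as I know) can request device LTO through the INTERPROCEDURAL_OPTIMIZATION target property when the CUDA compiler is nvcc, which would leave CMAKE_CUDA_ARCHITECTURES in charge of the architecture flags. A minimal, untested sketch of that idea follows; the project, target, and source file names are made up for illustration:

```cmake
# Assumes CMake >= 3.25 and nvcc as the CUDA compiler.
cmake_minimum_required(VERSION 3.25)

# Architectures to build for; initializes CUDA_ARCHITECTURES on targets created below.
set(CMAKE_CUDA_ARCHITECTURES 70 75)

project(dlto_demo LANGUAGES CXX CUDA)

# Hypothetical target and sources.
add_executable(app main.cu kernels.cu)

set_target_properties(app PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON      # relocatable device code, so there is a device link step
  INTERPROCEDURAL_OPTIMIZATION ON    # asks CMake to enable LTO, including device LTO for CUDA
)
```

With this approach no -dlto or -gencode flags are written by hand, so CMAKE_CUDA_FLAGS only needs to carry options such as -use_fast_math.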

If it helps, I was able to get CUDA link-time optimization working in MSVC when compiling for multiple architectures (in this case, a fatbinary containing PTX for sm_52, and SASS for sm_52 and sm_61).

I used the following compile/link options:

Compile
-gencode=arch=compute_52,code=[compute_52,lto_52,lto_61]

Link
-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto

Note the -dlto option is required (but only at link time).

By looking at the generated SASS of the resulting binary, I was able to verify that link-time optimization was indeed occurring by checking that __device__ functions defined in .cu source files (rather than header files) were inlined.

After some testing, it appears that when using DLTO, you actually need to specify multiple -gencode options (i.e. one for each virtual arch / LTO intermediary arch pair), otherwise I was getting odd runtime errors.

Since I can’t edit my existing post, the correct options (at least by my testing) are as follows:

Compile

-gencode=arch=compute_52,code=[compute_52,lto_52]
-gencode=arch=compute_61,code=lto_61

Link

-gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_61,code=sm_61
-dlto

The following also worked at link time, but I am not sure which is the “correct” way:

-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto
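
For CMake versions without built-in device LTO support, the compile and link options above could probably be split between the compile step and the device link step by hand. The following is an untested sketch under the assumption that CMake 3.18 or later is used (for the CUDA_ARCHITECTURES property and the $<DEVICE_LINK:...> generator expression); the project, target, and source file names are hypothetical:

```cmake
cmake_minimum_required(VERSION 3.18)
project(dlto_manual LANGUAGES CXX CUDA)

# Hypothetical target and sources.
add_executable(app main.cu kernels.cu)

set_target_properties(app PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON   # relocatable device code, needed for a device link step
  CUDA_ARCHITECTURES OFF          # keep CMake from adding its own architecture flags
)

# Compile step: one -gencode per virtual arch / LTO intermediate pair.
target_compile_options(app PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_52,code=[compute_52,lto_52]>"
  "$<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_61,code=lto_61>"
)

# Device link step only: real sm_ targets plus -dlto.
# $<COMMA> keeps the comma from being read as a generator-expression separator.
target_link_options(app PRIVATE
  "$<DEVICE_LINK:-gencode=arch=compute_52$<COMMA>code=sm_52>"
  "$<DEVICE_LINK:-gencode=arch=compute_61$<COMMA>code=sm_61>"
  "$<DEVICE_LINK:-dlto>"
)
```

The point of the split is that the lto_ targets only ever reach the compile step and the sm_ targets only reach the device link, which is the separation the nvlink error in the original question seems to be asking for.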