However, I need to be able to compile the code for several architectures, which is why I need to use gencode. What is the correct syntax for doing so?
I tried several things: set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=sm_75 -dlto")
results in an error when processing the CMake file: nvcc fatal : '-dlto' conflicts with '-gencode' to control what is generated; use 'code=lto_<arch>' with '-gencode' instead of '-dlto' to request lto intermediate
Accordingly, I tried to use set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_75,code=lto_75")
which results in the error nvlink fatal : Link target of 'lto_75' is virtual target that is not JIT-able; use 'sm_' target instead
I then also tried set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=[lto_75,sm_75]")
which compiles, but runs with the same bad performance as when omitting the -dlto flag.
How I got the compile/link options to work in a bash script:
#compile
… -gencode -arch=compute_70 -code=lto_70 --device-c <.cu files>
#link
… -dc -arch=compute_70 -code=sm_70 --device-link <.o files> --cubin --output-file <.cubin file>
I imagine this works for a shared library instead of a cubin, but I haven’t verified that yet. I’m also using -std=c++17 to resolve some inlining issues, if that matters.
I am also really interested in how LTO can be used cleanly together with CMake. I also added
set(CMAKE_CUDA_FLAGS "-dlto -arch=sm_70")
to my CMakeLists.txt, and this works quite smoothly as well.
But actually, I would like to go with the new CMake variable CMAKE_CUDA_ARCHITECTURES, which probably does not yet work together with the LTO flag. So if anyone has a good idea, I would be very happy to hear it.
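One possible direction, offered as a sketch rather than a verified answer: as far as I understand, CMake 3.25 and later can enable CUDA device LTO through the INTERPROCEDURAL_OPTIMIZATION target property (with a sufficiently recent nvcc), and that is meant to work alongside CMAKE_CUDA_ARCHITECTURES. The project name, target name, source files, and architecture list below are hypothetical, and the version requirement should be checked against the CMake release notes before relying on it.

```cmake
# Assumption: CMake >= 3.25 and a recent nvcc; check the release notes for your versions.
cmake_minimum_required(VERSION 3.25)

# Architectures to build for; CMake derives the -gencode style flags from this list.
set(CMAKE_CUDA_ARCHITECTURES 70 75)

project(dlto_sketch LANGUAGES CXX CUDA)

# Hypothetical target and source names, for illustration only.
add_executable(app main.cu kernels.cu)

set_target_properties(app PROPERTIES
    # Relocatable device code, the equivalent of --relocatable-device-code=true.
    CUDA_SEPARABLE_COMPILATION ON
    # Request link-time optimization; for CUDA this is assumed to map to device LTO.
    INTERPROCEDURAL_OPTIMIZATION ON)

# Flags unrelated to architecture selection can still be attached per target.
target_compile_options(app PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-use_fast_math>)
```

With this layout no -gencode or -dlto strings are written into CMAKE_CUDA_FLAGS by hand. Running a verbose build (for example cmake --build . --verbose) shows which flags CMake actually passes to nvcc, which is the quickest way to confirm that LTO is really being requested.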
If it helps, I was able to get CUDA link-time optimization working in MSVC when compiling for multiple architectures (in this case, a fatbinary containing PTX for sm_52, and SASS for sm_52 and sm_61).
Link -gencode=arch=compute_52,code=[sm_52,sm_61] -dlto
Note the -dlto option is required (but only at link time).
By looking at the generated SASS of the resulting binary, I was able to verify that link-time optimization was indeed occurring by checking that __device__ functions defined in .cu source files (rather than header files) were inlined.
After some testing, it appears that when using DLTO, you actually need to specify multiple -gencode options (i.e. one for each virtual arch / LTO intermediary arch pair); otherwise I got odd runtime errors.
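To illustrate only the shape of "one -gencode per architecture", here is a small sketch that builds such a flag list in CMake script mode. It is not the poster's exact option set. The architecture list and variable names are hypothetical, and the code=lto_<arch> form is taken from the nvcc error message quoted earlier in the thread.

```cmake
# Run with: cmake -P gencode_sketch.cmake   (the file name is hypothetical)

# Assumed example architectures; replace with the ones you target.
set(archs 52 61)

# Build one -gencode option per virtual arch / LTO intermediate pair.
set(dlto_compile_flags "")
foreach(arch IN LISTS archs)
  list(APPEND dlto_compile_flags
       "-gencode=arch=compute_${arch},code=lto_${arch}")
endforeach()

# Print the resulting list (CMake separates list items with semicolons).
message(STATUS "compile flags: ${dlto_compile_flags}")
```

A list built this way could be passed to target_compile_options for CUDA sources, with -dlto and the matching code=sm_<arch> entries supplied at link time, but that wiring is an assumption and has not been tested here.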
Since I can’t edit my existing post, the correct options (at least by my testing) are as follows: