What happens when no arch flags are passed by CMake

Normally I use the line below to pass -gencode flags to the compiler.

cuda_select_nvcc_arch_flags(ARCH_FLAGS Auto) 

However, when I want to activate link-time optimization, nvcc throws

nvcc fatal   : '-dlto' conflicts with '-gencode' to control what is generated; use 'code=lto_<arch>' with '-gencode' instead of '-dlto' to request lto intermediate

Now I’m wondering two things.

  1. What happens if I don’t set any kind of arch info manually anywhere? Does CMake or nvcc still automatically detect my GPU’s arch? I ask because I still get working executables and libraries.
  2. How can I activate LTO while using cuda_select_nvcc_arch_flags(ARCH_FLAGS Auto), or via any other concise and generic method?
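
Going by the error message, the only workaround I have come up with so far is an untested sketch: post-process the flags that cuda_select_nvcc_arch_flags generates so that code=sm_XX becomes code=lto_XX. The string replacement and the variable name LTO_ARCH_FLAGS below are my own guesses, not anything documented:

```cmake
# Untested sketch: rewrite the generated -gencode entries so they request
# LTO intermediates (code=lto_XX) instead of SASS (code=sm_XX), which is
# what the nvcc error message asks for when -dlto is also passed.
cuda_select_nvcc_arch_flags(ARCH_FLAGS Auto)
string(REPLACE "code=sm_" "code=lto_" LTO_ARCH_FLAGS "${ARCH_FLAGS}")
list(APPEND CUDA_NVCC_FLAGS ${LTO_ARCH_FLAGS})
```

I have not verified that this plays well with the rest of the FindCUDA machinery, so corrections are welcome.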

Every version of nvcc has a built-in default target architecture. One way to find out (other than reading the documentation) is to inspect the output from building with nvcc -v, i.e. a verbose build. For example, for the CUDA 12.3 toolchain this shows -D__CUDA_ARCH__=520, -arch compute_52, and --arch=sm_52 being passed to various components of the toolchain (remember that nvcc is just the driver program which invokes these components under the hood), so we can conclude with confidence that the default architecture target for CUDA 12.3 is compute capability 5.2.

If you specify -arch=native on the nvcc command line, it will iterate over all visible (see CUDA_VISIBLE_DEVICES) GPUs in your system and add code for each architecture found to the fat binary.
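
For example, a minimal invocation looks like this (the file names are placeholders):

```shell
# Build SASS for every GPU architecture found in this machine
nvcc -arch=native main.cu -o app

# Limit which GPUs are inspected via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 nvcc -arch=native main.cu -o app
```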

I consider questions about CMake off-topic here, as it is not a tool shipped or created by NVIDIA, and I neither use nor endorse it. CMake users among forum participants may be able to provide those details.


Yes, I can see that this is also the case for me. But even though my arch isn’t 5.2 (it is 8.6), the program runs smoothly. So I wonder what the exact effects of this architecture specification are.

If the CUDA runtime cannot find a binary image that matches the compute capability of the GPU present, it will look for suitable PTX and JIT-compile it. Depending on the amount of the device code that needs to be translated that could cause a noticeable delay. If neither a suitable binary image nor suitable PTX is available, kernel execution fails. This is described in the CUDA documentation.
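
To see what is actually embedded in an executable or library, you can inspect the fat binary with the cuobjdump utility that ships with the CUDA toolkit (the file name here is a placeholder):

```shell
# List the SASS (cubin) images embedded for each architecture
cuobjdump --list-elf app

# Dump any PTX that could be JIT-compiled at load time
cuobjdump --dump-ptx app
```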

It is a best practice with CUDA to build a fat binary that contains SASS (machine code) for all GPU architectures that need to be supported, plus PTX for the latest GPU architecture for forward compatibility (this code can be JIT compiled). If you do that manually, it might look like this:

-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_90,code=sm_90

nvcc also offers shortcuts in the form of the command-line switches -arch=all and -arch=all-major. See the compiler documentation.
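
As for the -dlto conflict at the top of the thread: going by the error message and the nvcc documentation, the separate-compilation LTO flow looks roughly like this (file names are placeholders, and only one architecture is sketched):

```shell
# Compile device code to an LTO intermediate (code=lto_86) instead of SASS,
# as the nvcc error message suggests
nvcc -dc -gencode arch=compute_86,code=lto_86 kernel.cu -o kernel.o

# Perform link-time optimization at the device-link step
nvcc -dlink -dlto -arch=sm_86 kernel.o -o device_link.o
```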