Using dlink-time-opt together with gencode in CMake

I am trying to use the new link-time optimization flag -dlto, which was added in CUDA 11 (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#optimization-of-separate-compilation), within a CMake project.

Setting the following works for me, and the performance is the same as when all of the critical algorithms are defined in the header:

set(CMAKE_CUDA_FLAGS "-use_fast_math -dlto -arch=sm_70")

However, I need to be able to compile the code for several architectures, which is why I need to use -gencode. What is the correct syntax for doing so?

I tried several things:
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=sm_75 -dlto")
results in an error when CMake processes the file:
nvcc fatal : '-dlto' conflicts with '-gencode' to control what is generated; use 'code=lto_<arch>' with '-gencode' instead of '-dlto' to request lto intermediate

Accordingly, I tried to use
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_75,code=lto_75")

which results in the error
nvlink fatal : Link target of 'lto_75' is virtual target that is not JIT-able; use 'sm_' target instead

I then also tried
set(CMAKE_CUDA_FLAGS "-use_fast_math --relocatable-device-code=true -gencode arch=compute_70,code=[lto_75,sm_75]")
which compiles, but runs with the same bad performance as when omitting the -dlto flag.