The CUDA program generated using nvRTC and nvJitLink produces incorrect results

My program consists of two parts: one part contains some fixed functions stored in static.cu, and the other part is generated dynamically based on the task. The generated content is a __global__ function that needs to call functions defined in static.cu. To achieve this, I compile static.cu offline, then generate dyn.cu at run time according to the task requirements and compile it with nvRTC. Finally, I use nvJitLink to link the two parts into a cubin. However, the calculation results are incorrect, and I'm not sure whether my compilation and linking steps are correct. Here are the specific steps I followed:
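For illustration only, a minimal hypothetical pair of files matching this layout could look like the following (the function names and bodies are made up, not my real code):

// static.cu (fixed part, compiled offline) -- hypothetical example
__device__ float scale(float x) { return 2.0f * x; }

// dyn.cu (generated at run time) -- hypothetical example
__device__ float scale(float x);                    // declares the function defined in static.cu
extern "C" __global__ void my_kernel(float* data) { // generated entry point; extern "C" keeps the name unmangled for host-side lookup
    data[threadIdx.x] = scale(data[threadIdx.x]);
}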

  1. Compile the static.cu file offline into an object file:

nvcc -gencode=arch=compute_52,code=[compute_52,lto_52] -O3 -std=c++17 -rdc=true -Xcompiler -fPIC -MD -MT static.o -o static.o static.cu

  2. Compile the dynamically generated dyn.cu file using nvRTC:
nvrtcProgram prog;
nvrtcCreateProgram(&prog, content_of_dyn_cu, nullptr, 0, nullptr, nullptr); // content_of_dyn_cu is a const char* holding the generated source
// Call nvrtcCompileProgram with options "-dlto", "-rdc=true", "--device-int128", "-arch=compute_52"
// Call nvrtcGetLTOIR (after nvrtcGetLTOIRSize) to get the LTO IR of dyn.cu
  3. Use nvJitLink to link the offline-compiled static.o with the LTO IR of the dynamically compiled dyn.cu:
// Call nvJitLinkCreate with options "-lto", "-arch=sm_86"
// Call nvJitLinkAddData to add the LTO IR of dyn.cu and the contents of static.o
// Call nvJitLinkComplete
  4. Call nvJitLinkGetLinkedCubin (after nvJitLinkGetLinkedCubinSize) to get the cubin. A consolidated code sketch of steps 2-4 is shown right after this list.
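To make the flow concrete, here is roughly what steps 2-4 look like in code. It is only a sketch: return-code checks are abbreviated, content_of_dyn_cu is the generated source string from step 2, the arch options are the ones from my original attempt, and static.o is added via nvJitLinkAddFile here only to keep the snippet short (my code reads the file and passes it through nvJitLinkAddData).

#include <vector>
#include <nvrtc.h>
#include <nvJitLink.h>

// Step 2: compile dyn.cu to LTO IR with nvRTC
nvrtcProgram prog;
nvrtcCreateProgram(&prog, content_of_dyn_cu, "dyn.cu", 0, nullptr, nullptr);
const char* nvrtc_opts[] = {"-dlto", "-rdc=true", "--device-int128", "-arch=compute_52"};
if (nvrtcCompileProgram(prog, 4, nvrtc_opts) != NVRTC_SUCCESS) {
    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::vector<char> log(log_size);
    nvrtcGetProgramLog(prog, log.data());   // inspect the compile log on failure
}
size_t ltoir_size = 0;
nvrtcGetLTOIRSize(prog, &ltoir_size);
std::vector<char> ltoir(ltoir_size);
nvrtcGetLTOIR(prog, ltoir.data());          // LTO IR of dyn.cu

// Step 3: link the LTO IR with static.o using nvJitLink
const char* link_opts[] = {"-lto", "-arch=sm_86"};
nvJitLinkHandle handle;
nvJitLinkCreate(&handle, 2, link_opts);
nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR, ltoir.data(), ltoir_size, "dyn");
nvJitLinkAddFile(handle, NVJITLINK_INPUT_OBJECT, "static.o");
nvJitLinkComplete(handle);

// Step 4: retrieve the linked cubin
size_t cubin_size = 0;
nvJitLinkGetLinkedCubinSize(handle, &cubin_size);
std::vector<char> cubin(cubin_size);
nvJitLinkGetLinkedCubin(handle, cubin.data());

nvJitLinkDestroy(&handle);
nvrtcDestroyProgram(&prog);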

The program produced by the above steps yields incorrect results. However, if both parts are compiled and linked entirely with nvcc, the calculation results are correct. Below are the commands used for that fully offline build.

nvcc -dc -gencode=arch=compute_52,code=[compute_52,lto_52] -std=c++17 -O3 -o static.o static.cu
nvcc -dc -gencode=arch=compute_52,code=[compute_52,lto_52] -std=c++17 -O3 -o dyn.o dyn.cu
nvcc -cubin -dlink -arch=compute_52 -code=sm_86 -o final.cubin dyn.o static.o 
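For context, in both builds the resulting cubin is consumed at run time through the CUDA driver API (in my case from Rust via a wrapper library). A minimal C sketch of that step, reusing the placeholder kernel name my_kernel from the hypothetical example above, might be:

#include <cuda.h>

// Load the linked cubin and launch the generated kernel (sketch; error checks omitted)
CUmodule module;
cuModuleLoadData(&module, cubin.data());            // cubin bytes from nvJitLinkGetLinkedCubin (or final.cubin read from disk)
CUfunction kernel;
cuModuleGetFunction(&kernel, module, "my_kernel");  // hypothetical kernel name
CUdeviceptr d_data = 0;                             // stands in for the real device buffer
void* args[] = { &d_data };
cuLaunchKernel(kernel, 1, 1, 1, 32, 1, 1, 0, nullptr, args, nullptr);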

Here are my questions:

  1. When performing offline compilation using only nvcc, are the compilation options correct (i.e., is LTO enabled)?
  2. What are the differences between a fully offline build using only nvcc and the semi-offline, semi-runtime build using nvcc, nvRTC, and nvJitLink? Why does the fully offline nvcc build produce correct results, while the mixed build compiles and links without errors but computes incorrect results?

CUDA: Cuda compilation tools, release 12.6, V12.6.20
OS: Ubuntu 20.04.5 LTS
GPU: GeForce RTX 3080

Why the mix of 52 and 86?
For the code that produces incorrect results, did you run it with compute-sanitizer?

My initial goal was to generate PTX code so that the program could run on different GPUs, which is why I specified the virtual architecture compute_52. Later, since nvcc did not seem able to link and emit PTX, I unified the comparison by generating cubin files, and for that I specified the real hardware architecture sm_86. I have since revised the code so that the virtual architecture used during compilation is compute_86, but the result is still incorrect. I also tested with compute-sanitizer and no errors were reported. Below is the full command I used for that test:
compute-sanitizer my_program
To add: the program itself is a Rust program that calls a lower-level library developed with CUDA. I'm not sure whether compute-sanitizer can still detect errors in this setup.
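For completeness, these are the sanitizer invocations I can still try (as far as I understand, compute-sanitizer hooks the CUDA calls made by the underlying library, so the Rust host code should not prevent it from working, but please correct me if that is wrong):

compute-sanitizer --tool memcheck  my_program   # default tool: invalid memory accesses
compute-sanitizer --tool racecheck my_program   # shared-memory data races
compute-sanitizer --tool initcheck my_program   # reads of uninitialized device memory
compute-sanitizer --tool synccheck my_program   # invalid synchronization usage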

What is the relationship between nvcc, nvRTC, and nvJitLink? I know that during the nvcc compilation process, other tools provided by the CUDA toolkit (such as cicc and ptxas) are invoked to complete the full compilation and linking. Do those tools also use nvRTC and nvJitLink for compilation and linking? Or is there some common library implementing the compilation and linking logic that those tools, as well as nvRTC and nvJitLink, all build on?

nvrtc (and nvjitlink) need to be able to work with only a GPU driver install, not a CUDA toolkit install. So they do not depend on tools like ptxas or cicc, nor on any libraries other than what you link in and what the GPU driver provides (libcuda.so). You can always use ldd to check what dynamic library dependencies there are. You might also wish to begin debugging the version that produces incorrect results; that may eventually allow you to create a simple, shareable reproducer.
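For example (the paths are only illustrative and depend on where the toolkit libraries are installed on your system):

ldd /usr/local/cuda/lib64/libnvrtc.so.12
ldd /usr/local/cuda/lib64/libnvJitLink.so.12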