Multiple definition of __cudaRegisterLinkedBinary_* error with relocatable utilities

I am experimenting with relocatable device code and have created the following file structure:

  • rdc_utils.h/rdc_utils.cu: contain the declaration and definition of an (extern) __device__ function
  • gpu_exec_a.h/gpu_exec_a.cu: gpu_exec_a.cu contains a __global__ kernel that calls the function in rdc_utils. The header gpu_exec_a.h only exports a host function that launches the kernel.
  • gpu_exec_b.h/gpu_exec_b.cu: same as gpu_exec_a.h and gpu_exec_a.cu
  • main.cpp: main() calls the two host functions from gpu_exec_a.h and gpu_exec_b.h
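
A minimal sketch of this layout, written as a shell script that generates the files so the build steps can be tried end to end. The identifiers rdc_increment, kernel_a/kernel_b, and launch_a/launch_b are placeholders chosen for illustration, and the function bodies are assumptions; only the file names come from the description above.

```shell
#!/bin/sh
# Writes hypothetical minimal sources into the current directory.
set -e

# Shared device utility: declared in the header, defined in the .cu file.
cat > rdc_utils.h <<'EOF'
#pragma once
// Illustrative device function; the real utilities may differ.
extern __device__ int rdc_increment(int x);
EOF

cat > rdc_utils.cu <<'EOF'
#include "rdc_utils.h"

__device__ int rdc_increment(int x) { return x + 1; }
EOF

# gpu_exec_a and gpu_exec_b are identical apart from the suffix.
for s in a b; do
cat > gpu_exec_$s.h <<EOF
#pragma once
// Only the host-side launcher is exported.
void launch_$s();
EOF
cat > gpu_exec_$s.cu <<EOF
#include <cstdio>
#include "gpu_exec_$s.h"
#include "rdc_utils.h"

// Kernel that calls the device function defined in rdc_utils.cu.
__global__ void kernel_$s() { printf("$s: %d\\n", rdc_increment(1)); }

void launch_$s() {
    kernel_$s<<<1, 1>>>();
    cudaDeviceSynchronize();
}
EOF
done

# Host-only translation unit; sees nothing but the two launchers.
cat > main.cpp <<'EOF'
#include "gpu_exec_a.h"
#include "gpu_exec_b.h"

int main() {
    launch_a();
    launch_b();
    return 0;
}
EOF
```

Running the script in an empty directory produces the seven files listed above.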

The files are compiled as follows:

# --device-c step
nvcc --device-c rdc_utils.cu  -o rdc_utils.cu.o
nvcc --device-c gpu_exec_a.cu -o gpu_exec_a.cu.o
nvcc --device-c gpu_exec_b.cu -o gpu_exec_b.cu.o

# --device-link step
nvcc --device-link -o gpu_exec_a_dlink.o gpu_exec_a.cu.o rdc_utils.cu.o
nvcc --device-link -o gpu_exec_b_dlink.o gpu_exec_b.cu.o rdc_utils.cu.o

# --lib step
nvcc --lib -o gpu_exec_a.a gpu_exec_a.cu.o gpu_exec_a_dlink.o rdc_utils.cu.o
nvcc --lib -o gpu_exec_b.a gpu_exec_b.cu.o gpu_exec_b_dlink.o rdc_utils.cu.o

# host compile
g++ -c main.cpp -o main.o

# final linking
nvcc --link -o out main.o gpu_exec_a.a gpu_exec_b.a

The final linking step fails with the error:

nvcc --link -o out main.o gpu_exec_a.a gpu_exec_b.a
gpu_exec_a.a(gpu_exec_a_dlink.o): In function `__cudaRegisterLinkedBinary_20_rdc_utils_cu_cpp1_ii_5256946f':
link.stub:(.text+0x67): multiple definition of `__cudaRegisterLinkedBinary_20_rdc_utils_cu_cpp1_ii_5256946f'
gpu_exec_b.a(gpu_exec_b_dlink.o):link.stub:(.text+0x67): first defined here
/usr/bin/ld: error in /usr/local/cuda-11.2/bin/../targets/x86_64-linux/lib/libcudart_static.a(cudart_static.o)(.eh_frame); no .eh_frame_hdr table will be created.
collect2: error: ld returned 1 exit status

Using nm to inspect gpu_exec_a.a and gpu_exec_b.a clearly shows the symbol

0000000000000067 T __cudaRegisterLinkedBinary_20_rdc_utils_cu_cpp1_ii_5256946f

in both archives, so the error makes sense.

I can also compile all four files in a single nvcc command, and that produces a working binary. But building that way gives no means of making the build system more modular. I’d like to compile something like rdc_utils, a collection of device functions, once and reuse it in multiple translation units, each of which has a __global__ function that calls those device functions.

How does one do that?