I am experimenting with relocatable device code and created the following file structure:
- rdc_utils.h/rdc_utils.cu: contain the declaration and definition of an (extern) __device__ function.
- gpu_exec_a.cu: contains a __global__ kernel that calls the function in rdc_utils. The header gpu_exec_a.h only exports a host function that launches the kernel.
- gpu_exec_b.h/gpu_exec_b.cu: same as gpu_exec_a.h and gpu_exec_a.cu.
- main.cpp: main() calls the two host functions from gpu_exec_a.h and gpu_exec_b.h.
The files are compiled as follows:
# --device-c step
nvcc --device-c rdc_utils.cu -o rdc_utils.cu.o
nvcc --device-c gpu_exec_a.cu -o gpu_exec_a.cu.o
nvcc --device-c gpu_exec_b.cu -o gpu_exec_b.cu.o
# --device-link step
nvcc --device-link -o gpu_exec_a_dlink.o gpu_exec_a.cu.o rdc_utils.cu.o
nvcc --device-link -o gpu_exec_b_dlink.o gpu_exec_b.cu.o rdc_utils.cu.o
# --lib step
nvcc --lib -o gpu_exec_a.a gpu_exec_a.cu.o gpu_exec_a_dlink.o rdc_utils.cu.o
nvcc --lib -o gpu_exec_b.a gpu_exec_b.cu.o gpu_exec_b_dlink.o rdc_utils.cu.o
# host compile
g++ -c main.cpp -o main.o
# final linking
nvcc --link -o out main.o gpu_exec_a.a gpu_exec_b.a
The final linking step fails with the error:
nvcc --link -o out main.o gpu_exec_a.a gpu_exec_b.a
gpu_exec_a.a(gpu_exec_a_dlink.o): In function `__cudaRegisterLinkedBinary_20_rdc_utils_cu_cpp1_ii_5256946f':
link.stub:(.text+0x67): multiple definition of `__cudaRegisterLinkedBinary_20_rdc_utils_cu_cpp1_ii_5256946f'
gpu_exec_b.a(gpu_exec_b_dlink.o):link.stub:(.text+0x67): first defined here
/usr/bin/ld: error in /usr/local/cuda-11.2/bin/../targets/x86_64-linux/lib/libcudart_static.a(cudart_static.o)(.eh_frame); no .eh_frame_hdr table will be created.
collect2: error: ld returned 1 exit status
Using nm to inspect gpu_exec_a.a and gpu_exec_b.a clearly shows the symbol
0000000000000067 T __cudaRegisterLinkedBinary_20_rdc_utils_cu_cpp1_ii_5256946f
in both archives, so the error makes sense.
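The nm check described above can be reproduced with something along these lines. The grep filter is only an assumption about how to narrow the output; plain nm on each archive shows the same symbol among many others.

```shell
# List the symbols in each archive and keep only the device-link
# registration stubs generated by nvcc --device-link.
nm gpu_exec_a.a | grep __cudaRegisterLinkedBinary
nm gpu_exec_b.a | grep __cudaRegisterLinkedBinary
```

If the same __cudaRegisterLinkedBinary_..._rdc_utils_... name appears as a defined (T) symbol in both listings, the host linker sees two definitions of it.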
I can also compile all four files in a single nvcc command, which produces a working binary, but that approach does not let the build system stay modular. I’d like to compile something like rdc_utils, a collection of device functions, once and reuse it from multiple translation units whose __global__ functions call the device functions in it.
How does one do that?
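For comparison, the working single-command build mentioned above would look roughly like the following sketch. The exact command is not shown in this post, so the -rdc=true flag and the output name are assumptions.

```shell
# Compile and link everything in one step. -rdc=true enables relocatable
# device code so the kernels in gpu_exec_a.cu and gpu_exec_b.cu can call
# the __device__ function defined in rdc_utils.cu. nvcc performs a single
# implicit device link here, so rdc_utils is registered only once.
nvcc -rdc=true -o out main.cpp rdc_utils.cu gpu_exec_a.cu gpu_exec_b.cu
```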