I would like to know more about the consequences of compiling with -rdc=true.
From what I understand, the only cost comes from the compiler not being able to optimize function calls across compilation units; that is, it cannot inline a small device function defined in another compilation unit.
I would like to compile with rdc so that I can share constant memory across files (using extern) and call device functions defined in other compilation units of my application.
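To make the intended setup concrete, here is a minimal sketch of what sharing constant memory and a device function across files could look like. The file names (constants.cu, kernels.cu) and the symbols kScale, scale_value, scale_kernel and run_scale_demo are hypothetical, made up for illustration; they are not taken from my actual application. Both parts are shown in one listing, split by comments.

```cuda
// Hypothetical two-file layout, shown in one listing. Split at the marker and
// build with relocatable device code enabled, for example:
//   nvcc -rdc=true constants.cu kernels.cu -o app

// ---- constants.cu (hypothetical) ----
#include <cuda_runtime.h>

// Definition of the constant-memory symbol shared across files.
__constant__ float kScale = 2.0f;

// Device function defined here and called from another compilation unit.
__device__ float scale_value(float x) { return kScale * x; }

// ---- kernels.cu (hypothetical) ----
#include <cstdio>
#include <cuda_runtime.h>

// Declarations only; the definitions live in constants.cu.
// Resolving them at device link time is what requires -rdc=true.
extern __constant__ float kScale;
extern __device__ float scale_value(float x);

__global__ void scale_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale_value(in[i]);  // cross-unit device call
}

// Host helper: runs the kernel on a single value and returns the result.
float run_scale_demo(float input) {
    float *d_in = nullptr, *d_out = nullptr;
    float result = 0.0f;
    cudaMalloc(&d_in, sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, &input, sizeof(float), cudaMemcpyHostToDevice);
    scale_kernel<<<1, 32>>>(d_in, d_out, 1);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    printf("scaled: %f\n", run_scale_demo(3.0f));
    return 0;
}
```

Error checking on the CUDA runtime calls is left out to keep the sketch short.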
My application consists of many .cu files which are currently independent of each other (no external constant memory and no device functions shared between files), so rdc is not needed and I can compile each one separately using only “nvcc -c”.
When I compile the same application with “nvcc -dc” (which is equivalent to -rdc=true -c), performance drops noticeably in kernels that, as I said before, only use device functions defined in the same compilation unit.
In this case I would expect the compiler to still inline the same device functions as before (and it appears to be doing so), so I do not understand what else rdc is doing to harm performance.
LONG STORY SHORT
Besides losing function call optimization across compilation units, is there any other performance consequence of using relocatable device code?