I’m seeing a 10% slowdown when I put device functions called by my kernels in a file separate from the one the kernels themselves are in.
Does this make any sense? Shouldn’t this be equivalent as far as the compiler is concerned?
Yes, it makes sense. The compiler can inline functions that are in the same compilation unit, and this definitely can make them run faster. If you put them in a separate compilation unit, the compiler cannot see the function body at the call site, so the device function must be called via a normal call/return mechanism.
Ah, of course. I remember that inline doesn’t really do anything in CUDA - is there a way to allow inlining when the functions are not in the same file as the kernels (other than #include’ing the other files)?
Not at present.
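For anyone reading later, here is a rough sketch of the #include approach mentioned above. The device function is defined (not just declared) in what would be a shared header, so each .cu file that includes it has the body in its own compilation unit and the compiler is free to inline it. The file names in the comments and the names scale_add, saxpy_kernel and run_saxpy_demo are made up for illustration, and both "files" are shown in one listing so it compiles on its own. Whether the call is actually inlined is still up to the compiler.

```cuda
#include <cuda_runtime.h>

// ---- would live in a shared header, e.g. device_funcs.cuh (hypothetical name) ----
// The full definition is here, so every .cu file that includes the header
// sees the body and the compiler can inline the call.
__device__ __forceinline__ float scale_add(float a, float x, float y)
{
    return a * x + y;
}

// ---- would live in kernels.cu (hypothetical name), which #includes the header ----
__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = scale_add(a, x[i], y[i]);  // candidate for inlining
}

// Host helper (illustrative): runs the kernel on small arrays and returns y[0].
// Error checking is omitted to keep the sketch short.
float run_saxpy_demo()
{
    const int n = 256;
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) {
        hx[i] = 1.0f;
        hy[i] = 3.0f;
    }

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy_kernel<<<(n + 127) / 128, 128>>>(n, 2.0f, dx, dy);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return hy[0];
}
```

Calling run_saxpy_demo() from main is enough to exercise it. The tradeoff of this layout is that every .cu file including the header recompiles the device function, which is the price of keeping the body visible to the compiler.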
Ok, thanks for the responses, all.