Why would putting __device__ functions in the same file as kernels make them faster?

I’m seeing a 10% slowdown when I move the device functions called by my kernels into a file separate from the kernels themselves.

Does this make any sense? Shouldn’t this be equivalent as far as the compiler is concerned?

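Because the compiler can inline functions that are in the same compilation unit. Inlining removes the call overhead and lets the compiler optimize the function body together with the calling kernel, so it definitely can make them run faster. If you put them in a separate compilation unit, the compiler cannot see the function body when it compiles the kernel, so the device function must be called via a normal call/return mechanism.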
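To make the difference concrete, here is a minimal sketch of the same-file case. The helper names (scale_add, scale_add_call), the kernels, and the host function run_saxpy_demo are all made up for illustration and are not from the original post. The second helper is marked __noinline__ only to mimic what happens when the compiler cannot see the body, so the two kernels can be compared (for example by timing them or by inspecting the generated code with cuobjdump -sass).

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper defined in the same file as the kernel that calls it.
// The compiler sees the body here and is free to inline it at the call site.
__device__ float scale_add(float a, float x, float y) { return a * x + y; }

// Same body, but inlining is suppressed to imitate a real call/return,
// similar to what a helper in another compilation unit would require.
__device__ __noinline__ float scale_add_call(float a, float x, float y) { return a * x + y; }

__global__ void saxpy_inlined(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = scale_add(a, x[i], y[i]);
}

__global__ void saxpy_called(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = scale_add_call(a, x[i], y[i]);
}

// Runs both kernels on identical input and returns the largest absolute
// difference between their outputs, or -1.0f if any CUDA call fails.
float run_saxpy_demo(int n) {
    std::vector<float> hx(n), hy(n, 1.0f), out1(n), out2(n);
    for (int i = 0; i < n; ++i) hx[i] = 0.5f * i;

    const size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy1 = nullptr, *dy2 = nullptr;
    bool ok = cudaMalloc(&dx, bytes) == cudaSuccess &&
              cudaMalloc(&dy1, bytes) == cudaSuccess &&
              cudaMalloc(&dy2, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy1, hy.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy2, hy.data(), bytes, cudaMemcpyHostToDevice);

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        saxpy_inlined<<<blocks, threads>>>(n, 2.0f, dx, dy1);
        saxpy_called<<<blocks, threads>>>(n, 2.0f, dx, dy2);

        // cudaMemcpy on the default stream waits for the kernels to finish.
        ok = cudaMemcpy(out1.data(), dy1, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(out2.data(), dy2, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dx); cudaFree(dy1); cudaFree(dy2);  // freeing nullptr is a no-op
    if (!ok) return -1.0f;

    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i) max_diff = fmaxf(max_diff, fabsf(out1[i] - out2[i]));
    return max_diff;
}
```

Calling run_saxpy_demo(1024) from main only checks that both variants compute the same thing; any speed difference would have to be measured separately, and for a helper this small it may be hard to see.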

Ah, of course. I remember that inline doesn’t really do anything in CUDA - is there a way to allow inlining without the functions being in the same file (other than #include’ing the other files)?

Not at present.
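For reference, the #include route mentioned in the question could look roughly like the sketch below. The file names (device_math.cuh, kernels.cu) and the names clamp01, clamp_kernel and run_clamp_demo are hypothetical; the two "files" are shown in one listing so it compiles as a single unit. The idea is simply that the helper's definition lives in a header, so every .cu file that includes it sees the body and the compiler can inline it there.

```cuda
#include <vector>
#include <cuda_runtime.h>

// ---- device_math.cuh (hypothetical header) ----
#ifndef DEVICE_MATH_CUH
#define DEVICE_MATH_CUH
// Defined in the header, so each including file gets the full body.
// "inline" avoids duplicate-definition problems if several files include it.
__device__ inline float clamp01(float v) { return fminf(fmaxf(v, 0.0f), 1.0f); }
#endif

// ---- kernels.cu (hypothetical) ----
// In a real project this file would start with: #include "device_math.cuh"
__global__ void clamp_kernel(int n, float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clamp01(data[i]);
}

// Clamps a small test array on the GPU and returns how many elements are
// still outside [0, 1] afterwards, or -1 if any CUDA call fails.
int run_clamp_demo() {
    const int n = 300;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = 0.01f * i - 1.0f;  // roughly -1 .. 2

    const size_t bytes = n * sizeof(float);
    float* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return -1;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    clamp_kernel<<<(n + 127) / 128, 128>>>(n, d);

    cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    int out_of_range = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] < 0.0f || h[i] > 1.0f) ++out_of_range;
    return out_of_range;
}
```

The trade-off is the usual one for header-defined functions: every file that includes the header recompiles the helper, in exchange for the compiler being able to inline it.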

Ok, thanks for the responses, all.