Performance loss when implementing with several .cu files

I developed a CUDA project with several .c and .cu files.
What I need to know is why a slowdown occurs when one device function calls another device function defined in a different .cu file.
This also happens between a .c file and a .cu file.
What is the problem, and how can I avoid it?

And how do you measure the speed (loss)…?

Device code linking across compilation units may result in device functions that are not inlined. Inlining of device functions (or the lack thereof) can have a significant performance impact. Just speculation.

Yes, txbob, I thought it was related to inlining, so I tried to inline the functions with several qualifiers, like below.
__forceinline__ __device__ void func1();
__inline __device__ void func1();
But that didn’t fix the problem, so I moved the called function into the .cu file that contains the caller.
How can I inline a device function that lives in a separate .cu file?

Hi jimmy, I measured the elapsed time with the clock() function.

Inlining functions that reside in separately compiled compilation units would require inlining to occur at link time. To the best of my knowledge, the CUDA device linker does not currently support this functionality. For now, if function inlining is needed for performance reasons, those functions must be part of the same compilation unit, e.g. through use of #include’d files.
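A minimal sketch of the "same compilation unit" approach described above, shown as one file for brevity. The names scale_and_add, apply, run_apply_demo and the header name device_helpers.cuh are hypothetical and not taken from the original poster's project. The idea is that the device function is defined (body included) in a header that every .cu file #includes, so the compiler sees the body at each call site and is able to inline it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// ---- Would live in a shared header, e.g. device_helpers.cuh (hypothetical name). ----
// The function is DEFINED here, not just declared, so any .cu file that
// #includes the header has the body available for inlining.
__forceinline__ __device__ float scale_and_add(float x, float a, float b)
{
    return a * x + b;
}

// ---- Would live in a .cu file that does: #include "device_helpers.cuh" ----
__global__ void apply(float *data, int n, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale_and_add(data[i], a, b);  // body is visible in this compilation unit
}

// Host helper: runs the kernel on a tiny array and checks the result.
// Returns 0 on success, nonzero on a CUDA error or a wrong value.
int run_apply_demo()
{
    const int n = 8;
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess)
        return -1;
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    apply<<<1, n>>>(dev, n, 2.0f, 1.0f);

    // This blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return -1;

    for (int i = 0; i < n; ++i)
        if (host[i] != 2.0f * (float)i + 1.0f)
            return 1;
    return 0;
}

int main()
{
    int rc = run_apply_demo();
    printf("run_apply_demo returned %d\n", rc);
    return rc;
}
```

With this layout no relocatable device code is needed for the helper, since the call never crosses a compilation-unit boundary. Whether the compiler actually inlines a given function still depends on the compiler, so it is worth comparing timings for both layouts.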

If you can demonstrate a significant difference in performance from the lack of inlining when using separate compilation, you may want to consider filing an RFE (request for enhancement) with NVIDIA, attaching your code as a relevant use case.

Side remark: Unless your kernels are extremely long-running, clock() likely has too low a resolution to provide an accurate performance assessment. A better choice could be gettimeofday() [or an equivalent Windows function, if on Windows].
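A rough sketch of host-side timing with gettimeofday(), assuming a POSIX system. The kernel busy_kernel and the helpers wallclock_seconds and time_kernel_seconds are hypothetical placeholders, not code from this thread. Because kernel launches are asynchronous with respect to the host, the sketch calls cudaDeviceSynchronize() before reading the timer on either side of the measured launch.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock time in seconds from gettimeofday() (microsecond granularity).
static double wallclock_seconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

// Hypothetical placeholder kernel that does some arithmetic per element.
__global__ void busy_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = (float)i;
        for (int k = 0; k < 1000; ++k)
            v = v * 1.0001f + 0.5f;
        out[i] = v;
    }
}

// Returns the elapsed wall-clock time of one kernel launch in seconds,
// or -1.0 if the device allocation fails.
double time_kernel_seconds(int n)
{
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess)
        return -1.0;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch (not timed) so one-time startup costs are excluded.
    busy_kernel<<<blocks, threads>>>(dev, n);
    cudaDeviceSynchronize();

    double start = wallclock_seconds();
    busy_kernel<<<blocks, threads>>>(dev, n);
    cudaDeviceSynchronize();  // wait for the kernel to finish before stopping the timer
    double stop = wallclock_seconds();

    cudaFree(dev);
    return stop - start;
}

int main()
{
    printf("kernel time: %.6f s\n", time_kernel_seconds(1 << 20));
    return 0;
}
```

For short kernels, timing many launches in a loop and dividing by the launch count gives a steadier number than a single measurement.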

Oh, nice answer. Thanks a lot, njuffa.