Significant difference in execution time when stepping inlined functions

Hello,

Whilst debugging, I am seeing significant differences in execution time when stepping over inlined functions (__forceinline__) compared with the same functions not inlined. This is not expected, is it?

An inlined function may take a few minutes to step over, whilst the same function not inlined takes only a few seconds.

The function results are identical in both cases, so the functions themselves cannot be at fault.

This is reproducible (on my machine at least): within a function, do a simple sum scan over the warp block; copy the scan a few times to push the function's instruction count up a bit; in the first run, inline the function in a test kernel; in the second run, do not inline the same function.

Compute capability = 3.5

Thanks for reporting this issue. Are you testing this with the CUDA 5.5 toolkit, or CUDA 6.0? In CUDA 6.0, single-stepping performance optimizations were introduced, which are enabled by default and should help accelerate this use case (this can be toggled with the “set cuda single_stepping_optimizations [on|off]” command in the 6.0 version of cuda-gdb).

This also happens with CUDA 6.

Thanks for checking on that.

We have identified the performance issue, and are working to resolve it.

The same issue can also be reproduced on the host with gdb when stepping over inline subroutines (using always_inline) that contain many instructions or large loops. Note that the slowdown does not occur when your application is running freely; it is limited to stepping over the inline routine in the debugger. Stepping over non-inline routines is much faster, because the debugger can simply resume execution until the call returns.
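For anyone who wants to try the host-side case, here is a minimal sketch of what such a test might look like. It is an illustration only, not the code from the original report: the names scan_inlined, scan_outlined, compare_variants, kBlock and kPasses are hypothetical, the always_inline/noinline attributes assume GCC or Clang, and the pass count is an arbitrary knob to inflate the amount of work.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlock = 32;  // 32 values, mirroring one warp (assumption)
constexpr int kPasses = 1000;       // hypothetical knob: more passes, more work to step over

// Inlined variant: the attribute asks GCC/Clang to inline even in an unoptimized build.
__attribute__((always_inline)) inline int scan_inlined(const int* in) {
    int work[kBlock] = {};
    for (int pass = 0; pass < kPasses; ++pass) {
        work[0] = in[0];
        for (std::size_t i = 1; i < kBlock; ++i) {
            work[i] = work[i - 1] + in[i];  // inclusive sum scan
        }
    }
    return work[kBlock - 1];  // total of all inputs
}

// Same body, but kept as a real call so the debugger can run until it returns.
__attribute__((noinline)) int scan_outlined(const int* in) {
    int work[kBlock] = {};
    for (int pass = 0; pass < kPasses; ++pass) {
        work[0] = in[0];
        for (std::size_t i = 1; i < kBlock; ++i) {
            work[i] = work[i - 1] + in[i];  // inclusive sum scan
        }
    }
    return work[kBlock - 1];
}

// Driver: both variants must agree; only the stepping time should differ.
bool compare_variants() {
    int in[kBlock];
    for (std::size_t i = 0; i < kBlock; ++i) {
        in[i] = 1;
    }
    int a = scan_inlined(in);   // breakpoint here, then "next"
    int b = scan_outlined(in);  // "next" again and compare how long each takes
    assert(a == b);
    return a == b && a == static_cast<int>(kBlock);
}
```

To use it, call compare_variants() from main, build with debug information and no optimization (for example g++ -g -O0), stop at the first call, and issue "next" on each of the two lines. Raising kPasses should make any difference between the two steps easier to see.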

Here are some workarounds that may assist you when needing to step over inlined subroutines with 6.0:

- Use the "until [line]" command. Specifying the next line to stop at gives the same result as stepping over the inlined function call, but is faster.
- Avoid using the __forceinline__ keyword (as you have discovered) on functions that have many instructions (large loops, etc.).

I will keep you updated as we work on this, and apologize for any inconvenience.