Whilst debugging, I am seeing significant differences in execution time when stepping over a function that is inlined (forceinline), compared to stepping over the same function when it is not inlined. This is not expected, is it?
An inlined function may take a few minutes to complete, whilst the same function not inlined may take only a few seconds.
The function results are identical in both cases, so it cannot be that the functions themselves are incorrect.
This is reproducible (on my machine at least): within a function, do a simple sum scan over the warp block, and copy it a few times to get the function's instruction count up a bit. In the first run, inline the function in a test kernel; in the subsequent run, do not inline the same function.
Compute capability = 3.5
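
A minimal sketch of the reproduction described above, in case it helps anyone try it. This is not my exact code: the names `repeated_warp_scan`, `test_kernel`, `SCAN_QUALIFIER` and `USE_INLINE` are made up for illustration, the repeat count of 4 is arbitrary, and I am assuming a toolkit that provides `__shfl_up_sync` (CUDA 9 or later) while still supporting compute capability 3.5. Build once with `-DUSE_INLINE` and once without, then step over the call in the debugger in each build.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical switch: define USE_INLINE to force inlining,
// leave it undefined to forbid inlining.
#ifdef USE_INLINE
#define SCAN_QUALIFIER __forceinline__
#else
#define SCAN_QUALIFIER __noinline__
#endif

// Inclusive sum scan across the 32 lanes of a warp, repeated a few
// times only to raise the instruction count of the function.
__device__ SCAN_QUALIFIER int repeated_warp_scan(int value)
{
    const int lane = threadIdx.x & 31;

    #pragma unroll
    for (int repeat = 0; repeat < 4; ++repeat) {       // arbitrary repeat count
        #pragma unroll
        for (int offset = 1; offset < 32; offset <<= 1) {
            // Read the running sum from the lane 'offset' positions below.
            int neighbour = __shfl_up_sync(0xffffffffu, value, offset);
            if (lane >= offset) {
                value += neighbour;
            }
        }
    }
    return value;
}

// Test kernel: one warp, one call to step over in the debugger.
__global__ void test_kernel(const int* in, int* out)
{
    int v = in[threadIdx.x];
    v = repeated_warp_scan(v);   // set a breakpoint here and step over
    out[threadIdx.x] = v;
}

int main()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = 1;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    test_kernel<<<1, 32>>>(d_in, d_out);   // a single warp
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("last lane result: %d\n", h_out[31]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Both builds should print the same result, which is what I mean by the function results being identical; only the time taken to step over the call differs.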