The KISS method I often use to measure function overhead is really crude but is so easy that it can quickly tell you if a certain routine is a bottleneck or not.
The trick is to first benchmark your unchanged code, finding total runtime with a typical input as a base reference.
Now you go back to your source and DOUBLE the invocations of the device function in question. You can’t eliminate them (since that could change your program’s behavior) but you can do things like changing
x=myDeviceFunction(y, z);
to
x=myDeviceFunction(y, z);
x=(x+myDeviceFunction(y, z))/2;
The exact “doubling” depends on your device function, side effects, whether it returns an integer or float, etc.
Now you time the “doubled” version and you’ll get a SLOWER runtime since you’re doing more work. The difference in runtimes, divided by your initial runtime, is roughly the fraction of time that the device function in question uses, which lets you see whether it was a bottleneck and by how much.
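As a rough worked example of that arithmetic (the helper name overheadFraction and the timings below are made up for illustration, not taken from a real run):

```cpp
// Hypothetical helper: estimates the fraction of total runtime spent in the
// doubled function, given two timings of the same input in the same units.
double overheadFraction(double baseRuntime, double doubledRuntime)
{
    // The extra time in the doubled run approximates one full set of calls
    // to the function, so divide it by the unchanged runtime.
    return (doubledRuntime - baseRuntime) / baseRuntime;
}
```

So if the unchanged code took 100 ms and the doubled version took 130 ms, overheadFraction(100.0, 130.0) comes out at about 0.3, meaning roughly 30% of the runtime goes to that function. Treat it as an estimate, since the second call may not cost exactly what the first one does.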
This all works even if code is expanded and inlined, etc. (You do have to be careful that your “doubling” can’t be optimized away by the compiler, so make sure return values are used…)
Simple, crude, and pretty inelegant but it works pretty well and you can do it in 30 seconds. I usually get more intuition from such crude tests than I do from the profiler.
If your device function is called from many places, you can just make a wrapper… that is, rename the device function to “REALmyDeviceFunction” then make a “myDeviceFunction” that does the double invocation. All your original code doesn’t need to change since it’s still calling “myDeviceFunction”.
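A minimal sketch of such a wrapper, assuming a device function that takes two floats and returns a float. The body of REALmyDeviceFunction is only a placeholder, and the #ifndef guard is just an addition so the snippet also builds as plain C++ for a quick sanity check:

```cpp
// Under nvcc, __CUDACC__ is defined and this guard is skipped. With a plain
// host C++ compiler, __device__ expands to nothing so the sketch still builds.
#ifndef __CUDACC__
#define __device__
#endif

// The original routine, renamed. This body is a made-up stand-in.
__device__ float REALmyDeviceFunction(float y, float z)
{
    return y * z + 1.0f;
}

// Temporary wrapper under the old name: every existing call site now runs
// the real function twice without being edited.
__device__ float myDeviceFunction(float y, float z)
{
    float x = REALmyDeviceFunction(y, z);
    // Fold the second result into the return value so the extra call is
    // not trivially dead code. Averaging keeps the result unchanged here.
    x = (x + REALmyDeviceFunction(y, z)) / 2;
    return x;
}
```

A compiler that can prove the function has no side effects may still merge the two identical calls, so check that the runtime really goes up before trusting the number. Remove the wrapper again once you have your measurement.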