Kernel invocation cputime overhead?

First venture into CUDA (1.1; Linux). Still naive. (a) It does not crash, (b) the results are correct, and (c) it is occasionally faster than the CPU. Those aside, I was looking at the profiler log. The manual says that cputime includes the gputime of the kernel, but what else does it include? Preceding mallocs or memcopies? More importantly, is there anything I could do to squeeze out this overhead?

Curiously, the first call to “F3” (and “F4”) has more overhead than later calls, even though the data size and the block/thread configuration are the same on every call.

method=[ memcopy ] gputime=[ 3.776 ]
method=[ F1 ] gputime=[ 64.000 ] cputime=[ 124.000 ] occupancy=[ 1.000 ]
method=[ F2 ] gputime=[ 7.936 ] cputime=[ 64.000 ] occupancy=[ 1.000 ]
method=[ memcopy ] gputime=[ 2.944 ]
method=[ memcopy ] gputime=[ 3.712 ]
method=[ F1 ] gputime=[ 49.376 ] cputime=[ 100.000 ] occupancy=[ 1.000 ]
method=[ F2 ] gputime=[ 7.264 ] cputime=[ 52.000 ] occupancy=[ 1.000 ]
method=[ memcopy ] gputime=[ 2.816 ]
method=[ memcopy ] gputime=[ 4.480 ]
method=[ memcopy ] gputime=[ 3.872 ]
method=[ F3 ] gputime=[ 62.368 ] cputime=[ 113.000 ] occupancy=[ 0.667 ]
method=[ memcopy ] gputime=[ 38.272 ]
method=[ memcopy ] gputime=[ 3.488 ]
method=[ memcopy ] gputime=[ 3.808 ]
method=[ memcopy ] gputime=[ 3.680 ]
method=[ memcopy ] gputime=[ 2.880 ]
method=[ F4 ] gputime=[ 3.232 ] cputime=[ 52.000 ] occupancy=[ 1.000 ]
method=[ F3 ] gputime=[ 62.752 ] cputime=[ 84.000 ] occupancy=[ 0.667 ]
method=[ memcopy ] gputime=[ 37.760 ]
method=[ memcopy ] gputime=[ 3.360 ]
method=[ memcopy ] gputime=[ 3.072 ]
method=[ F4 ] gputime=[ 2.880 ] cputime=[ 26.000 ] occupancy=[ 1.000 ]
method=[ F3 ] gputime=[ 60.448 ] cputime=[ 84.000 ] occupancy=[ 0.667 ]
...
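One way to cross-check what the profiler's cputime includes would be to time a launch directly from the host. A minimal sketch, assuming a CUDA 1.x-era toolkit on Linux; the dummy kernel and sizes are placeholders, not anything taken from the log above:

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for F1..F4; it does almost nothing.
__global__ void dummy(float *p)
{
    p[threadIdx.x] = 0.0f;
}

// Wall-clock time in microseconds.
static double usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    float *d;
    cudaMalloc((void **)&d, 256 * sizeof(float));

    for (int i = 0; i < 5; ++i) {
        double t0 = usec();
        dummy<<<1, 256>>>(d);        // the launch call returns as soon as the work is queued
        double t1 = usec();
        cudaThreadSynchronize();     // block until the kernel has actually finished
        double t2 = usec();
        printf("call %d: launch %.1f us, launch+run %.1f us\n",
               i, t1 - t0, t2 - t0);
    }

    cudaFree(d);
    return 0;
}

The first iteration often pays some extra one-time setup on top of the normal launch cost, which may be what shows up in the first F3/F4 calls above.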

Your overheads are typical. Presumably the cputime overhead covers setting up the argument list and the grid dimensions and passing them to the GPU over PCIe. I’ve noticed in my own testing that binding a texture before a kernel call adds another ~40 us of cputime, so everything associated with binding textures is also part of that.
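To make the texture point concrete: the bind is a host-side call issued before the launch, so its cost lands in cputime rather than gputime. A rough sketch using the legacy texture-reference API; the names here are made up:

#include <cuda_runtime.h>

// Legacy texture-reference API (CUDA 1.x era); texRef and scaleFromTex are made-up names.
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void scaleFromTex(float *out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * tex1Dfetch(texRef, i);   // read the input through the texture
}

void launchScale(float *d_in, float *d_out, int n)
{
    // Host-side bind before the launch: this is the part that adds to cputime.
    cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleFromTex<<<blocks, threads>>>(d_out, n, 2.0f);

    cudaUnbindTexture(texRef);
}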

This overhead is unfortunate, but it stays relatively constant as you increase the problem size on the GPU: even when the kernel takes milliseconds to complete, the cputime overhead is still only microseconds.
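Reusing the host-timer idea from the earlier sketch, one way to watch that fixed launch cost get amortized would be to time the same (made-up) kernel at increasing sizes after a warm-up launch:

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Made-up memory-bound kernel; the only point is that its runtime grows with n.
__global__ void scale(float *p, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] *= k;
}

static double usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    const int maxN = 1 << 22;
    float *d;
    cudaMalloc((void **)&d, maxN * sizeof(float));

    scale<<<1, 256>>>(d, 256, 2.0f);   // warm-up launch to absorb any one-time setup
    cudaThreadSynchronize();

    // The per-launch overhead stays roughly fixed, so its share of the total
    // shrinks as n (and the kernel's gputime) grows.
    for (int n = 1 << 12; n <= maxN; n <<= 2) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        double t0 = usec();
        scale<<<blocks, threads>>>(d, n, 2.0f);
        cudaThreadSynchronize();
        double t1 = usec();
        printf("n = %8d: %.1f us total\n", n, t1 - t0);
    }

    cudaFree(d);
    return 0;
}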