First venture into CUDA (1.1, Linux), so I am still naive. So far my code (a) does not crash, (b) gives correct results, and (c) is occasionally faster than the CPU. That aside, I was looking at the profiler log. The manual says that cputime includes the gputime of the kernel, but what else does it include? Preceding mallocs? Memcopies? More importantly, is there anything I could do to squeeze the “overhead”?
Curiously, the first call to “F3” (and “F4”) has more overhead than later calls, even though the data, block, and thread sizes are the same on those calls.
method=[ memcopy ] gputime=[ 3.776 ]
method=[ F1 ] gputime=[ 64.000 ] cputime=[ 124.000 ] occupancy=[ 1.000 ]
method=[ F2 ] gputime=[ 7.936 ] cputime=[ 64.000 ] occupancy=[ 1.000 ]
method=[ memcopy ] gputime=[ 2.944 ]
method=[ memcopy ] gputime=[ 3.712 ]
method=[ F1 ] gputime=[ 49.376 ] cputime=[ 100.000 ] occupancy=[ 1.000 ]
method=[ F2 ] gputime=[ 7.264 ] cputime=[ 52.000 ] occupancy=[ 1.000 ]
method=[ memcopy ] gputime=[ 2.816 ]
method=[ memcopy ] gputime=[ 4.480 ]
method=[ memcopy ] gputime=[ 3.872 ]
method=[ F3 ] gputime=[ 62.368 ] cputime=[ 113.000 ] occupancy=[ 0.667 ]
method=[ memcopy ] gputime=[ 38.272 ]
method=[ memcopy ] gputime=[ 3.488 ]
method=[ memcopy ] gputime=[ 3.808 ]
method=[ memcopy ] gputime=[ 3.680 ]
method=[ memcopy ] gputime=[ 2.880 ]
method=[ F4 ] gputime=[ 3.232 ] cputime=[ 52.000 ] occupancy=[ 1.000 ]
method=[ F3 ] gputime=[ 62.752 ] cputime=[ 84.000 ] occupancy=[ 0.667 ]
method=[ memcopy ] gputime=[ 37.760 ]
method=[ memcopy ] gputime=[ 3.360 ]
method=[ memcopy ] gputime=[ 3.072 ]
method=[ F4 ] gputime=[ 2.880 ] cputime=[ 26.000 ] occupancy=[ 1.000 ]
method=[ F3 ] gputime=[ 60.448 ] cputime=[ 84.000 ] occupancy=[ 0.667 ]
...
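To put numbers on the “overhead” I am asking about, here is a rough sketch that takes cputime minus gputime for each kernel row. It assumes that this difference is a fair proxy for launch overhead, which is only my reading of the manual's statement that cputime includes gputime. The names `parse_line` and `overheads` are my own helpers, not part of any CUDA tool, and the sample rows are copied from the log above.

```python
import re

# A few rows copied from the log above. memcopy rows carry no cputime field.
LOG = """\
method=[ memcopy ] gputime=[ 3.872 ]
method=[ F3 ] gputime=[ 62.368 ] cputime=[ 113.000 ] occupancy=[ 0.667 ]
method=[ F4 ] gputime=[ 3.232 ] cputime=[ 52.000 ] occupancy=[ 1.000 ]
method=[ F3 ] gputime=[ 62.752 ] cputime=[ 84.000 ] occupancy=[ 0.667 ]
method=[ F4 ] gputime=[ 2.880 ] cputime=[ 26.000 ] occupancy=[ 1.000 ]
"""

# Matches one "name=[ value ]" pair and strips the padding around the value.
FIELD = re.compile(r"(\w+)=\[\s*([^\]]*?)\s*\]")


def parse_line(line):
    """Turn one profiler log line into a dict of field name -> string value."""
    return dict(FIELD.findall(line))


def overheads(log_text):
    """Return (method, cputime - gputime) for every row that reports cputime.

    Assumption: the difference approximates per-launch overhead.
    """
    rows = []
    for line in log_text.splitlines():
        fields = parse_line(line)
        if "cputime" not in fields:
            continue  # memcopy rows only report gputime
        gap = float(fields["cputime"]) - float(fields["gputime"])
        rows.append((fields["method"], round(gap, 3)))
    return rows


if __name__ == "__main__":
    for method, gap in overheads(LOG):
        print(f"{method}: cputime - gputime = {gap:.3f}")
```

On these rows the gap for F3 drops from 50.632 on the first call to 21.248 on the second, and for F4 from 48.768 to 23.120, which is the first-call effect described above.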