In addition, I used clock() to measure the clock cycles consumed by the kernel function's threads. I found that the threads span a duration of 22452 cycles, while the GT9800's GPU clock rate is 1.35 GHz.
22452 cycles / (1.35 × 10^9 Hz) ≈ 16.6 microseconds, which is close to what I got from profiling.
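For reference, here is a minimal sketch of that clock()-based measurement pattern. The launch configuration and the 1.35 GHz rate are just illustrative, and kernel_body() is a hypothetical stand-in for the real work being timed:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the actual kernel body under test.
__device__ void kernel_body() { /* ... work to be timed ... */ }

__global__ void timedKernel(clock_t *cycles)
{
    clock_t start = clock();   // per-multiprocessor cycle counter
    kernel_body();
    clock_t stop = clock();
    if (threadIdx.x == 0)      // one cycle count per block is enough
        cycles[blockIdx.x] = stop - start;
}

int main()
{
    clock_t *d_cycles, h_cycles;
    cudaMalloc(&d_cycles, sizeof(clock_t));
    timedKernel<<<1, 256>>>(d_cycles);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(clock_t), cudaMemcpyDeviceToHost);
    // Convert cycles to wall time using the shader clock rate (1.35 GHz here).
    printf("%ld cycles ~ %.1f us\n", (long)h_cycles,
           (double)h_cycles / 1.35e9 * 1e6);
    cudaFree(d_cycles);
    return 0;
}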
Now the question is which timing measurement can be trusted.
Sub-millisecond timing from the CPU side is always a big problem: it's simply hard to measure such short intervals accurately from the host.
The clock() intrinsic inside kernels is very accurate though, since it isn't measuring time at all but counting cycles directly. Usually this is fine, even preferred, if you're just benchmarking.
One gotcha: on older GPUs the clock() register is only 32 bits, which means it wraps around after roughly 4 seconds (2^32 cycles at ~1 GHz). That makes timing kernels that run longer than a second or so potentially annoying. I believe that on compute capability 2.0 (Fermi) there are both 32-bit and 64-bit counters, clock() and clock64().
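For longer-running kernels on compute 2.0+ hardware, the same pattern sketched above works with clock64(); the kernel body here is again just a placeholder:

// clock64() returns a 64-bit counter on compute capability 2.0+ devices,
// so wraparound is a non-issue even for very long-running kernels.
__global__ void timedKernel64(long long *cycles)
{
    long long start = clock64();
    // ... work under test ...
    long long stop = clock64();
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *cycles = stop - start;
}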