In addition, I used clock() to measure the clock cycles consumed by different threads of the kernel function. I found the threads span a duration of 22452 cycles, while the GT9800 GPU clock rate is 1.35 GHz.
22452 / (1.35 × 10^9) ≈ 16.6 microseconds, which is close to what I got by profiling.
Now the question is which timing measurement can be trusted.
Sub-millisecond timing on the CPU is always a big problem… it really is hard to measure such short intervals accurately.
The clock() intrinsic inside kernels is very accurate, though, since it's not measuring a time, it's measuring a count of cycles. Usually that's fine, even preferred, if you're just benchmarking.
One gotcha: on older GPUs the clock() register is only 32 bits… which means it wraps around after 4 seconds or so. That makes timing kernels that run for more than a second potentially annoying. I believe on Fermi (compute capability 2.0) there are both 32-bit and 64-bit clocks (clock() and clock64()).
It depends on your OS. On Windows there's the newer QueryPerformanceCounter(), which certainly won't be accurate below, say, 100 µs, but it is better than the other time queries.
On OS X I remember having huge issues getting better than 1/60-second resolution!
Timing on CPUs is always complicated by OS abstractions, especially scheduling. A real-time embedded OS tends to be honest and give you whatever the hardware can report.