time measurement discrepancy timer, clock(), profiling

I used two different approaches to measure the time duration of a kernel function. They yield quite different results:
Method 1: use cuda timer

unsigned int timer = 0;
cutilCheckError( cutCreateTimer( &timer));

// Timer starts
cudaThreadSynchronize();
cutilCheckError( cutStartTimer( timer));

/* function here */

cudaThreadSynchronize();
cutilCheckError( cutStopTimer( timer));
printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer));
cutilCheckError( cutDeleteTimer( timer));	

The result is 0.29 ms. By the way, the timer reports its value in milliseconds, correct?

Method 2: use CUDA profiling
Set environment variable
CUDA_PROFILE=1
CUDA_PROFILE_CSV=1
CUDA_PROFILE_CONFIG=./CUDA_PROFILE_CONFIG

For the kernel function,
gputime=23.136 microseconds
cputime=37.318 microseconds

I am wondering why results of these two methods are so different.
Thanks a lot!

I also tried the cudaEvent approach suggested by
http://openvidia.sourceforge.net/index.php/Tips_and_Tricks

The code is:
cudaEvent_t start,stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );
float et;

cudaEventRecord(start,0);

cholRpKernel4t<<<NRP / CHOL_SIZE, CHOL_SIZE>>>(d_a, d_b, d_x, dtimer);

cudaEventRecord(stop,0 );
cudaEventSynchronize(stop);
cudaEventElapsedTime( &et, start, stop );

printf("elapsed time: %f milliseconds (ms) \n", et);

The result is 0.095 ms.

In addition, I used clock() to measure the clock cycles consumed by different threads of the kernel function. Taken together, the threads span 22452 cycles, and the GT9800 GPU clock rate is 1.35 GHz.
22452 / (1.35 * 10^9) s = 16.6 microseconds, which is close to what I got by profiling.
Now the question is which timing measurement can be trusted.
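
To make the cycle-to-time arithmetic explicit, here is a minimal host-side sketch. The helper name cycles_to_microseconds is made up for illustration, and it assumes the clock rate passed in is the rate that clock() actually counts at (1.35 GHz is simply the figure quoted above).

```c
#include <stdint.h>

/* Convert a cycle count read from the in-kernel clock() into microseconds.
   clock_hz is the counter's clock rate in Hz (an assumption supplied by
   the caller, e.g. 1.35e9 for the figure quoted in the post). */
double cycles_to_microseconds(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz * 1e6;
}
```

Calling cycles_to_microseconds(22452, 1.35e9) gives roughly 16.6, which matches the hand calculation.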

Sub-millisecond timing on the CPU is always a big problem… really, it’s just hard to accurately measure such short intervals.
The clock() intrinsic inside kernels is very accurate though, since it’s not measuring a time, it’s measuring a count. Usually this is fine, even preferred, if you’re just benchmarking.

One gotcha: on older GPUs the clock() register is only 32 bits… which means it wraps around after a few seconds (2^32 cycles is about 3 seconds at 1.35 GHz). That makes timing kernels of more than a second potentially annoying. I believe in Fermi compute 2.0 there’s both a 32 and 64 bit clock.
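
A rough sketch of how the wrap-around can be handled on the host side, assuming the two clock() readings have been copied back and stored as unsigned 32-bit values. Both function names are hypothetical, not part of CUDA.

```c
#include <stdint.h>

/* Cycles elapsed between two 32-bit counter readings.
   Unsigned subtraction is modulo 2^32, so a single wrap-around between
   start and end still yields the right count; two or more wraps do not. */
uint32_t elapsed_cycles_u32(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Seconds until a 32-bit cycle counter wraps at the given clock rate. */
double wrap_period_seconds(double clock_hz)
{
    return 4294967296.0 / clock_hz;   /* 2^32 cycles */
}
```

For example, wrap_period_seconds(1.35e9) comes out a little above 3 seconds, so any interval longer than that cannot be recovered from two 32-bit readings alone.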

Thank you for the reply!

For sub-millisecond timing, how well do other approaches behave, such as the profiling method (set CUDA_PROFILE=1) and the GPU event timer (cudaEventCreate)?

It depends on your OS. On Windows there’s the newer QueryPerformanceCounter(), which certainly won’t be accurate below, say, 100 us, but it is still better than the other time queries.

In OSX I remember having huge issues getting better than 1/60 second resolution!
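
One way to see what resolution a host timer really delivers is to spin until its reading changes and record the smallest step. The sketch below is only an illustration under assumptions: it uses the C11 timespec_get() wall clock for portability (a monotonic, OS-specific clock would likely be a better choice in real code), and the function names are invented here.

```c
#include <stdint.h>
#include <time.h>

/* Current wall-clock reading in nanoseconds, via C11 timespec_get(). */
static int64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Smallest positive step seen between consecutive readings over
   `samples` trials. This estimates the usable granularity and includes
   the cost of the call itself. Returns INT64_MAX if samples <= 0. */
int64_t observed_timer_step_ns(int samples)
{
    int64_t best = INT64_MAX;
    for (int i = 0; i < samples; ++i) {
        int64_t t0 = now_ns();
        int64_t t1;
        do {
            t1 = now_ns();
        } while (t1 <= t0);          /* spin until the clock visibly advances */
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

If observed_timer_step_ns(1000) reports a step that is not much smaller than the interval being timed, a single host-side measurement of that interval should not be trusted.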

Timing on CPUs is always complicated by the OS abstractions, especially scheduling. A real-time embedded OS tends to be honest and give you whatever the hardware can report.

You can try to hit the CPU counters directly with the x86 RDTSC instruction… but even that isn’t stable because of the OS scheduling interference.