time measurement discrepancy: timer, clock(), profiling

nzhang · April 7, 2010, 5:23pm

I used two different approaches to measure the duration of a kernel function. They yield quite different results:
Method 1: use the CUDA timer

unsigned int timer = 0;
cutilCheckError( cutCreateTimer( &timer));

// Timer starts
cudaThreadSynchronize();
cutilCheckError( cutStartTimer( timer));

/* function here */

cudaThreadSynchronize();
cutilCheckError( cutStopTimer( timer));
printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer));
cutilCheckError( cutDeleteTimer( timer));

The result is 0.29 ms. By the way, the timer reports its results in milliseconds, correct?

Method 2: use CUDA profiling
Set the environment variables:
CUDA_PROFILE=1
CUDA_PROFILE_CSV=1
CUDA_PROFILE_CONFIG=./CUDA_PROFILE_CONFIG

For the kernel function,
gputime=23.136 microseconds
cputime=37.318 microseconds

I am wondering why the results of these two methods are so different.
Thanks a lot!

nzhang · April 7, 2010, 5:48pm

I also tried the cudaEvent approach.

The code is:
cudaEvent_t start,stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );
float et;

cudaEventRecord(start,0);

cholRpKernel4t<<<NRP / CHOL_SIZE, CHOL_SIZE>>>(d_a, d_b, d_x, dtimer);

cudaEventRecord(stop,0 );
cudaEventSynchronize(stop);
cudaEventElapsedTime( &et, start, stop );

printf("elapsed time: %f milliseconds (ms) \n", et);

The result is 0.095 ms.

In addition, I used clock() to measure the clock cycles consumed by different threads of the kernel function. I found that all threads together span a duration of 22452 cycles, while the GT9800 GPU clock rate is 1.35 GHz.
22452 / (1.35 * 10^9) s = about 16.6 microseconds, which is close to what I got by profiling.
Now the question is which timing measurement can be trusted.
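
The cycle-to-time conversion above can be sketched as a small C helper. The name cycles_to_us is hypothetical (it is not a CUDA API), and the 1.35 GHz figure is simply the clock rate quoted in the post, so treat this as a sketch of the arithmetic rather than a calibrated measurement:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count (e.g. a difference of two clock() readings)
   to microseconds for a given clock rate in Hz.
   cycles_to_us is a hypothetical helper name, not part of CUDA. */
static double cycles_to_us(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz * 1e6;
}
```

With the numbers from the post, cycles_to_us(22452, 1.35e9) comes out at roughly 16.6 microseconds.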

SPWorley · April 7, 2010, 6:03pm

Sub-millisecond timing on the CPU is always a big problem… really, it’s just hard to accurately measure such short intervals.
The clock() intrinsic inside kernels is very accurate though, since it’s not measuring a time, it’s measuring a count. Usually this is fine, even preferred, if you’re just benchmarking.

One gotcha: on older GPUs the clock() register is only 32 bits… which means it wraps around after 4 seconds or so. That makes timing kernels of more than a second potentially annoying. I believe in Fermi compute 2.0 there’s both a 32 and 64 bit clock.
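
As a rough illustration of the wrap-around point, here is a minimal C sketch. Both function names are hypothetical, and it assumes the counter wraps at most once between the two readings. Unsigned subtraction is modulo 2^32, so a single wrap still gives the right difference; anything longer than one wrap period is silently lost:

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed between two 32-bit counter readings.
   Correct only if the counter wrapped at most once in between. */
static uint32_t elapsed_cycles32(uint32_t start, uint32_t end)
{
    return end - start;  /* unsigned subtraction is modulo 2^32 */
}

/* Seconds until a 32-bit cycle counter wraps at the given clock rate (Hz). */
static double wrap_period_s(double clock_hz)
{
    assert(clock_hz > 0.0);
    return 4294967296.0 / clock_hz;  /* 2^32 cycles */
}
```

At the 1.35 GHz rate mentioned earlier in the thread, wrap_period_s(1.35e9) is a little over 3 seconds, in line with the "4 seconds or so" estimate.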

nzhang · April 7, 2010, 6:09pm

Thank you for the reply!

For sub-millisecond timing, how well do the other approaches behave, such as the profiling method (set CUDA_PROFILE=1) and the GPU timer (cudaEventCreate)?

SPWorley · April 7, 2010, 6:19pm

It depends on your OS. On Windows there’s the newer QueryPerformanceCounter(), which will be better than other time queries but certainly won’t be accurate below, say, 100 us.

In OSX I remember having huge issues getting better than 1/60 second resolution!

Timing on CPUs is always complicated by the OS abstractions (especially scheduling). A real-time embedded OS tends to be honest and give you whatever the hardware can report.

You can try to hit the CPU counters directly with the x86 RDTSC instruction… but even that isn’t stable because of the OS scheduling interference.
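
One way to see how coarse a host-side timer is on a given system is to measure the smallest step it ever reports. The sketch below uses the C11 standard function timespec_get; the helper names now_ns and min_timer_step_ns are hypothetical. It reads the wall clock rather than a monotonic one, so it only gives a rough picture of granularity and says nothing about scheduling jitter:

```c
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in nanoseconds, via C11 timespec_get. */
static int64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Smallest positive step seen between consecutive distinct readings,
   taken over `samples` attempts. Returns INT64_MAX if none was seen. */
static int64_t min_timer_step_ns(int samples)
{
    int64_t best = INT64_MAX;
    for (int i = 0; i < samples; ++i) {
        int64_t t0 = now_ns();
        int64_t t1;
        do {
            t1 = now_ns();          /* spin until the reading changes */
        } while (t1 == t0);
        int64_t step = t1 - t0;
        if (step > 0 && step < best) /* ignore backward clock adjustments */
            best = step;
    }
    return best;
}
```

If min_timer_step_ns(1000) reports a step that is not much smaller than the interval being timed, a single measurement of that interval cannot be trusted, and timing many repetitions and dividing is the usual workaround.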
