Compare Execution Times CPU vs GPU the proper way?

I'm measuring the time of some functions written in C++ with clock(), so I get the dedicated CPU clocks for that portion of code, this way (demonstrative code only, don't mind the syntax):

#include <stdio.h>
#include <time.h>

clock_t first = clock();
FUNCTION();
clock_t last = clock();
printf("time = %f\n", (double)(last - first) / CLOCKS_PER_SEC);

But I'm wondering how I'm going to measure time once the functions are implemented in CUDA, because if I use the same method I should get roughly 0 clocks, since there is no CPU work, right?

I could use time_t (date) differences converted to seconds instead of clocks, but something tells me I should ask experienced people, just in case.

What is the best and most recommended way to do it?

thanks in advance

Cristobal

Do the following (Python syntax/Linux, since this is how I do it):

t0 = time()
cudaThreadSynchronize()
# GPU work (kernels, I/O) goes here
cudaThreadSynchronize()
t1 = time()
print 'elapsed time', t1 - t0

You can also use cudaEvents to record the GPU time; this usually gives more or less the same results. The cudaThreadSynchronize calls are essential, since kernel launches are asynchronous. If you write data to the GPU prior to the kernel launch or read results back after it, that will do an implicit synchronization.
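
For reference, here is a minimal, self-contained sketch of event-based timing with the runtime API; myKernel, its launch configuration, and the buffer size are placeholders, not code from this thread:

#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real work being timed
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                      // marker queued before the launch
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // asynchronous launch
    cudaEventRecord(stop, 0);                       // marker queued after the launch
    cudaEventSynchronize(stop);                     // block until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // elapsed GPU time in milliseconds
    printf("kernel time = %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}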

The time function in Python's time module under Linux has a resolution of about one microsecond (essentially the resolution of the underlying Linux timers). If you do this in C, you probably don't want to use clock, because it measures CPU time; use gettimeofday instead. That gives you elapsed wall time, which is what matters for all practical estimates.
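
As an example, a small wall-clock helper built on gettimeofday under Linux; FUNCTION() is just a stand-in for the work being timed, as in the original post:

#include <stdio.h>
#include <sys/time.h>

/* current wall-clock time in seconds */
static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* usage: */
double t0 = wall_time();
FUNCTION();                               /* CPU work, or GPU work bracketed by synchronization */
double t1 = wall_time();
printf("elapsed wall time = %f s\n", t1 - t0);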

Also, starting up the GPU takes some time, especially with code using the CUDA runtime, CUBLAS, etc., so running a kernel once prior to timing (called a "warmup" in some SDK examples) is recommended. The actual timing code will then be unaffected by this. On my cards, startup is usually about 0.3 to 0.6 seconds.
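
As a rough sketch of the warmup idea (myKernel, grid, block, and d_data are placeholders; cudaDeviceSynchronize is the newer name for cudaThreadSynchronize):

myKernel<<<grid, block>>>(d_data, n);   // untimed warmup launch: absorbs context creation and module load
cudaThreadSynchronize();                // wait for the warmup to finish before starting the timer
/* ...start the actual timing code here... */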

Thanks, I'll make that change then :)

I'm not exactly sure that it's "fair game" to warm up the GPU before starting the execution timer. If you are doing this timing to report speedups of the GPU over the CPU, then you should record everything, at least from the point of divergence, including cudaMemcpy calls and "warmup" kernel runs.

Now, if you just want to know the time and aren't comparing it to CPU code, then do whatever you'd like. I just thought I'd throw my two cents in there for fairness.
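
To illustrate the fairness point, here is a sketch that times the whole GPU path from the point of divergence; h_in, h_out, d_data, bytes, and the kernel launch are hypothetical names, and wall_time is the gettimeofday helper sketched earlier in the thread:

double t0 = wall_time();
cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);    // host-to-device copy counted in the total
myKernel<<<grid, block>>>(d_data, n);                       // kernel launch (asynchronous)
cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);   // device-to-host copy; blocks until the kernel is done
double t1 = wall_time();
printf("GPU total (copies + kernel) = %f s\n", t1 - t0);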

What I'm doing now is using gettimeofday for the CPU functions and CUDA timers for the GPU ones, with the cudaThreadSynchronize line.

I'm getting accurate results.

I'll have to check those warmup times; at the moment they seem to be very low, which is good.
Thanks

Actually, more accurate would be:

cudaThreadSynchronize()
t0 = time()
# GPU work (kernels, I/O) goes here
cudaThreadSynchronize()
t1 = time()
print 'elapsed time', t1 - t0

Also try running the kernel several times in a loop and averaging the elapsed intervals, as in the sketch below.
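
For instance, a sketch of that averaging (NUM_RUNS, myKernel, and the launch configuration are placeholders; wall_time is the gettimeofday helper sketched earlier):

const int NUM_RUNS = 100;
cudaThreadSynchronize();                          // make sure nothing is still pending
double t0 = wall_time();
for (int i = 0; i < NUM_RUNS; ++i)
    myKernel<<<grid, block>>>(d_data, n);         // launches are queued asynchronously
cudaThreadSynchronize();                          // wait for all of them to finish
double t1 = wall_time();
printf("average kernel time = %f s\n", (t1 - t0) / NUM_RUNS);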