How to time kernel performance? cudaprof vs cudaEventRecord vs cutStartTimer

I am comparing the performance of a CUDA-enabled program with its sequential C version.

  1. I used cudaprof to analyze my cuda_profile_log.csv file. It shows GPU time vs CPU time in microseconds. All functions except Memcopy take less time on the GPU than on the CPU. What can I infer from this? If I total the GPU time and the CPU time and compare them, is that the performance benefit?

  2. I call cudaEventRecord(start) when my CUDA program starts, run 14 kernels, and then call cudaEventRecord(stop). I read the elapsed time with cudaEventElapsedTime, which is reported in milliseconds (page 23 of the Reference Manual). I call cudaThreadSynchronize after every kernel execution. Does cudaThreadSynchronize reset the recorded event? Am I seeing only the time taken by the last kernel?

  3. I also tried cutResetTimer, cutStartTimer, the kernel call, and then cutStopTimer. This measures the time taken to execute each kernel. I then add up the times of all the kernels. Is this total the performance benefit? (I don’t know the time unit here.)
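For question 2, here is a minimal sketch of the event pattern described above. It is not the original poster's code: the `scale` kernel and the array size are made-up stand-ins for the 14 real kernels, error checking is left out for brevity, and `cudaDeviceSynchronize` is used in place of the older `cudaThreadSynchronize`. Synchronizing between kernels only makes the host wait; it does not reset a recorded event, so the elapsed time should cover everything between the two records.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the 14 kernels.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    float *d_data = NULL;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);             // enqueued before the first kernel
    for (int k = 0; k < 14; ++k) {
        scale<<<(n + 255) / 256, 256>>>(d_data, n, 1.0f);
        cudaDeviceSynchronize();           // host waits for the kernel; events are untouched
    }
    cudaEventRecord(stop, 0);              // enqueued after the last kernel
    cudaEventSynchronize(stop);            // wait until 'stop' has actually been recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("14 kernels: %f ms total\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The measured interval also includes the launch and synchronization overhead between kernels, so it will usually be a little larger than the sum of the per-kernel GPU times a profiler reports.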

In the C program, I just used the clock() function to time my program: ((double)clock()-start)/CLOCKS_PER_SEC, which gives seconds.
Also, in the SDK examples such as simpleStreams or asyncapi, what’s the time unit? Microseconds?

Please let me know how you measure the performance benefit of CUDA over sequential programs. Which is the best method?

Can someone please post how they time their kernels/CUDA programs?

I’m still learning CUDA so I’m not sure how accurate my response is…

I have been using the cut timer functions, which appear to measure time in msec according to the binomialOptions SDK example:

gpuTime = cutGetTimerValue(hTimer);

printf("binomialOptionsGPU() time: %f msec\n", gpuTime);

I do a

cutStartTimer(timer);

before doing my Host → Device memory copy and a

cudaThreadSynchronize();

cutStopTimer(timer);

after the kernel call.

For timing comparison, I also do a

cutStartTimer(timer);

before calling my Host version of the code and a

cutStopTimer(timer);

afterwards.

I’d imagine summing the times of the individual kernel calls would give accurate timing for a speedup comparison with the host code.
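The same measurement pattern can also be sketched with CUDA events instead of the cut timers. This is only an illustration under assumptions: `squareGpu`, `squareCpu`, and the array size are hypothetical, error checking is omitted, and the host side is timed with clock(), which reports processor time and so is only a fair comparison for a single-threaded host routine. As in the steps above, the GPU timing starts before the Host → Device copy and stops after the kernel.

```cuda
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

// Hypothetical GPU kernel: squares every element in place.
__global__ void squareGpu(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

// Hypothetical host version of the same computation.
static void squareCpu(float *h, int n)
{
    for (int i = 0; i < n; ++i) h[i] = h[i] * h[i];
}

int main()
{
    const int n = 1 << 22;                         // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data = NULL;
    cudaMalloc(&d_data, bytes);                    // also absorbs one-time context setup

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // GPU path: Host -> Device copy plus kernel, timed with events.
    cudaEventRecord(start, 0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    squareGpu<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                    // kernel launches are asynchronous
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);     // milliseconds

    // Host path: same work, timed with clock() and converted to milliseconds.
    clock_t t0 = clock();
    squareCpu(h_data, n);
    double cpuMs = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("GPU (copy + kernel): %f ms\n", gpuMs);
    printf("CPU:                 %f ms\n", cpuMs);
    printf("speedup:             %f x\n", cpuMs / gpuMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Whether to include the memory copies in the timed region is a judgment call: including them reflects what an application actually pays, while timing only the kernel isolates the compute speedup.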

So under what conditions should cudaEventRecord be used?

Can someone elaborate?