How to time kernel performance? cudaprof vs cudaEventRecord vs cutStartTimer

vivekv80 · March 11, 2009, 1:26pm

I am comparing the performance of a CUDA-enabled program with its sequential C version.

I used cudaprof to analyze my cuda_profile_log.csv file. I see a comparison of GPU time vs CPU time in microseconds. All functions except memcpy take less time on the GPU than on the CPU. What do I infer from this? If I total the GPU time and the CPU time and compare them, is that the performance benefit?
I did a cudaEventRecord(start) when I start my CUDA program, compute 14 kernels, and then do a cudaEventRecord(stop). I record the elapsed time (using cudaEventElapsedTime), which is again some value in milliseconds (page 23 of the Reference Manual). I do a cudaThreadSynchronize after every kernel execution. Does the cudaThreadSynchronize function reset the event recording? Am I seeing only the time taken by the last kernel?
I did a cutResetTimer, cutStartTimer, kernel call and cutStopTimer as well. This measures the time taken to execute each kernel. I finally add up the time taken by all the kernels. Is this the performance benefit? (I don't know the time unit here.)

In the C program, I just used the clock() function to time my program: ((double)clock()-start)/CLOCKS_PER_SEC.
Also, in SDK examples such as simpleStreams or asyncapi, what is the time unit? Microseconds?

Please let me know how you measure the performance benefit of CUDA over sequential programs. Which is the best method?

vivekv80 · March 13, 2009, 2:38pm

Can someone please post how they time their kernels/CUDA programs?

jph4599 · March 13, 2009, 3:07pm

I’m still learning CUDA so I’m not sure how accurate my response is…

I have been using the cut timer functions, which appear to measure time in msec according to the binomialOptions SDK example:

gpuTime = cutGetTimerValue(hTimer);

printf("binomialOptionsGPU() time: %f msec\n", gpuTime);

I do a

cutStartTimer(timer);

before doing my Host → Device memory copy and a

cudaThreadSynchronize();

cutStopTimer(timer);

after the kernel call.

For timing comparison, I also do a

cutStartTimer(timer);

before calling my Host version of the code and a

cutStopTimer(timer);

afterwards.

I’d imagine summing the times of the individual kernel calls would provide accurate timing for a speedup comparison with the host code.

vivekv80 · March 21, 2009, 1:27pm

So under what conditions should cudaEventRecord be used?

Can someone elaborate?
