I am comparing the performance of a CUDA-enabled program with its sequential C version.
I used cudaprof to analyze my cuda_profile_log.csv file, and I see a comparison of GPU time vs. CPU time in microseconds. All functions except memcopy take less time on the GPU than on the CPU. What should I infer from this? If I total the GPU time and the CPU time and compare them, is that the performance benefit?
I call cudaEventRecord(start) when my CUDA program starts, compute 14 kernels, and then call cudaEventRecord(stop). I record the elapsed time using cudaEventElapsedTime, which is again some value in milliseconds (page 23 of the Reference Manual). I also call cudaThreadSynchronize after every kernel execution. Does cudaThreadSynchronize reset the recorded events? Am I seeing the time taken only for the execution of the last kernel?
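For reference, here is a minimal sketch of what I mean (dummyKernel is just a placeholder for my real kernels). My understanding is that events mark points in a stream, so the intermediate cudaThreadSynchronize calls should not reset them, and the elapsed time should cover everything between the two records, but please correct me if that's wrong:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() { /* stand-in for one of the 14 kernels */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // recorded into stream 0
    for (int i = 0; i < 14; ++i) {
        dummyKernel<<<1, 256>>>();
        cudaThreadSynchronize();        // blocks the host; does NOT reset the events
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // milliseconds between the two events
    printf("all 14 kernels: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```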
I also tried cutResetTimer, cutStartTimer, a kernel call, and then cutStopTimer. This measures the time taken to execute each kernel, and I finally add up the total time across all the kernels. Is this total the performance benefit? (I don't know the time unit here.)
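Roughly, this is what I'm doing with the cutil timers (myKernel and the launch configuration are placeholders; cutil is the helper library shipped with the SDK, not part of CUDA proper). As far as I can tell the cutil timers are host-side, so I synchronize before stopping them since kernel launches are asynchronous, and I believe cutGetTimerValue reports milliseconds:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cutil.h>   // helper library from the SDK examples

__global__ void myKernel() { /* placeholder kernel */ }

int main() {
    unsigned int timer = 0;
    cutCreateTimer(&timer);

    float total = 0.0f;
    for (int i = 0; i < 14; ++i) {
        cutResetTimer(timer);
        cutStartTimer(timer);
        myKernel<<<1, 256>>>();
        cudaThreadSynchronize();          // launch is async; the timer is host-side
        cutStopTimer(timer);
        total += cutGetTimerValue(timer); // milliseconds, as far as I can tell
    }
    printf("sum of kernel times: %f ms\n", total);
    cutDeleteTimer(timer);
    return 0;
}
```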
In the C program, I just use clock() (with a clock_t start value) to time the run: ((double)(clock() - start)) / CLOCKS_PER_SEC.
Also, in the SDK examples such as simpleStreams or asyncapi, what's the time unit? Microseconds?
Please let me know how you time the performance benefit of CUDA over sequential programs. Which is the best method?