Measuring Kernel Performance

Now that I have finished my kernel (and all the code that supports it), I need to evaluate its performance with a few benchmarks, and I was wondering which method you would recommend.

Currently I’m just taking the time the kernel ends minus the time it starts, in milliseconds, directly in the C++ code. I’m not sure, though, that this is the best method. I tried CUDA Prof, but as far as I can tell it only gives the timestamps at which each CUDA operation started…

What would you suggest as the best option?

Use cudaEventRecord(). There are numerous examples in the SDK that use it to time kernel execution.
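In case it helps, here is a minimal sketch of the event pattern those SDK samples follow. The kernel, the buffer names and the problem size are hypothetical stand-ins, not your code, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the kernel being measured.
__global__ void kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1 << 20;  // assumed problem size
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded in the same stream as the kernel, so the elapsed
    // time covers exactly the work enqueued between the two records.
    cudaEventRecord(start, 0);
    kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);

    // Kernel launches are asynchronous: wait for the stop event before
    // reading the elapsed time.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // result is in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The first launch in a process often includes one-time setup cost, so it is usually worth running the kernel once before the timed launch.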

But are events the best way to evaluate the overall performance? Right now, what I want to know is how long, on average, it takes to run the following code:

copyHostToDevice(host_stream_in, device_stream_in);

kernel<<<1,1>>>(device_stream_in, device_stream_out);

copyDeviceToHost(device_stream_out, host_stream_out);

What I’m unsure about is whether I should use events or wall-clock time: although I’m measuring the kernel’s performance, I still need to compare my solution to a similar CPU-based one.
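One way to settle this is to collect both numbers over the same copy, kernel, copy sequence and average them over several runs. The sketch below assumes plain cudaMemcpy calls in place of the copyHostToDevice and copyDeviceToHost helpers, and uses a hypothetical kernel, buffer size and run count; error checking is omitted.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the kernel being measured.
__global__ void kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1 << 20;   // assumed problem size
    const int runs = 20;     // assumed number of timed repetitions
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_in(n, 1.0f), h_out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    double wall_ms = 0.0;
    float event_ms = 0.0f;

    // r == -1 is an untimed warm-up pass that absorbs one-time setup cost.
    for (int r = -1; r < runs; ++r) {
        cudaDeviceSynchronize();  // make sure the GPU is idle before timing
        auto t0 = std::chrono::steady_clock::now();
        cudaEventRecord(start, 0);

        cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
        kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // GPU work is finished past this point
        auto t1 = std::chrono::steady_clock::now();

        if (r < 0) continue;  // skip the warm-up pass

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        event_ms += ms;
        wall_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    printf("average event time:      %.3f ms\n", event_ms / runs);
    printf("average wall-clock time: %.3f ms\n", wall_ms / runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because kernel launches return immediately, a wall-clock measurement is only meaningful if the host waits for the GPU before stopping the clock, which is what the synchronize call does here. For a comparison against a CPU version, the wall-clock figure including both transfers is probably the fairer number, while the event time is useful for seeing how much of it is spent on the device side.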

I usually use the CUDA profiler to measure execution time. In the report (*.csv) file, pay attention to the three columns “timestamp”, “gputime” and “cputime” (in microseconds).