Bugs in the profiler 1.0?

Hi, all.

I wonder whether profiler 1.0 has a bug. I'm not sure whether this is already known, but I think the GPU time of a kernel that is followed by a cudaMemcpy with the cudaMemcpyDeviceToDevice option is not measured.

My code looks like this.

kernel_OF_Dp_GPU<<< grid, threads >>>(img1_d_data, img2_d_data, … );
kernel_OF_multi_hypo_constant_find_max_label<<< grid, threads >>>(img1_d_data, Dp_d, … );

for (int itere = 0; itere < 100; itere++) {
    kernel_OF_Label_optim<<< grid, threads >>>(label, result_d, … );
    cudaMemcpy(label, result_d, sizeof(char) * width * height, cudaMemcpyDeviceToDevice);
}
kernel_OF_Label2disp<<< grid, threads >>>(label, result_d, …);

In this case, the "gpu usec" of kernel_OF_Label_optim is reported as about 5 usec, which must actually be the GPU time of the cudaMemcpy. The profiling results contain no entry for the cudaMemcpy (with the device-to-device option) itself. The CPU time is reported correctly, so the kernel appears to have a very large CPU overhead, which is not really the case.

When I comment out the cudaMemcpy, the GPU time for kernel_OF_Label_optim is reported correctly.

Does anybody know the answer for this?
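
One way to cross-check the profiler's numbers is to time the kernel and the copy separately with CUDA events. The following is only a minimal sketch: dummy_kernel is a hypothetical stand-in for kernel_OF_Label_optim (whose real arguments are elided above), and the image size and launch configuration are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel_OF_Label_optim; the real kernel is not shown in the post.
__global__ void dummy_kernel(const char *label, char *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) result[i] = label[i] + 1;
}

int main()
{
    const int width = 640, height = 480;      // assumed image size
    const int n = width * height;
    const int threads = 256;                  // assumed launch configuration
    const int grid = (n + threads - 1) / threads;

    char *label = NULL, *result_d = NULL;
    cudaMalloc((void **)&label, n * sizeof(char));
    cudaMalloc((void **)&result_d, n * sizeof(char));
    cudaMemset(label, 0, n * sizeof(char));

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    // Record an event before the kernel, between kernel and copy, and after the copy.
    cudaEventRecord(start, 0);
    dummy_kernel<<<grid, threads>>>(label, result_d, n);
    cudaEventRecord(afterKernel, 0);
    cudaMemcpy(label, result_d, n * sizeof(char), cudaMemcpyDeviceToDevice);
    cudaEventRecord(afterCopy, 0);

    // Wait until the last event has actually been reached on the GPU.
    cudaEventSynchronize(afterCopy);

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    printf("kernel: %.3f ms, device-to-device copy: %.3f ms\n", kernelMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    cudaFree(label);
    cudaFree(result_d);
    return 0;
}
```

If the event timings show a plausible kernel time while the profiler still reports about 5 usec, that would support the suspicion that the profiler attributes the copy's time to the kernel.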

From my experience, whenever you do a memcpy the CPU and GPU must sync. If you are only launching kernels, the CPU piles them up on the GPU but doesn't wait for them to finish. This sync is very costly. In one of my implementations I have an iterative algorithm, and I found that it's faster for me to run it for the maximum number of iterations that might be needed instead of testing whether the error is small every once in a while.

So I guess your results can make sense…
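
A rough way to see the cost described above is to compare queuing all launches and syncing once against reading a value back after every launch. This is only a sketch under assumptions: spin_kernel, the buffer size and the iteration count are hypothetical, and it uses a device-to-host copy, where the blocking behaviour is clearest (a device-to-device copy may not block the host in the same way). cudaDeviceSynchronize is the current name for the sync call; older toolkits used cudaThreadSynchronize.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that just does some arithmetic on every element.
__global__ void spin_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 0.999f + 0.001f;
        data[i] = v;
    }
}

// Host-side wall-clock time in milliseconds since t0.
static double ms_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 20;        // assumed problem size
    const int iterations = 100;   // assumed iteration count
    const int threads = 256;
    const int grid = (n + threads - 1) / threads;

    float *d = NULL;
    float h = 0.0f;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so one-time startup cost is not measured.
    spin_kernel<<<grid, threads>>>(d, n);
    cudaDeviceSynchronize();

    // Variant A: queue every launch, wait once at the end.
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it)
        spin_kernel<<<grid, threads>>>(d, n);
    double queuedMs = ms_since(t0);   // launches have returned, GPU may still be busy
    cudaDeviceSynchronize();
    double totalA = ms_since(t0);

    // Variant B: read one value back after every launch (the host waits each time).
    t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        spin_kernel<<<grid, threads>>>(d, n);
        cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    }
    double totalB = ms_since(t0);

    printf("A: launches queued after %.3f ms, finished after %.3f ms\n", queuedMs, totalA);
    printf("B: with a readback per iteration, finished after %.3f ms\n", totalB);

    cudaFree(d);
    return 0;
}
```

How large the gap between A and B is depends on the kernel's run time relative to the per-sync overhead, so the numbers are only meaningful for your own kernel and hardware.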

Sounds like a bug, because the profiler always inserts implicit cudaThreadSynchronize calls. So PM one of the NVIDIA employees with an example.