The following references the 2 images linked in this post. I have a question regarding the OpenCL Profiler, and I will try to ask it as briefly as possible (I hate reading long posts too).
In the OpenCL Profiler image (attached), the CPU time is shorter than the GPU time for every kernel. According to the OpenCL profiler documentation, this should not be the case as long as profiling is enabled, which it is (for a blocking kernel call). This appears to be an error; does anyone have any insight?
My second question: if you look at the time between the kernels TNR_filterX and TNR_filterY, the GPU timestamps show a difference of 5.121 ms. Since the kernel execution itself only took 2.478 ms, that would mean the host overhead to launch the TNR_filterY kernel was over 2.5 ms! BTW, the GPU timestamp numbers agree with what clGetEventProfilingInfo reports in the code, so I believe they are correct. This much overhead could not be tolerated in a near-real-time application. Am I doing something wrong here? There is no code between the kernel calls other than setting kernel arguments and calling clEnqueueNDRangeKernel for the next kernel, so the host should not be busy doing anything that would take a large amount of time. If you look at the attached CUDA image, there is no appreciable delay for the same algorithms, so there is definitely a large difference in overall OpenCL and CUDA execution time (host delays + kernel processing time).
Thanks in advance for any help here.