Interpreting OpenCL Visual Profiler Results

This post refers to the two attached images. I have a question about the OpenCL Profiler, and I will try to ask it as briefly as possible (I hate reading long posts too).

In the OpenCL Profiler image (attached), the CPU time for every kernel is shorter than the GPU time. According to the OpenCL Profiler documentation, this should not happen as long as profiling is enabled, which it is (for a blocking kernel call). This appears to be an error. Does anyone have any insight?

My second question concerns the time between the TNR_filterX and TNR_filterY kernels. The GPU timestamps show a difference of 5.121 ms. If the kernel execution took only 2.478 ms, that suggests the host overhead to launch TNR_filterY was over 2.5 ms (5.121 - 2.478 = 2.643 ms). The GPU timestamps agree with the values I get from clGetEventProfilingInfo in my code, so I believe they are correct. This much overhead cannot be tolerated in a near-real-time application. Am I doing something wrong here? There is no code between kernel calls other than setting kernel arguments and calling clEnqueueNDRangeKernel for the next kernel, so the host should not be busy with anything that takes a significant amount of time. In the attached CUDA image there is no appreciable delay for the same algorithms, so there is definitely a large difference in overall execution time (host delays plus kernel processing time) between OpenCL and CUDA.
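To spell out the arithmetic behind that estimate, here is a minimal sketch in Python (chosen only for brevity; my host code is C). It assumes the 5.121 ms figure is the difference between the GPU start timestamps of TNR_filterX and TNR_filterY, and that 2.478 ms is the execution time of TNR_filterX. The function name `idle_gap_ns` is hypothetical. Its inputs stand for the nanosecond values that clGetEventProfilingInfo returns for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.

```python
def idle_gap_ns(prev_start_ns: int, prev_end_ns: int, next_start_ns: int) -> int:
    """Return how long the device sat idle between two consecutive kernels.

    All arguments are device timestamps in nanoseconds, the unit used by
    OpenCL event profiling. The gap is measured from the end of the first
    kernel to the start of the second.
    """
    if not (prev_start_ns <= prev_end_ns <= next_start_ns):
        raise ValueError("timestamps must be ordered: start <= end <= next start")
    return next_start_ns - prev_end_ns


# Figures from the profiler output, taking the start of TNR_filterX as time zero
# (assumption: 5.121 ms is the start-to-start difference between the kernels).
filter_x_start_ns = 0
filter_x_end_ns = 2_478_000      # 2.478 ms of kernel execution
filter_y_start_ns = 5_121_000    # 5.121 ms after TNR_filterX started

gap_ns = idle_gap_ns(filter_x_start_ns, filter_x_end_ns, filter_y_start_ns)
print(f"idle gap: {gap_ns / 1e6:.3f} ms")  # prints "idle gap: 2.643 ms"
```

If the gap is real, comparing the CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT timestamps for TNR_filterY might show whether the time is lost before the command ever reaches the device.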

Thanks in advance for any help here.
OpenCL_Profile_Output.jpg
CUDA_Profile_Output.jpg

Sorry, the labels did not come through with the post. The first image shows the CUDA profiler results and the second shows the OpenCL profiler results.

Would an NVIDIA employee be willing to comment? My company needs to understand this issue quickly, as we are currently in product development. If I am not doing something wrong, OpenCL does not yet seem mature enough for product deployment (at least for near-real-time or real-time applications).

bump

Nothing!!! Bueller…Bueller???