How expensive is calling clEnqueueNDRangeKernel?

Question #1

I have a function Run() that launches two kernels:

// As you can see, I’m using events (eventRow, eventCol) for profiling.

How expensive, in terms of time, is a call to enqueueNDRangeKernel (or clEnqueueNDRangeKernel)?

With the NVIDIA OpenCL Profiler, I got a total execution time (on the GPU) of 351 ms, but when I measured the running time of the Run() method, I got 622 ms.

Why is this difference so large?

When is the data transferred to the GPU: when clEnqueueNDRangeKernel is called, or when the buffer is created (clCreateBuffer)?

I tested on an NVIDIA GT240.

I also tested on an ATI HD 5670, and the difference is much smaller.


I measured the overhead of launching my kernel to be 41 µs here:

This is of course just on my machine, but it means that even though my calculations run much faster on the GPU, I can’t benefit, since the launch overhead is killing the overall execution time.
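One way to see where the time goes is to split each event's timeline into its phases. Below is a minimal sketch in C, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the four timestamps were already read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, CL_PROFILING_COMMAND_END). The struct and function names (EventTimes, EventBreakdown, break_down, host_overhead_ms) are hypothetical and only for illustration; they are not part of OpenCL.

```c
#include <stdint.h>

/* Hypothetical container for the four profiling timestamps of one event.
   OpenCL reports them as cl_ulong values in nanoseconds. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} EventTimes;

/* Durations of the three phases, in milliseconds. */
typedef struct {
    double queue_ms;   /* queued -> submit: command waits in the host queue */
    double launch_ms;  /* submit -> start: submitted, not yet running       */
    double exec_ms;    /* start  -> end: kernel actually executing          */
} EventBreakdown;

static double ns_to_ms(uint64_t ns)
{
    return (double)ns / 1.0e6;
}

/* Split one event's timeline into its three phases. */
EventBreakdown break_down(EventTimes t)
{
    EventBreakdown b;
    b.queue_ms  = ns_to_ms(t.submit - t.queued);
    b.launch_ms = ns_to_ms(t.start  - t.submit);
    b.exec_ms   = ns_to_ms(t.end    - t.start);
    return b;
}

/* Time the host spent that the device-side execution does not account for:
   enqueue calls, transfers, driver work, waiting, and so on. */
double host_overhead_ms(double host_wall_ms, double device_exec_ms)
{
    return host_wall_ms - device_exec_ms;
}
```

With the numbers from the question, host_overhead_ms(622.0, 351.0) gives 271 ms that is spent outside kernel execution. Comparing queue_ms and launch_ms for eventRow and eventCol should show how much of that is launch latency and how much comes from something else, such as data transfers or host-side setup.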