Reusing a kernel instance for back-to-back sets of work items, to overlap computation and transfers


I’m working on a scientific computing application where the overall problem is too big to fit in device memory. I therefore subdivide the problem domain into patches, sized so that two sets of the input and output buffers required for each patch fit on the device. This manual effort paves the way for concurrent computation and device-host transfer, and lets me keep control over the order in which patches complete.

The same computation, or kernel, is applied to every element of every patch.
If the work is staggered correctly on a device of compute capability >=2.0, the device can execute the kernel on patch A while moving buffers for patch B.
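To make the staggering concrete, here is a minimal sketch in plain C of the schedule I have in mind. It contains no OpenCL calls and only plans which patch each stage would handle at each step. The three-stage split (upload, compute, download), the "step" notion and all function names are my own assumptions for illustration; in the real application the ordering would have to be enforced with events or queue ordering.

```c
#include <stdio.h>

#define NUM_SLOTS 2 /* two sets of input/output buffers on the device */

/* Buffer set used by a patch: patches alternate between the two sets. */
static int slot_for_patch(int patch) { return patch % NUM_SLOTS; }

/* Patch handled by a stage at a given step, or -1 if the stage is idle.
   stage 0 = host-to-device copy, 1 = kernel, 2 = device-to-host copy. */
static int patch_at(int step, int stage, int num_patches) {
    int patch = step - stage;
    return (patch >= 0 && patch < num_patches) ? patch : -1;
}

/* Steps needed to push every patch through all three stages. */
static int total_steps(int num_patches) {
    return num_patches > 0 ? num_patches + 2 : 0;
}

/* Print the plan, one line per step. */
static void print_schedule(int num_patches) {
    static const char *names[] = {"upload", "compute", "download"};
    for (int step = 0; step < total_steps(num_patches); ++step) {
        printf("step %d:", step);
        for (int stage = 0; stage < 3; ++stage) {
            int p = patch_at(step, stage, num_patches);
            if (p >= 0)
                printf(" %s patch %d (slot %d)", names[stage], p, slot_for_patch(p));
        }
        printf("\n");
    }
}
```

Calling `print_schedule(4)` from `main` prints the plan for four patches. At any step the patch being uploaded and the patch being computed sit in different buffer sets. The upload and the download at the same step share a set, which only works because each set has separate input and output buffers.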

I would like clarification on the use and re-use of cl_kernel instances.

Reading the description of clSetKernelArg() suggests that a single kernel instance could be used in the sequence described above. Once the kernel with arguments set for patch A is enqueued, new arguments could be set for patch B and the kernel enqueued again as soon as the patch B buffers are ready to go.
A kernel instance and the kernel execution enqueued by clEnqueueNDRangeKernel are two different things, so this approach should be OK.

Does this sound right?

One source of doubt is the NVIDIA SDK oclCopyComputeOverlap example, which creates two kernel instances. Since its scenario is mostly the same as the one I described above, I would have thought one kernel instance would suffice.


I believe that is because the example is also set up to execute two kernels (with different arguments) in parallel, not just to overlap copy and compute operations. To the best of my knowledge, if you just want to overlap copy and compute, one kernel instance should suffice.
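To illustrate the distinction between a kernel object and an enqueued execution, here is a toy model in plain C. It is not OpenCL code: `ToyKernel`, `ToyQueue`, `toy_set_args` and `toy_enqueue` are hypothetical stand-ins I made up, and the integer buffer ids take the place of cl_mem handles. It only models the behaviour the question infers from the clSetKernelArg() description, namely that an enqueue takes the arguments as they are set at that moment.

```c
#include <stddef.h>

/* Toy stand-ins, NOT the OpenCL API. */
typedef struct { int in_buf; int out_buf; } ToyArgs; /* buffer ids, not cl_mem */
typedef struct { ToyArgs args; } ToyKernel;          /* role of a cl_kernel */
typedef struct { ToyArgs captured; } ToyCommand;     /* one enqueued execution */

#define TOY_QUEUE_CAP 8
typedef struct { ToyCommand cmds[TOY_QUEUE_CAP]; size_t count; } ToyQueue;

/* Stands in for clSetKernelArg: changes the kernel object's current args. */
static void toy_set_args(ToyKernel *k, int in_buf, int out_buf) {
    k->args.in_buf = in_buf;
    k->args.out_buf = out_buf;
}

/* Stands in for clEnqueueNDRangeKernel: snapshots the current args into a
   new command. Returns 0 on success, -1 if the toy queue is full. */
static int toy_enqueue(ToyQueue *q, const ToyKernel *k) {
    if (q->count == TOY_QUEUE_CAP) return -1;
    q->cmds[q->count++].captured = k->args;
    return 0;
}

/* Reuse ONE kernel object for n patches, alternating between two buffer
   sets: set 0 uses ids (0, 1), set 1 uses ids (2, 3). */
static int toy_enqueue_patches(ToyQueue *q, ToyKernel *k, int n) {
    for (int p = 0; p < n; ++p) {
        int slot = p % 2;
        toy_set_args(k, 2 * slot, 2 * slot + 1);
        if (toy_enqueue(q, k) != 0) return -1;
    }
    return 0;
}
```

In this model, changing the arguments after an enqueue leaves the already queued commands untouched, which is why a single kernel object is enough for the alternating patches.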

One thing to be careful about: the spec says that clSetKernelArg() is not thread-safe, even in the (mostly thread-safe) OpenCL 1.1. I would be careful about how the kernel arguments are set, especially since NVIDIA’s OpenCL is 1.0.