I’m working on a scientific computing application where the problem is overall too big to fit in device memory. I therefore subdivide the problem domain into patches, such that the two sets of input and output buffers required for each patch fit on the device. This manual effort paves the way for concurrent computation and device-host transfer, and retains control over the order in which patches complete.
The same computation, or kernel, is applied to every element of every patch.
Staggered correctly on a device of compute capability >= 2.0, the device can be executing the kernel on patch A while transferring buffers for patch B.
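The staging I have in mind looks roughly like the following sketch (two in-order command queues on the same context; the queue, buffer, event, and size names are all placeholders, and error checking is omitted):

```c
cl_event kernelDoneA, writeDoneB;

/* queue 0: run the kernel on patch A's buffers (args already set) */
clEnqueueNDRangeKernel(computeQueue, kernel, 1, NULL,
                       &globalSizeA, NULL, 0, NULL, &kernelDoneA);

/* queue 1: meanwhile, a non-blocking write streams patch B's input up */
clEnqueueWriteBuffer(transferQueue, inBufB, CL_FALSE /* non-blocking */,
                     0, patchBytes, hostPtrB, 0, NULL, &writeDoneB);

/* kick both queues off so the work actually overlaps */
clFlush(computeQueue);
clFlush(transferQueue);
```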
I would like clarification on the use and re-use of cl_kernel instances.
Reading the description of clSetKernelArg() suggests that a single kernel instance could be used for the sequence described above: once the kernel is enqueued with arguments set for patch A, new arguments can be set for patch B and the kernel enqueued again as soon as patch B’s buffers are ready to go.
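Concretely, the re-use pattern I’m asking about is something like this (a sketch with placeholder names; my understanding is that clEnqueueNDRangeKernel takes a snapshot of the argument values at enqueue time, so overwriting them afterwards should not disturb the already-enqueued execution):

```c
cl_int err;

/* patch A: bind A's buffers to the kernel and enqueue it */
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBufA);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &outBufA);
err |= clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                              &globalSize, NULL, 0, NULL, &eventA);

/* patch B: overwrite the same cl_kernel's arguments once B's buffers
 * are ready, then enqueue the same kernel object a second time */
err |= clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBufB);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &outBufB);
err |= clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                              &globalSize, NULL, 0, NULL, &eventB);
```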
A kernel instance and the kernel execution instance created by clEnqueueNDRangeKernel (identified by the event it returns) are two different things, so this approach should be OK.
Does this sound right?
One source of doubt comes from reading the NVIDIA SDK oclCopyComputeOverlap example. It creates two kernel instances, yet the scenario is mostly the same as the one I described above; I would have thought one kernel instance could suffice.