Stencil computations

Hi!

I'm implementing an algorithm which computes a 5-point stencil on a vector and then another stencil that depends on the results of the first (a 2nd-order Runge-Kutta method).

  1. Use two kernels for this?

  2. Is there a way to synchronize the workgroups after computation of the first stencil, such that the computation of the 2nd does not start before all results of the first have been written to global memory?

  3. Use only one workgroup with size equal to the number of elements in the vector, such that synchronization can be done with a memory fence?

Or what is the best way to compute this using OpenCL? The overhead of enqueuing a kernel seems pretty large compared to the runtime of my kernel, so launching two seems like madness :) Ultimately I would like to loop the stencil computations, say 10 times, inside one kernel to save kernel launches, but this again needs synchronization.

Best Regards,
Madsen

I found the answer in some other threads. This one provides insight and links to other threads about the same problem:
http://forums.nvidia.com/lofiversion/index.php?t92819.html

I guess options 1 and 3 are the only possible solutions to my problem, since global synchronization across workgroups seems to be out of the picture.
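
For anyone landing here later, here is a rough CPU sketch in plain C of how option 1 splits the work. It is not OpenCL code, only an illustration of the data dependency: each function stands in for one kernel, and the return of stage1 plays the role of the implicit synchronization between two kernel launches. The stencil coefficients, the periodic boundary handling and all names (stencil5, stage1, stage2, rk2_step) are my own assumptions, since the actual stencil is not given above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 5-point stencil (assumed coefficients): a 4th-order
 * second difference with periodic wrap-around at the ends. */
static float stencil5(const float *v, size_t i, size_t n)
{
    size_t im2 = (i + n - 2) % n, im1 = (i + n - 1) % n;
    size_t ip1 = (i + 1) % n,     ip2 = (i + 2) % n;
    return (-v[im2] + 16.0f * v[im1] - 30.0f * v[i]
            + 16.0f * v[ip1] - v[ip2]) / 12.0f;
}

/* "Kernel 1": midpoint estimate. Every mid[i] depends only on u. */
static void stage1(const float *u, float *mid, size_t n, float dt)
{
    for (size_t i = 0; i < n; ++i)
        mid[i] = u[i] + 0.5f * dt * stencil5(u, i, n);
}

/* "Kernel 2": final update. out[i] reads mid[i-2..i+2], so ALL of
 * stage1 must be finished before any element is computed here. */
static void stage2(const float *u, const float *mid, float *out,
                   size_t n, float dt)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = u[i] + dt * stencil5(mid, i, n);
}

/* One RK2 (midpoint) step; the boundary between the two calls is where
 * the global synchronization happens in the two-kernel version. */
static void rk2_step(const float *u, float *mid, float *out,
                     size_t n, float dt)
{
    assert(n >= 5); /* the stencil needs five distinct points */
    stage1(u, mid, n, dt);
    stage2(u, mid, out, n, dt);
}
```

To loop the computation 10 times, call rk2_step 10 times and swap the u and out pointers between calls. In the OpenCL version that would mean 20 kernel launches on an in-order queue, swapping the buffer arguments between steps.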