Hi!
I'm implementing an algorithm that computes a 5-point stencil on a vector and then a second stencil that depends on the results of the first (a 2nd-order Runge-Kutta method).
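To make the dependency concrete, here is a minimal serial C sketch of one such step. The stencil coefficients (a 4th-order central difference), the zeroed boundary cells, the midpoint form of RK2, and the names `stencil5` and `rk2_step` are all assumptions for illustration, not the actual kernel.

```c
#include <stddef.h>

/* Hypothetical 5-point stencil: 4th-order central difference for the
 * second derivative with unit grid spacing. The two cells at each end
 * are set to zero (an assumed boundary treatment). */
static void stencil5(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (i < 2 || i + 2 >= n) {
            out[i] = 0.0;
            continue;
        }
        out[i] = (-in[i - 2] + 16.0 * in[i - 1] - 30.0 * in[i]
                  + 16.0 * in[i + 1] - in[i + 2]) / 12.0;
    }
}

/* One midpoint (RK2) step for du/dt = stencil5(u).
 * tmp and k are scratch buffers of length n. */
static void rk2_step(double *u, double *tmp, double *k, size_t n, double dt)
{
    /* Stage 1: stencil on the current state. */
    stencil5(u, k, n);
    for (size_t i = 0; i < n; ++i)
        tmp[i] = u[i] + 0.5 * dt * k[i];

    /* Synchronization point: every tmp[i] must be written before stage 2
     * starts, because stage 2 reads tmp[i-2] .. tmp[i+2]. */

    /* Stage 2: stencil on the half-step state, then the full update. */
    stencil5(tmp, k, n);
    for (size_t i = 0; i < n; ++i)
        u[i] += dt * k[i];
}
```

In the serial version the ordering is implicit. On the device, element i of stage 2 reads neighbours that other work-items (possibly in other work-groups) produced in stage 1, which is where the synchronization question comes from.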

Should I use two kernels for this?

Is there a way to synchronize the workgroups after computation of the first stencil, such that the computation of the 2nd does not start before all results of the first have been written to global memory?

Alternatively, should I use only one work-group with a size equal to the number of elements in the vector, so that synchronization can be done with a memory fence?
Otherwise, what is the best way to compute this using OpenCL? The overhead of enqueuing a kernel seems pretty large compared to the runtime of my kernel, so launching two seems like madness :) Ultimately I would like to loop the stencil computations about 10 times inside one kernel to save kernel launches, but this again needs synchronization.
Best Regards,
Madsen