In classical GPGPU programming with the OpenGL API, iterative methods are usually implemented as multi-pass algorithms, where one pass = one iteration. However, I'm wondering whether there is a better way to handle these methods using CUDA.
For example, take Poisson's equation defined on a regular 2D grid and solved by the simple Jacobi iterative method. During every iteration, a 5-point stencil is applied to every grid point.
When implementing this method in CUDA, we can divide the grid into smaller sub-grids and execute a block of threads on each of them. So, for example, when we use 16x16 sub-grids (= 256 threads per block), we first load the grid data for a given sub-grid from global or texture memory into shared memory. Because of the 5-point stencil, we also need to read one border line (halo) of grid data in every direction.
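To make the setup concrete, here is a rough sketch of one such Jacobi pass. This is my own illustrative code, not from any particular library: all names (`jacobi_step`, `u_old`, `u_new`, `f`, `h2`) are hypothetical, and for simplicity it assumes the grid dimensions are exact multiples of the 16x16 tile size and a zero Dirichlet boundary.

```cuda
#define TILE 16  // 16x16 sub-grid = 256 threads per block

// One Jacobi sweep for -laplace(u) = f on a regular 2D grid.
// Assumes nx and ny are multiples of TILE; h2 is the squared grid spacing.
__global__ void jacobi_step(const float *u_old, float *u_new,
                            const float *f, int nx, int ny, float h2)
{
    // Tile plus a one-cell halo in every direction for the 5-point stencil.
    __shared__ float s[TILE + 2][TILE + 2];

    int gx = blockIdx.x * TILE + threadIdx.x;  // global column
    int gy = blockIdx.y * TILE + threadIdx.y;  // global row
    int lx = threadIdx.x + 1;                  // local column inside the halo
    int ly = threadIdx.y + 1;                  // local row inside the halo

    // Interior point of the tile.
    s[ly][lx] = u_old[gy * nx + gx];

    // Edge threads additionally fetch the border line of the neighbouring
    // sub-grid (zero outside the domain, i.e. Dirichlet boundary).
    if (threadIdx.x == 0)
        s[ly][0]        = (gx > 0)      ? u_old[gy * nx + gx - 1]   : 0.0f;
    if (threadIdx.x == TILE - 1)
        s[ly][TILE + 1] = (gx < nx - 1) ? u_old[gy * nx + gx + 1]   : 0.0f;
    if (threadIdx.y == 0)
        s[0][lx]        = (gy > 0)      ? u_old[(gy - 1) * nx + gx] : 0.0f;
    if (threadIdx.y == TILE - 1)
        s[TILE + 1][lx] = (gy < ny - 1) ? u_old[(gy + 1) * nx + gx] : 0.0f;

    __syncthreads();  // tile and halo are now complete in shared memory

    // 5-point Jacobi update; boundary points keep their (zero) value.
    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1)
        u_new[gy * nx + gx] = 0.25f * (s[ly][lx - 1] + s[ly][lx + 1] +
                                       s[ly - 1][lx] + s[ly + 1][lx] +
                                       h2 * f[gy * nx + gx]);
}
```

The host then ping-pongs `u_old`/`u_new` and launches the kernel once per iteration, which is exactly the one-launch-per-iteration pattern my question is about.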
Now, the classical approach would be to compute one iteration and store the result back into global memory. However, to perform the next iteration we would only need to update the values on the borders, reusing the rest of the data already stored in shared memory. The problem is that it is impossible to synchronize threads across different blocks, so we cannot exchange the updated border values between blocks via global memory within a single kernel launch.
Is there really no way to perform iterative methods more efficiently, or am I missing something?