iterative methods CUDA implementation details

Hi there,
In classical GPGPU programming with the OpenGL API, iterative methods are usually handled by multi-pass algorithms, where one pass = one iteration. However, I'm wondering whether there is a better way to handle these methods using CUDA.

For example, let's take Poisson's equation defined on a regular 2D grid and solved by the simple Jacobi iterative method. During every iteration a 5-point stencil is applied at every grid point: u_new[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1] - h^2 * f[i][j]) / 4, where h is the grid spacing and f is the right-hand side.
When implementing this method in CUDA, we can divide the grid into smaller sub-grids and execute a block of threads on each of them. For example, with 16x16 sub-grids (= 256 threads per block), we first load the grid data for a given sub-grid from global or texture memory into shared memory. We also need to read one border line of grid data in every direction because of the 5-point stencil.
Now, the classical approach would be to compute one iteration and store the result back into global memory. However, to perform the next iteration we would only need to update the values on the borders; all the other data could be reused straight from shared memory. The problem is that it is impossible to synchronize threads in different blocks, and therefore we can't update the border values via global memory.
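Roughly what I have in mind for a single iteration (just a sketch; the row-major layout, the pitch parameter, and the restriction to interior blocks, i.e. no clamping at the domain edges, are my own assumptions):

    // One Jacobi iteration on a 16x16 sub-grid with a one-cell halo.
    // Sketch only: interior blocks, row-major layout with row length pitch.
    #define BS 16

    __global__ void jacobiStep(const float *in, float *out,
                               const float *f, float h2, int pitch)
    {
        __shared__ float tile[BS + 2][BS + 2];

        int tx = threadIdx.x, ty = threadIdx.y;
        int gx = blockIdx.x * BS + tx;
        int gy = blockIdx.y * BS + ty;

        // Load the sub-grid, shifted by (1,1) to leave room for the halo.
        tile[ty + 1][tx + 1] = in[gy * pitch + gx];

        // Threads on the block edges also fetch the four border lines.
        if (tx == 0)      tile[ty + 1][0]      = in[gy * pitch + gx - 1];
        if (tx == BS - 1) tile[ty + 1][BS + 1] = in[gy * pitch + gx + 1];
        if (ty == 0)      tile[0][tx + 1]      = in[(gy - 1) * pitch + gx];
        if (ty == BS - 1) tile[BS + 1][tx + 1] = in[(gy + 1) * pitch + gx];
        __syncthreads();

        // 5-point Jacobi update for the Poisson equation.
        out[gy * pitch + gx] = 0.25f *
            (tile[ty + 1][tx] + tile[ty + 1][tx + 2] +
             tile[ty][tx + 1] + tile[ty + 2][tx + 1] -
             h2 * f[gy * pitch + gx]);
    }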

Is there really no way to perform iterative methods more efficiently, or am I missing something?

Pretty much the only option for synchronization between blocks that I have come up with is letting the kernel finish, and then running another one (or the same one again). That’s the only way to be sure everything has been written back to global memory. That might be the same as what you meant by “multi-pass”, though.
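In code, something like this (a sketch, reusing the hypothetical jacobiStep kernel from above; the host-side names are made up):

    // Multi-pass driver: one kernel launch per iteration. The implicit
    // synchronization at the end of each launch guarantees that all
    // border values are in global memory before the next pass reads them.
    for (int it = 0; it < numIters; ++it) {
        jacobiStep<<<grid, block>>>(d_in, d_out, d_f, h2, pitch);
        float *tmp = d_in; d_in = d_out; d_out = tmp;   // ping-pong buffers
    }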

Yes, that is the multi-pass approach (sorry for the OpenGL terminology :-). The problem with this method is that you have to transfer all of the processed data from/to global memory, while it would be more efficient to transfer only the border values and leave the rest in shared memory.

But it is probably really impossible without global synchronization :(.

Any plans for adding it into some future generation of GPUs? :)

Actually, there’s something better you can do. ScottL from NVIDIA has done a bunch of iterative PDE solver work in CUDA and he gave me some tips to share:

Assuming you don't need to output the result of every iteration, but could instead settle for the output of every k iterations, there is a simple optimization here that will dramatically improve performance. Instead of reading an n x n tile into shared memory, read an (n + 2k) x (n + 2k) tile into shared memory. Now perform the first time step on the central (n + 2k - 2) x (n + 2k - 2) neighborhood and write the results back to shared memory (note: not global memory!). Next, perform a second time step on the central (n + 2k - 4) x (n + 2k - 4) neighborhood and write that back to shared memory. Repeat this process until you perform the kth time step on the central n x n neighborhood. Write this final n x n neighborhood out to global memory.
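Here's how I'd sketch that for the 5-point Jacobi stencil (my own illustration, not Scott's actual code: it assumes one thread per cell of the enlarged tile, double-buffers in shared memory to avoid read/write hazards within a step, and leaves out the right-hand side and the clamping at the domain edges):

    // k time steps per kernel launch on an (N + 2K) x (N + 2K) tile.
    #define K    4
    #define N    8
    #define TILE (N + 2 * K)   // 16 -> blockDim = (16, 16)

    __global__ void jacobi_k_steps(const float *in, float *out, int pitch)
    {
        // Two shared buffers so a step never reads cells it is overwriting.
        __shared__ float buf[2][TILE][TILE];

        int tx = threadIdx.x, ty = threadIdx.y;
        // Blocks advance by N (not TILE), so their reads overlap by 2K cells.
        int gx = blockIdx.x * N + tx - K;
        int gy = blockIdx.y * N + ty - K;

        buf[0][ty][tx] = in[gy * pitch + gx];
        __syncthreads();

        int src = 0;
        for (int s = 1; s <= K; ++s) {
            // After step s, only the central (TILE - 2s)^2 cells are valid.
            if (tx >= s && tx < TILE - s && ty >= s && ty < TILE - s)
                buf[1 - src][ty][tx] = 0.25f *
                    (buf[src][ty][tx - 1] + buf[src][ty][tx + 1] +
                     buf[src][ty - 1][tx] + buf[src][ty + 1][tx]);
            __syncthreads();
            src = 1 - src;
        }

        // Only the central N x N block is valid after K steps; write it out.
        if (tx >= K && tx < TILE - K && ty >= K && ty < TILE - K)
            out[gy * pitch + gx] = buf[src][ty][tx];
    }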

Now each thread block can move in a vertical or horizontal swath through the array. Since the thread blocks will now read overlapping data from the input array (there’s a k x m overlap region where m is the entire length or width of the array), you have to double buffer the input/output arrays in global memory or different thread blocks walking adjacent swaths will stomp on each other’s input.
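With the hypothetical jacobi_k_steps kernel above, the driver might look like this (this sketch launches one block per tile instead of having blocks walk whole swaths, but the double-buffering requirement is the same; the names are made up):

    // Two full copies of the grid in global memory. Each k-step launch
    // reads one copy and writes the other, so a block re-reading the
    // overlap strip never sees a neighboring block's fresh output.
    float *bufA, *bufB;                   // both cudaMalloc'ed; bufA = input
    dim3 block(TILE, TILE);
    dim3 grid(width / N, height / N);     // tiles stride by N, not TILE
    for (int step = 0; step < totalIters; step += K) {
        jacobi_k_steps<<<grid, block>>>(bufA, bufB, pitch);
        float *tmp = bufA; bufA = bufB; bufB = tmp;   // swap for next k steps
    }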

Finally, between n x n tiles within the same swath, one can cache away a 2k x n region in the registers (say 2 cells per thread) and restore it between k-step tile calculations.

We’ve seen 5-10x speedups with this approach.

Mark (and ScottL)

Thanks to both you and ScottL for your suggestions. Everything is clear except the last part:

I'm not quite sure whether I understand this correctly. What is not clear to me is how I can cache some data in registers and restore it when executing another block (k-step tile calculation).

I can imagine that when reading input data from texture memory, the data will be cached in the multiprocessor's texture cache and can be reused by other blocks executed on that multiprocessor (as one multiprocessor can process multiple blocks concurrently). However, I'm not sure how to cache data in registers. I'll probably have to reread the Programming Guide once more :).

I don’t think “between” means “shared” in this case. I think Scott meant physically between, as in the cells that are overlapped by two swaths.

Mark

Is any of Scott's PDE iterative solver code available to users of CUDA? I am also trying to develop a PDE solver. I'm trying to see how I might translate the work of Allen Sanderson on computing solutions for Reactive Flow.

Unfortunately we don’t have it ready for the SDK yet. We’ll get it ported over soon.

Mark

I want to implement a PDE solver in 3D. The problem is that if I need to perform 4 iterations, i.e. k = 4, then the maximum n we can choose so that the (n + 2k)^3 volume fits in shared memory is 16 - 2 * 4 = 8. That means we compute the result for an 8^3 = 512 cube, but we have to read a 16^3 volume, which is 8 times more data. It would only be worth doing if I performed 8 iterations instead of 4 (in fact, we cannot even fit the 16^3 cube: against the 16 KB shared memory limit, 16^3 floats take 16^3 * 4 bytes = 16 KB, leaving no shared memory for anything else).

Even if I need only 2 iterations, I read the 16^3 = 4096-element volume while performing the calculation for only 12^3 = 1728 elements, which is still more than 2 times the data. So in any case this approach cannot give any performance gain over the multi-pass one.
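A quick host-side check of that redundancy for a 16^3 tile (plain C, just running the arithmetic above for k = 1..4):

    #include <stdio.h>

    int main(void)
    {
        const int tile = 16;              // 16^3 floats = 16 KB of shared memory
        for (int k = 1; k <= 4; ++k) {
            int n = tile - 2 * k;         // cells actually produced per side
            double redundancy = (double)(tile * tile * tile) / (n * n * n);
            printf("k=%d: read %d cells, compute %d^3 = %d, redundancy %.2fx\n",
                   k, tile * tile * tile, n, n * n * n, redundancy);
        }
        return 0;
    }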

So it'd be better if we had

  • more than 16 KB of shared memory

  • and/or global synchronization between thread blocks.

Is there any other, better approach in 3D? Thank you.