Adapting Recursive and Block Algorithms for GPU Memory Hierarchy and Performance Optimization

I am asking about the Gaussian elimination algorithm in particular, but it is also good to know the general behaviour.

On CPUs, a recursive algorithm typically reduces the problem size until the subproblem fits into progressively smaller cache levels (L3, L2, L1). Another approach is the block algorithm, which processes the data in fixed-size blocks.
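For concreteness, here is roughly what I mean on the CPU side, with a plain array reduction standing in for the elimination step (the cutoff size is only illustrative):

```cpp
#include <cstddef>

// Illustrative stand-in for "fits into L1".
constexpr std::size_t CUTOFF = 4096;

// Recursive approach: split until the subproblem is cache-resident,
// then fall through to a simple loop.
double sum_recursive(const double* data, std::size_t n) {
    if (n <= CUTOFF) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
    }
    std::size_t half = n / 2;
    return sum_recursive(data, half) + sum_recursive(data + half, n - half);
}

// Block approach: walk the data in fixed-size tiles.
double sum_blocked(const double* data, std::size_t n) {
    double s = 0.0;
    for (std::size_t b = 0; b < n; b += CUTOFF) {
        const std::size_t end = (b + CUTOFF < n) ? b + CUTOFF : n;
        for (std::size_t i = b; i < end; ++i) s += data[i];
    }
    return s;
}
```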

How do these approaches translate to the GPU world, where the memory hierarchy includes global memory, shared memory, and registers? Specifically, how do recursive and block algorithms differ in their use of GPU memory and performance optimization? Are there any references or resources that provide a detailed examination of these two approaches on GPUs?

Which one is preferred in the GPU world?

On the GPU a highly optimized algorithm would typically not use recursion directly, at least not for smaller problem sizes.

Recursion prevents the effective use of the large register set, and writing values to a stack in local memory (even if it is covered by the L1 cache) often slows down the code.
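As an illustration of that point, compare a recursive device function with a fixed-size, fully unrolled one (a minimal sketch; the compiler's actual register allocation and spilling decisions depend on the code and the target architecture):

```cpp
// Recursive variant: per-call state generally cannot stay in registers,
// so it is spilled to a stack in local memory (device memory backed by
// the L1 cache), which tends to be slow.
__device__ float sum_rec(const float* x, int n) {
    if (n == 1) return x[0];
    int half = n / 2;
    return sum_rec(x, half) + sum_rec(x + half, n - half);
}

// Fixed-size variant: with a compile-time size the loop can be fully
// unrolled and the intermediate values can live entirely in registers.
template <int N>
__device__ float sum_fixed(const float* x) {
    float s = 0.0f;
#pragma unroll
    for (int i = 0; i < N; ++i) s += x[i];
    return s;
}
```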

There are different techniques for optimizing for certain problem sizes, so you would create fixed functions for problem sizes 1, 2, 4, 8, 16, 32, 64, 128, 256, …, which can (but only sometimes do) in turn call (or inline) the next smaller ones.
For those problem sizes, special ways of synchronization and optimization at thread, warp, block and grid level can be applied, as sketched below.
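To make that concrete, here is a hedged sketch (my own, not taken from any particular library) of such a family of fixed-size functions expressed with C++ templates, using a block-wide sum reduction as the example. Each size delegates to the next smaller one at compile time, and sizes that fit within a single warp switch to shuffle instructions with no block-wide synchronization. It assumes C++17 (`if constexpr`), a power-of-two N with 32 <= N <= 1024, and a launch with blockDim.x == N:

```cpp
// N = number of participating threads; requires blockDim.x == N and
// a shared-memory buffer of N floats. Thread 0 ends up with the sum.
template <int N>
__device__ float block_reduce(float v, float* smem) {
    if constexpr (N > 32) {
        smem[threadIdx.x] = v;
        __syncthreads();                      // block-wide sync required
        if (threadIdx.x < N / 2)
            v = smem[threadIdx.x] + smem[threadIdx.x + N / 2];
        __syncthreads();
        return block_reduce<N / 2>(v, smem);  // "call" to the next smaller
                                              // size, resolved at compile time
    } else {
        // Warp-level special case: shuffles, no shared memory,
        // no block-wide synchronization.
        for (int off = N / 2; off > 0; off /= 2)
            v += __shfl_down_sync(0xffffffffu, v, off);
        return v;
    }
}

// Example caller: one 256-thread block reduces 256 elements.
__global__ void reduce256(const float* in, float* out) {
    __shared__ float smem[256];
    float v = in[blockIdx.x * 256 + threadIdx.x];
    float s = block_reduce<256>(v, smem);
    if (threadIdx.x == 0) out[blockIdx.x] = s;
}
```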

Both such families of fixed-problem-size functions and block algorithms can be found a lot in the GPU world.
(You can find recursive algorithms, too, of course.)

Also, algorithms would more often work out-of-place than in-place, if possible, as that is closer to a streaming model, where some data is read-only and other data is write-once, which simplifies concurrency and caching. (The two main exceptions are data structures that have to be modified multiple times or read back by the current stage of the algorithm, and large datasets, where memory size is an issue.)
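A trivial sketch of that streaming pattern (again my own example): the out-of-place kernel declares its input read-only and writes each output element exactly once, while the in-place variant aliases its reads and writes:

```cpp
// Out-of-place: x is read-only, y is write-once. __restrict__ tells the
// compiler the buffers do not alias, enabling the read-only data path.
__global__ void scale_out_of_place(const float* __restrict__ x,
                                   float* __restrict__ y,
                                   float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i];   // each output element is written exactly once
}

// In-place: reads and writes the same buffer. Still correct here because
// each thread touches only its own element, but the aliasing forfeits the
// read-only path and complicates more general access patterns.
__global__ void scale_in_place(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = a * x[i];
}
```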


@Curefab Thank you very much. Do you have any references for the part you mentioned, or more information backed by some evidence? How does this happen?

Recursion prevents the effective use of the large register set, and writing values to a stack in local memory (even if it is covered by the L1 cache) often slows down the code.

Another question concerns the recursive block algorithm. What does this term mean? Is it a mix of the two approaches?

While not directly addressing your query, this blog illustrates a high-performance approach to proceeding through the hierarchy efficiently.

I do not have any documentation, scientific or forum references at hand.

A good avenue is to look at highly optimized libraries, e.g. for linear algebra or FFT. Nvidia provides device-side versions, which are available as source code. See for example cuFFTDx Download | NVIDIA Developer