Adapting Recursive and Block Algorithms for GPU Memory Hierarchy and Performance Optimization

My question concerns the Gaussian elimination algorithm specifically, but the general behaviour is also good to know.

On CPUs, a recursive algorithm typically subdivides the problem until it fits into progressively smaller cache levels (L3, L2, L1). Another approach is the block algorithm, which processes data in fixed-size blocks.
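As a toy illustration of what I mean (plain C++, a simple reduction rather than Gaussian elimination; the function names are just for illustration):

```cpp
#include <cstddef>

// Recursive (cache-oblivious) style: halve the problem until the
// subproblem fits into some cache level.
double sum_recursive(const double* a, std::size_t n) {
    if (n <= 1024) {                       // base case assumed to fit in L1
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    std::size_t half = n / 2;
    return sum_recursive(a, half) + sum_recursive(a + half, n - half);
}

// Blocked style: one fixed block size chosen for a known cache size.
double sum_blocked(const double* a, std::size_t n) {
    const std::size_t BLOCK = 4096;        // tuned to a specific cache
    double s = 0.0;
    for (std::size_t j = 0; j < n; j += BLOCK) {
        std::size_t ib = (n - j < BLOCK) ? (n - j) : BLOCK;
        for (std::size_t i = 0; i < ib; ++i) s += a[j + i];
    }
    return s;
}
```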

How do these approaches translate to the GPU world, where the memory hierarchy includes global memory, shared memory, and registers? Specifically, how do recursive and block algorithms differ in their use of GPU memory and performance optimization? Are there any references or resources that provide a detailed examination of these two approaches on GPUs?

Which one is preferred in the GPU world?

On the GPU a highly optimized algorithm would typically not use recursion directly, at least not for smaller problem sizes.

Recursion prevents the effective use of a large register set, and writing values to a stack in local memory (even if it is covered by L1 caches) often slows down the code.

There are different techniques for optimizing for certain problem sizes, so you would create fixed functions for problem size 1, 2, 4, 8, 16, 32, 64, 128, 256, …, which can (but only sometimes do) in turn call (or inline) the next smaller ones.
For those problem sizes, special ways of synchronization and optimization on thread, warp, block and grid-level can be applied.
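A minimal sketch of what such a family of fixed-problem-size functions could look like, using the size as a template parameter (everything here is illustrative and assumes a power-of-two size; each specialization could also be hand-optimized instead):

```cpp
// One device function per fixed problem size; each instantiation may call
// (and inline) the next smaller one, and the compiler can fully unroll
// and keep all intermediate values in registers.
template <int N>
__device__ void solve(float* x) {
    solve<N / 2>(x);                 // next smaller problem size
    solve<N / 2>(x + N / 2);
    #pragma unroll
    for (int i = 0; i < N / 2; ++i)  // combine step, fully unrolled
        x[i] += x[i + N / 2];
}

// Hand-written base case for problem size 1.
template <>
__device__ void solve<1>(float* /*x*/) { /* nothing to combine */ }
```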

Both such sets of fixed-problem-size functions and block algorithms are very common in the GPU world.
(You can find recursive algorithms, too, of course.)

Also, algorithms more often work out-of-place than in-place, if possible, as that is closer to a streaming model, where some data is read-only and other data is write-once, which simplifies concurrency and caching. (The two main exceptions are data structures that have to be changed multiple times or read back by the current stage of the algorithm, and large datasets, where memory size is an issue.)
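A minimal sketch of that streaming pattern (an illustrative kernel, not from any particular library): the input is read-only, the output is write-once, so there is no read-after-write hazard between threads.

```cpp
__global__ void scale_out_of_place(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = alpha * in[i];  // each output element is written exactly once
}
```

The `__restrict__` qualifiers tell the compiler that `in` and `out` do not alias, which is exactly what out-of-place operation guarantees and what helps caching.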


@Curefab Thank you very much. Do you have any references for the part you mentioned? Or more information backed by some evidence? Or an explanation of how this happens?

Recursion prevents the effective use of a large register set, and writing values to a stack in local memory (even if it is covered by L1 caches) often slows down the code.

Another question concerns the recursive block algorithm. What does this type mean? Is it a mix of the two?

While not directly addressing your query, this blog illustrates a high-performance approach to proceeding through the hierarchy efficiently.

I do not have any documentation, scientific or forum references at hand.

A good avenue is looking for highly optimized libraries like linear algebra or FFT. Nvidia provides device-side versions, which are available as source code. See for example cuFFTDx Download | NVIDIA Developer

Could you please explain more about what you mentioned regarding the use of registers and local memory in recursion? Or suggest a way to test it and gather some statistics? I did not understand the behaviour.

In CUDA SASS (and many other assembly or machine languages), all registers are directly encoded in the instruction word.

If a device function is called (instead of being inlined at compile time), meaning the CUDA thread jumps to the first instruction of the called function, then each local C++ variable that would be stored in a register is always stored in the same register at every level of recursion.

When the function is called inside of itself, it has to save and later restore those registers somewhere, typically on a stack in local memory.

Whereas, if the recursion is unrolled at compile time, different registers can be used for each level of recursion, and saving and restoring registers is not necessary.

Normally, CUDA device functions are always inlined; but there are some exceptions for code from different translation units compiled with certain options, and also for recursive functions.
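A sketch contrasting the two cases (both functions are illustrative; `__noinline__` forces the runtime-call variant):

```cpp
// Runtime recursion: a function cannot be inlined into itself, so each
// call must save/restore live registers on the local-memory stack.
__device__ __noinline__ int sum_runtime(const int* a, int n) {
    if (n == 1) return a[0];
    return sum_runtime(a, n / 2) + sum_runtime(a + n / 2, n - n / 2);
}

// Compile-time recursion: each template instantiation is a distinct
// function, so everything inlines and every "level" can get its own
// registers; no stack traffic is needed.
template <int N>
__device__ int sum_unrolled(const int* a) {
    return sum_unrolled<N / 2>(a) + sum_unrolled<N - N / 2>(a + N / 2);
}
template <>
__device__ int sum_unrolled<1>(const int* a) { return a[0]; }
```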

Thank you for the information.

Can you explain what happens in a hybrid configuration where a CUDA kernel is launched from an algorithm implemented on the CPU in two versions, recursive and blocked? Specifically, I am curious about the outcome when the final stage of the recursive or blocked algorithm is a CUDA kernel, while the rest of the code is CPU C code. Imagine the final-stage kernel is the same in both versions and the size is 32x32.

I am not sure whether in that case the algorithms are sensitive to the CPU memory stack, since I see a queue of kernel launches in both cases and the GPU is performing the tasks. So why is the recursive version slower?

Hi uniadam,
that depends of course.

Okay, assuming the final stage or the inner stage of the algorithm is a CUDA kernel.
If you launch only a small work package, then it cannot be parallelized well.
Try to collect many inner stages in a list / array and launch all of them at once.
This is one way to do it, and it is often done fully on the GPU,
e.g. by implementing the algorithm as two kernels: the first one collects the work and the second one processes it.
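A hypothetical sketch of that two-kernel pattern (names and the work predicate are made up for illustration):

```cpp
struct WorkItem { int row; int col; };

// Kernel 1: collect work items into a global queue via an atomic counter.
__global__ void collect_work(const float* data, int n,
                             WorkItem* queue, int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)                   // some predicate
        queue[atomicAdd(count, 1)] = WorkItem{i / 32, i % 32};
}

// Kernel 2: process all collected items at once; launched with enough
// threads for the maximum possible count, reading the real count on device.
__global__ void process_work(const WorkItem* queue, const int* count,
                             float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < *count)
        out[queue[i].row * 32 + queue[i].col] += 1.0f;  // placeholder stage
}
```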

With recursive algorithms one work package creates the next, so the collection is not independent from the calculation.
Therefore the natural aim is to transform the recursive algorithm either into one that is unrolled, or into one that is converted into a loop, and run it on the GPU.

If the number of recursions is not known in advance, you can unroll part of it, e.g.

`f(f(f(f(f(f(f(...)))))))`

In such a schematic recursive function, you can create a function g, which unrolls f 4 times:

`g(x) = f(f(f(f(x))))`

E.g. for 27 recursions of f, call

`f(f(f(g(g(g(g(g(g(x)))))))))` // 27 = 3 + 6 * 4

That reduces the function calls / recursions from 27 to 9.

In most cases, it is even better to convert recursion into a loop (which can also be unrolled partially)
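As a sketch, the 27-fold application of f from above written as a loop that the compiler partially unrolls (`f` here is just a stand-in body):

```cpp
__device__ float f(float x) { return 0.5f * x + 1.0f; }  // stand-in

__device__ float apply_f_27(float x) {
    #pragma unroll 4             // partial unrolling by the compiler
    for (int i = 0; i < 27; ++i)
        x = f(x);                // f is inlined: no stack, no call overhead
    return x;
}
```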

Thank you. Actually, it has 3 or 4 stages.

In most cases, it is even better to convert recursion into a loop (which can also be unrolled partially)

So in that hybrid configuration, what is the reason for lower performance?

For a more and more complex algorithm, it gets increasingly difficult and fuzzy to discuss general rules. Depending on where the bottlenecks are or how the algorithm can be reformulated, sometimes non-obvious measures can lead to good results.

E.g. for a kernel limited by global memory bandwidth, I separated the work into several stages, creating intermediate results that were much larger in size than the original memory area.
Nevertheless, that led to a huge speed-up. Why? Because with clever assignment of memory accesses to threads, warps and blocks, the L1 and L2 cache hit rate could be increased a lot compared to a single kernel.
That was an optimization by design: one really has to understand the bottlenecks of the code and examine possible variants.

It is really difficult to create a set of fixed rules, one can rather give examples or best practices.

For the hybrid configuration I mentioned, the possible reasons for lower performance are (I repeat some and add others):

  • if the CPU only gives a small workload within the inner loop to the GPU, then it cannot be parallelized well
  • moving data between device memory and system memory is slow
  • the CPU part is perhaps slower than what could be achieved on the GPU alone
  • launching GPU kernels has some latency; a CPU geared more towards high single-thread speed than high-bandwidth parallelization can be slowed down by it
  • creating fast CPU algorithms needs AVX (x64) or Neon (ARM) vector instructions, including changing conditional code into arithmetic branchless code (see the sketch after this list); you cannot be sure that the optimizer does that for you
  • even on the CPU, recursion is not the best way (although it is not as bad as on the GPU)
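As a sketch of the branchless transformation mentioned in the AVX/Neon bullet (plain C++; the same idea helps GPU warps avoid divergence):

```cpp
#include <cmath>

// Conditional version: data-dependent branches.
float clamp_branchy(float x) {
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

// Branchless version: pure min/max arithmetic, which auto-vectorizes
// cleanly on AVX/Neon and maps to single instructions.
float clamp_branchless(float x) {
    return std::fmin(std::fmax(x, 0.0f), 1.0f);
}
```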

Recursion in general is nice to formulate algorithms in. It is especially widespread in functional languages.

But even functional language interpreters (and even more so compilers) try to translate recursion into simpler constructs. E.g. look at Tail call - Wikipedia
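A classic sketch of that transformation (not taken from the article itself): a tail-recursive GCD and the loop a compiler can turn it into.

```cpp
// Tail call: the recursive call is the very last thing the function does,
// so no state has to be kept on the stack across it.
int gcd_recursive(int a, int b) {
    if (b == 0) return a;
    return gcd_recursive(b, a % b);
}

// The tail-call-optimized equivalent: the compiler can reuse the same
// stack frame, which is effectively this loop.
int gcd_loop(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```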

Thank you. Just as an example, please look at this code; imagine the block sizes are equal. In this code, the CPU just calls kernels and does not do any computation.

```cpp
/*
   -- MAGMA (version 2.7.2) --
   Univ. of Tennessee, Knoxville
   Univ. of California, Berkeley
   Univ. of Colorado, Denver
   @date August 2023

   @author Azzam Haidar
   @author Tingxing Dong

   @generated from src/zgetf2_native.cpp, normal z -> d, Fri Aug 25 13:16:49 2023
*/
#include "magma_internal.h"
#include "batched_kernel_param.h"

#define dA(i, j)  (dA + (i) + (j)*ldda)
#define PARSWAP

magma_int_t
magma_dgetf2_native_blocked(
    magma_int_t m, magma_int_t n,
    magmaDouble_ptr dA, magma_int_t ldda,
    magma_int_t *dipiv, magma_int_t *dinfo,
    magma_int_t gbstep, magma_queue_t queue)
{
    magma_int_t arginfo = 0;
    if (m < 0) {
        arginfo = -1;
    } else if (n < 0 ) {
        arginfo = -2;
    } else if (ldda < max(1,m)) {
        arginfo = -4;
    }

    if (arginfo != 0) {
        magma_xerbla( __func__, -(arginfo) );
        return arginfo;
    }

    // Quick return if possible
    if (m == 0 || n == 0) {
        return arginfo;
    }

    double c_neg_one = MAGMA_D_NEG_ONE;
    double c_one     = MAGMA_D_ONE;
    magma_int_t nb = BATF2_NB;

    magma_int_t min_mn = min(m, n);
    magma_int_t gbj, j, step, ib;

    for( j=0; j < min_mn; j += nb){
        ib = min(nb, min_mn-j);
        for (step=0; step < ib; step++) {
            gbj = j+step;
            // find the max of the column gbj
            arginfo = magma_idamax_native( m-gbj, dA(gbj,gbj), 1, dipiv+gbj, dinfo, gbj, gbstep, queue);

            if (arginfo != 0 ) return arginfo;
            // Apply the interchange to columns 1:N. swap the whole row
            magma_dswap_native(n, dA, ldda, gbj, dipiv, queue);
            if (arginfo != 0 ) return arginfo;
            // Compute elements J+1:M of J-th column.
            if (gbj < m) {
                arginfo = magma_dscal_dger_native( m-gbj, ib-step, dA(gbj,gbj), ldda, dinfo, gbj, gbstep, queue );
                if (arginfo != 0 ) return arginfo;
            }
        }

        if ( (n-j-ib) > 0) {
            magma_dgetf2trsm_2d_native( ib, n-j-ib,
                                        dA(j,j   ), ldda,
                                        dA(j,j+ib), ldda, queue);

            magma_dgemm( MagmaNoTrans, MagmaNoTrans,
                         m-(j+ib), n-(j+ib), ib,
                         c_neg_one, dA(ib+j, j   ), ldda,
                                    dA(j   , ib+j), ldda,
                         c_one,     dA(ib+j, ib+j), ldda,
                                    queue);
        }
    }

    return 0;
}

magma_int_t
magma_dgetf2_native_recursive(
    magma_int_t m, magma_int_t n,
    magmaDouble_ptr dA, magma_int_t ldda,
    magma_int_t *dipiv, magma_int_t *dipivinfo,
    magma_int_t *dinfo, magma_int_t gbstep,
    magma_event_t events[2], magma_queue_t queue_0, magma_queue_t queue_1)
{
    magma_int_t arginfo = 0;
    if (m < 0 || m > DGETF2_FUSED_MAX_M) {
        arginfo = -1;
    } else if (n < 0 ) {
        arginfo = -2;
    } else if (ldda < max(1,m)) {
        arginfo = -4;
    }

    if (arginfo != 0) {
        magma_xerbla( __func__, -(arginfo) );
        return arginfo;
    }

    // Quick return if possible
    if (m == 0 || n == 0) {
        return arginfo;
    }

    //magma_event_t e[2];
    //magma_event_create( &e[0] );
    //magma_event_create( &e[1] );

    magma_int_t nb;
    magma_int_t sm_count = magma_getdevice_multiprocessor_count();
    if     (sm_count >= 32){nb = 32;}
    else if(sm_count >= 16){nb = 16;}
    else if(sm_count >=  8){nb =  8;}
    else if(sm_count >=  4){nb =  4;}
    else if(sm_count >=  2){nb =  2;}
    else                   {nb =  1;}

    if( n <= nb){
        magma_int_t* update_flags = dipivinfo;
        // wait for all kernels in the update queue to end before calling the panel kernel
        magma_event_record( events[0], queue_1 );
        magma_queue_wait_event( queue_0, events[0] );
        magma_dgetf2_native_fused( m, n, dA(0,0), ldda, dipiv, gbstep, update_flags, dinfo, queue_0 );
        magma_event_record( events[1], queue_0 );
        magma_queue_wait_event( queue_1, events[1] );
        return 0;
    }
    else{
        magma_int_t n1 = n / 2;
        magma_int_t n2 = n - n1;

        // lu on A1
        magma_dgetf2_native_recursive(m, n1, dA(0,0), ldda, dipiv, dipivinfo, dinfo, gbstep, events, queue_0, queue_1 );

        // swap left
        #ifdef PARSWAP
        setup_pivinfo( dipivinfo, dipiv, m, n1, queue_0);  // setup pivinfo
        magma_dlaswp_rowparallel_native( n2, dA(0,n1), ldda, dA(0,n1), ldda, 0, n1, dipivinfo, queue_0);
        #else
        magma_dlaswp_rowserial_native(n2, dA(0,n1), ldda, 0, n1, dipiv, queue_0);
        #endif

        // update (trsm + gemm)
        magma_dgetf2trsm_2d_native( n1, n2,
                                    dA(0,0), ldda,
                                    dA(0,n1), ldda, queue_0);

        magma_dgemm( MagmaNoTrans, MagmaNoTrans,
                     m-n1, n2, n1,
                     MAGMA_D_NEG_ONE, dA(n1,  0), ldda,
                                      dA(0 , n1), ldda,
                     MAGMA_D_ONE,     dA(n1, n1), ldda, queue_0 );

        // lu on A2
        magma_dgetf2_native_recursive(m-n1, n2, dA(n1,n1), ldda, dipiv+n1, dipivinfo, dinfo, gbstep, events, queue_0, queue_1 );

        // swap right: if PARSWAP is set, we need to call setup_pivinfo
        #ifdef PARSWAP
        setup_pivinfo( dipivinfo, dipiv+n1, m-n1, n2, queue_0);  // setup pivinfo
        #endif

        adjust_ipiv( dipiv+n1, n2, n1, queue_0);

        #ifdef PARSWAP
        magma_dlaswp_rowparallel_native(n1, dA(n1,0), ldda, dA(n1,0), ldda, n1, n, dipivinfo, queue_0);
        #else
        magma_dlaswp_rowserial_native(n1, dA(0,0), ldda, n1, n, dipiv, queue_0);
        #endif
    }

    //magma_event_destroy( e[0] );
    //magma_event_destroy( e[1] );

    return 0;
}
/***************************************************************************//**
    Purpose
    -------
    DGETF2 computes an LU factorization of a general M-by-N matrix A
    using partial pivoting with row interchanges.

    The factorization has the form
        A = P * L * U
    where P is a permutation matrix, L is lower triangular with unit
    diagonal elements (lower trapezoidal if m > n), and U is upper
    triangular (upper trapezoidal if m < n).

    This is the right-looking Level 3 BLAS version of the algorithm.

    This is a GPU-only routine. The host CPU is not used.

    Arguments
    ---------
    @param[in]
    m       INTEGER
            The number of rows of each matrix A.  M >= 0.

    @param[in]
    n       INTEGER
            The number of columns of each matrix A.  N >= 0.

    @param[in,out]
    dA      A DOUBLE PRECISION array on the GPU, dimension (LDDA,N).
            On entry, an M-by-N matrix to be factored.
            On exit, the factors L and U from the factorization
            A = P*L*U; the unit diagonal elements of L are not stored.

    @param[in]
    ldda    INTEGER
            The leading dimension of A.  LDDA >= max(1,M).

    @param[out]
    dipiv   An INTEGER array, dimension (min(M,N))
            The pivot indices; for 1 <= i <= min(M,N), row i of the
            matrix was interchanged with row IPIV(i).

    @param[out]
    dipivinfo  An INTEGER array, for internal use.

    @param[out]
    dinfo         INTEGER, stored on the GPU
      -     = 0:  successful exit
      -     < 0:  if INFO = -i, the i-th argument had an illegal value
                  or another error occured, such as memory allocation failed.
      -     > 0:  if INFO = i, U(i,i) is exactly zero. The factorization
                  has been completed, but the factor U is exactly
                  singular, and division by zero will occur if it is used
                  to solve a system of equations.

    @param[in]
    gbstep  INTEGER
            internal use.

    @param[in]
    queues  magma_queue_t array of size 2.
            Queues to execute in.

    @param[in]
    events  magma_event_t array of size 2
            Internal use.

    This is an internal routine.

    @ingroup magma_getf2_batched
*******************************************************************************/
extern "C" magma_int_t
magma_dgetf2_native(
    magma_int_t m, magma_int_t n,
    magmaDouble_ptr dA, magma_int_t ldda,
    magma_int_t *dipiv, magma_int_t* dipivinfo,
    magma_int_t *dinfo, magma_int_t gbstep,
    magma_event_t events[2], magma_queue_t queue, magma_queue_t update_queue)
{
    #if defined(MAGMA_HAVE_CUDA) || defined(MAGMA_HAVE_HIP)
    magma_int_t arch = magma_getdevice_arch();
    if(m > DGETF2_FUSED_MAX_M || arch < 300){
      magma_dgetf2_native_blocked(m, n, dA, ldda, dipiv, dinfo, gbstep, queue);
    }
    else{
      magma_dgetf2_native_recursive(m, n, dA, ldda, dipiv, dipivinfo, dinfo, gbstep, events, queue, update_queue);
    }
    #else
    magma_dgetf2_native_blocked(m, n, dA, ldda, dipiv, dinfo, gbstep, queue);
    #endif    // MAGMA_HAVE_CUDA
    return 0;
}
```

If your matrices have 1 million rows or more, you could do a hybrid approach with the CPU calling GPU functions to find the max of a column or swap rows.

Otherwise you should try to parallelize the outer loops, e.g. send 1000 columns with 1000 rows each to one kernel, which returns 1000 max values (one for each column).

Or compute the same algorithm for 1000 independent matrices.

Or you find a parallel formulation of the overall task. That is not necessarily easy.
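As a sketch of the column-wise idea above (one thread block per column, column-major layout with leading dimension `ldda` as in the MAGMA code; the kernel name and the fixed block size of 256 threads are illustrative):

```cpp
__global__ void column_abs_max(const double* dA, int m, int ldda,
                               double* col_max)
{
    __shared__ double smax[256];                         // one slot per thread
    const double* col = dA + (size_t)blockIdx.x * ldda;  // this block's column
    double v = 0.0;
    for (int i = threadIdx.x; i < m; i += blockDim.x)
        v = fmax(v, fabs(col[i]));                       // stride over the rows
    smax[threadIdx.x] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {       // shared-memory tree reduction
        if (threadIdx.x < s)
            smax[threadIdx.x] = fmax(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        col_max[blockIdx.x] = smax[0];                   // one max per column
}

// e.g. 1000 columns of 1000 rows each:
// column_abs_max<<<1000, 256>>>(dA, 1000, ldda, col_max);
```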

@uniadam

Please use proper code formatting. Please edit your most recent post. A typical method:

  1. edit the post by selecting the pencil icon beneath
  2. select the code
  3. press the </> button
  4. save your changes

Please do it now.

What do you think about the following items as reasons for better performance of the blocked algorithm in hybrid mode, when the CUDA kernel launched by the CPU is identical in both versions?

  • Memory Access Patterns: Blocked methods often optimize memory access patterns more effectively. If the memory access patterns are better aligned with the GPU architecture, this can lead to more efficient computation.
  • Kernel Synchronization: If the recursive method requires more frequent synchronization points or intra-kernel synchronization, this can introduce additional latency.
  • Simpler Execution: Reduced complexity and synchronization requirements can lead to smoother execution.

Sorry, now fixed.

You are very much dealing in hypotheticals.

I think all three are valid reasons, but I would see none of them as the main reasons (which I mentioned in the posts before) that make hybrid or recursive algorithms especially difficult for high performance.

But as to your three listed reasons:

  • You can try to optimize the memory access patterns of recursive algorithms; you can even have an iteration counter as a parameter and reserve buffer memory, or memory for input and output, only for that recursive iteration. So with enough optimization, memory accesses could be similar. The block-based algorithm can be optimized for its block size, but so can recursive algorithms, if you make the iteration a template parameter (see the sketch after this list). But I admit that it would be a bit simpler to optimize with (fixed) block-based algorithms.

  • The listed arguments against intra-kernel synchronization are rather general arguments against hybrid mode.

  • Simpler Execution: I would rather say that an algorithm can be formulated in a simple way either as a recursive algorithm or as a block-based algorithm.
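A sketch of the template-parameter idea from the first bullet (illustrative; assumes a per-block buffer of 512 floats and C++17 for `if constexpr`):

```cpp
// The recursion level is a compile-time constant, so every level can get
// its own problem size, unrolling and buffer layout.
template <int LEVEL>
__device__ void stage(float* buf) {
    constexpr int SIZE = 256 >> LEVEL;   // per-level problem size
    for (int i = threadIdx.x; i < SIZE; i += blockDim.x)
        buf[i] += buf[i + SIZE];         // per-level combine step
    __syncthreads();                     // levels depend on each other
    if constexpr (LEVEL < 7)             // compile-time recursion bound
        stage<LEVEL + 1>(buf);           // inlined, no runtime recursion
}
```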

Are you trying to understand, writing a paper, or trying to find the best way to go?
In each of the three cases I would recommend trying to optimize different versions / paradigms. You would get the most out of it.

@Curefab Thank you. I am trying to understand and write a paper. And it seems that the more I understand, the more I can change the algorithms.

There are no first principles that determine which kind of algorithm is faster. There are only best practices.

Whether a certain approach works depends a lot on the skills of the implementer, the algorithm and the data size, and to a somewhat lesser degree on the architecture version and the concrete GPU.

Performance could mean bandwidth or latency. You may or may not need to keep allocated memory in check, and there could be other kernels running at the same time that need PCIe or memory bandwidth, shared memory space, or certain computation units in particular.

For some algorithms, correctness, or how easy the code is to create in the first place, to understand, or to change, is more important than performance.

It will be difficult to really optimize one approach, let alone several for a fair comparison.
You would have to find out where the bottlenecks are and optimize those further and further, so that the remaining performance difference is caused only by the different approaches.

Only after that would I start to list advantages and disadvantages of the approaches. On the other hand, it can help to get an idea beforehand - as you did - of what categories of differences to expect.