Warp Reduction in Kernel with "if" guard

I have a very computationally heavy kernel that consumes many inputs, writes many outputs, and does grid reductions within the kernel to avoid writing even more outputs. The simple way to handle the reductions is to have every thread atomicAdd to the same output. But I’d like to try to do a warp reduction before the atomicAdd: is this possible without undefined behavior? Here’s a simplified version of my code.

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input) {
        float val = extremelyInvolvedFunction(input);
        unsigned mask = __activemask();
        for (int offset = 16; offset > 0; offset /= 2) {
            val += __shfl_down_sync(mask, val, offset); // If 0 < len % 32 < 16, then this is undefined on lane 0 on first iteration?
        }
        if (threadIdx.x % 32 == 0) {
            atomicAdd(input.reductionTarget, val);
        }
}

__global__ void big_kernel(BigDataStructure input, size_t len) {
    if (threadIdx.x < len) {
        ... // Expensive stuff
        computeExtremelyInvolvedFunctionThenReduce(input);
        ... // Expensive stuff
    }
}

I believe the conclusion of the thread here: confuse about warp-level mask is that, in a kernel where the computation is guarded by a length check and the length is not a multiple of the warp size, it is not well-defined behavior to do a warp reduction inside the guard. The only ways around this are to restructure the code to initialize the variables outside the guard and do the reductions after the guard (not feasible if computeExtremelyInvolvedFunctionThenReduce is meant to be a standalone function that can be called from multiple kernels), or to pad the data arrays with dummy data so the length becomes a multiple of 32 (feasible, but it forces significant external code changes).
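For concreteness, the first workaround (initialize outside the guard, reduce after it) would look roughly like this on my simplified kernel. This is only a sketch reusing the names from above, and it only works by folding the reduction back into each kernel body:

__global__ void big_kernel(BigDataStructure input, size_t len) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = 0.0f;                              // identity value for lanes past len
    if (i < len) {
        val = extremelyInvolvedFunction(input);    // expensive work stays inside the guard
    }
    // Every lane of every warp reaches this point, so the full mask is legal here.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFFu, val, offset);
    }
    if (threadIdx.x % 32 == 0) {
        atomicAdd(input.reductionTarget, val);
    }
}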

Is this conclusion correct, or is there a way to do a warp-reduction inside an if-guard that may mask off the highest threads in the warp without invoking undefined behavior?

You should be able to condition the shfl_down_sync op using both conditional code and a mask that matches the intended active lanes, to handle the cases in the “guard zone”.

Do you mean something like:

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input) {
        float val = extremelyInvolvedFunction(input);
        unsigned mask = __activemask();
        for (unsigned offset = 16; offset > 0; offset >>= 1) {
            // Every active lane executes the shuffle; only use the result if the
            // source lane (this lane + offset) is itself named in the mask.
            float other = __shfl_down_sync(mask, val, offset);
            unsigned src = threadIdx.x % 32 + offset;
            val += (src < 32 && (mask & (1u << src))) ? other : 0.0f;
        }
        if (threadIdx.x % 32 == 0) {
            atomicAdd(input.reductionTarget, val);
        }
}

Or, instead of __activemask(), must we architect the code to pass a precomputed mask down into computeExtremelyInvolvedFunctionThenReduce?

Something like that may work.

I would precompute the thread/warp state and the mask, and pass that into functions that need it. I don’t really understand your skeleton code; for example, this line makes no sense to me:

if (threadIdx.x < len) {

You also haven’t shown the launch configuration methodology, so I can’t really be specific. If we assume a typical block round-up strategy, then only the last block in the grid can contain an incomplete warp (I’m imagining a different code now, not the one you have shown, which I don’t understand). I would detect that condition and, for each warp in the last block, identify its warp configuration (which threads are active) and use that, passing it into functions that need it (sketched below the list).

If we had a partial warp, then the key things would be:

  • make the mask conform to the partial warp configuration
  • make sure, in any shuffle op, to only direct threads to use values that are in the partial warp configuration/mask. So, for example, a thread that grabs a value from another place in the warp should only consider the result valid if that other place in the warp corresponds to an active thread identified in the mask.

You can certainly make a shuffle down reduction that behaves like that.
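For the mask itself, I have in mind something along these lines. This is just a sketch (activeLaneMask is an illustrative name, not a CUDA builtin), assuming a 1D launch, blockDim.x a multiple of 32, and the usual round-up grid sizing, so only the tail of the last warp in the grid can be partial:

// Illustrative helper: returns a mask with a bit set for each lane of this warp
// that satisfies i < len. For every warp except possibly the last one in the
// grid this is the full mask.
__device__ unsigned activeLaneMask(size_t len) {
    const unsigned i         = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned lane      = threadIdx.x % 32;
    const unsigned warpStart = i - lane;            // global index held by lane 0 of this warp
    if (warpStart + 32 <= len) return 0xFFFFFFFFu;  // complete warp
    if (warpStart >= len)      return 0u;           // warp entirely past the end
    return (1u << (unsigned)(len - warpStart)) - 1u; // low (len - warpStart) lanes are active
}

This would be computed once near the top of the kernel and handed to any function that wants to do a warp-level reduction.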

To clarify, I borrowed that line from the linked thread, where in that instance len was 25 (and presumably only one warp was configured to launch). But sure, replace that line with:

const int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < len) {

and assume a typical block round-up strategy, so there is at most one partial warp in a grid of arbitrary size, and it is divided into two contiguous portions: the lower lanes pass the check and the upper lanes fail.

I think, based on what you’ve said, this works if the mask is set properly:

        for (unsigned offset = 16; offset > 0; offset >>= 1) {
            float other = __shfl_down_sync(mask, val, offset);
            unsigned src = threadIdx.x % 32 + offset;
            val += (src < 32 && (mask & (1u << src))) ? other : 0.0f;
        }

where the ternary conditional is doing the “make sure to only direct threads to use values that are in the partial warp configuration/mask” key thing.

I guess then the only worry is whether we absolutely need to pass down a calculated mask (unsigned mask = __ballot_sync(0xFFFFFFFF, i < len); at the beginning of the kernel), or whether __activemask() is sufficient. I’m guessing that so long as there is even one other branch in the entire program besides the initial if-guard, the warp is allowed to be further diverged, so there’s no way to do this “obliviously” and we absolutely must update library function signatures with this mask to do warp reductions inside them.
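Concretely, I’m imagining a signature change like this (a sketch only, reusing the masked loop from above, with the mask computed while the warp is still fully converged):

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input,
                                                           unsigned mask) {
    float val = extremelyInvolvedFunction(input);
    for (unsigned offset = 16; offset > 0; offset >>= 1) {
        float other = __shfl_down_sync(mask, val, offset);
        unsigned src = threadIdx.x % 32 + offset;           // lane the value came from
        val += (src < 32 && (mask & (1u << src))) ? other : 0.0f;
    }
    if (threadIdx.x % 32 == 0) {
        atomicAdd(input.reductionTarget, val);
    }
}

__global__ void big_kernel(BigDataStructure input, size_t len) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Ballot before any divergence, so the mask names exactly the lanes that pass the guard.
    const unsigned mask = __ballot_sync(0xFFFFFFFFu, i < len);
    if (i < len) {
        computeExtremelyInvolvedFunctionThenReduce(input, mask);
    }
}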

Can you use CUDA 11?

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

__device__
int complicatedfunction(int i){
    return i;
}

__global__ void kernel(int* output, int len) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if(id < len){
        int val = complicatedfunction(id);
        // Group of the threads that are converged/active at this point in the code
        auto coalesced = cg::coalesced_threads();
        int reduced = cg::reduce(coalesced, val, cg::plus<int>());
        // The lowest-ranked thread of the coalesced group publishes the group's sum
        if(coalesced.thread_rank() == 0){
            atomicAdd(output, reduced);
        }
    }
}
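Applied to the function in your question, it would look something like this (a sketch only; cg::reduce accepts float with cg::plus<float>() the same way, and this reuses the includes and the cg namespace alias from above):

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input) {
    float val = extremelyInvolvedFunction(input);
    // Group of the threads that are converged/active at this call site
    auto coalesced = cg::coalesced_threads();
    float reduced = cg::reduce(coalesced, val, cg::plus<float>());
    if (coalesced.thread_rank() == 0) {
        atomicAdd(input.reductionTarget, reduced);
    }
}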

I reached a similar conclusion here

And when your usage of shuffle is for a simple warp-aggregated atomic, where the order of operations does not really matter, I think the CG suggestion should work. It doesn’t necessarily guarantee the same behavior as using a precomputed mask, but the result should be “the same”, allowing for variability in the order of the floating-point atomic operations.