Warp Reduction in Kernel with "if" guard

I have a very computationally heavy kernel that consumes many inputs, writes many outputs, and does grid reductions within the kernel to avoid writing even more outputs. The simple way to handle the reductions is to have every thread atomicAdd to the same output. But I’d like to try to do a warp reduction before the atomicAdd: is this possible without undefined behavior? Here’s a simplified version of my code.

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input) {
        float val = extremelyInvolvedFunction(input);
        unsigned mask = __activemask();
        for (int offset = 16; offset > 0; offset /= 2) {
            val += __shfl_down_sync(mask, val, offset); // If 0 < len % 32 < 16, then this is undefined on lane 0 on first iteration?
        }
        if (threadIdx.x % 32 == 0) {
            atomicAdd(input.reductionTarget, val);
        }
}

__global__ void big_kernel(BigDataStructure input, size_t len) {
    if (threadIdx.x < len) {
        ... // Expensive stuff
        computeExtremelyInvolvedFunctionThenReduce(input);
        ... // Expensive stuff
    }
}

I believe the conclusion of the thread here: confuse about warp-level mask is that, in a kernel where the computation is guarded by a length check and the length is not a multiple of the warp size, it is not well-defined behavior to do a warp reduction inside the guard. The only ways around this are to restructure the code to initialize the variables outside the guard and do the reductions after the guard (not feasible if computeExtremelyInvolvedFunctionThenReduce is meant to be a standalone function that can be called from multiple kernels), or to pad the data arrays with dummy data so the length becomes a multiple of 32 (feasible, but it forces significant external code changes).
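For concreteness, the first workaround (initialize outside the guard, reduce after it) would look roughly like this on my simplified kernel. This is only a sketch reusing the names from above, and it only works by folding the reduction back into each kernel body:

__global__ void big_kernel(BigDataStructure input, size_t len) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = 0.0f;                              // identity value for lanes past len
    if (i < len) {
        val = extremelyInvolvedFunction(input);    // expensive work stays inside the guard
    }
    // Every lane of every warp reaches this point, so the full mask is legal here.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFFu, val, offset);
    }
    if (threadIdx.x % 32 == 0) {
        atomicAdd(input.reductionTarget, val);
    }
}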

Is this conclusion correct, or is there a way to do a warp-reduction inside an if-guard that may mask off the highest threads in the warp without invoking undefined behavior?

You should be able to condition the shfl_down_sync op using both conditional code and a mask that matches the intended active lanes, to handle the cases in the “guard zone”.

Do you mean something like:

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input) {
        float val = extremelyInvolvedFunction(input);
        unsigned mask = __activemask();
        for (unsigned offset = 16; offset > 0; offset >>= 1) {
            // Every active lane executes the shuffle; only use the result if the
            // source lane (this lane + offset) is itself named in the mask.
            float other = __shfl_down_sync(mask, val, offset);
            unsigned src = threadIdx.x % 32 + offset;
            val += (src < 32 && (mask & (1u << src))) ? other : 0.0f;
        }
        if (threadIdx.x % 32 == 0) {
            atomicAdd(input.reductionTarget, val);
        }
}

Or, instead of __activemask(), must we architect the code to pass a precomputed mask down into computeExtremelyInvolvedFunctionThenReduce?

Something like that may work.

I would precompute the thread/warp state and the mask, and pass that into functions that need it. I don’t really understand your skeleton code; for example, this line makes no sense to me:

if (threadIdx.x < len) {

You also haven’t shown the launch configuration methodology, so I can’t really be specific. If we assume a typical block round-up strategy, then only the last block in the grid can contain an incomplete warp (I’m imagining a different code now, not the one you have shown, which I don’t understand). I would detect that condition and, for each warp in the last block, identify its warp configuration (which threads are active) and use that, passing it into functions that need it (sketched below the list).

If we had a partial warp, then the key things would be:

  • make the mask conform to the partial warp configuration
  • make sure, in any shuffle op, to only direct threads to use values that are in the partial warp configuration/mask. So, for example, a thread that grabs a value from another place in the warp should only consider the result valid if that other place in the warp corresponds to an active thread identified in the mask.

You can certainly make a shuffle down reduction that behaves like that.
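For the mask itself, I have in mind something along these lines. This is just a sketch (activeLaneMask is an illustrative name, not a CUDA builtin), assuming a 1D launch, blockDim.x a multiple of 32, and the usual round-up grid sizing, so only the tail of the last warp in the grid can be partial:

// Illustrative helper: returns a mask with a bit set for each lane of this warp
// that satisfies i < len. For every warp except possibly the last one in the
// grid this is the full mask.
__device__ unsigned activeLaneMask(size_t len) {
    const unsigned i         = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned lane      = threadIdx.x % 32;
    const unsigned warpStart = i - lane;            // global index held by lane 0 of this warp
    if (warpStart + 32 <= len) return 0xFFFFFFFFu;  // complete warp
    if (warpStart >= len)      return 0u;           // warp entirely past the end
    return (1u << (unsigned)(len - warpStart)) - 1u; // low (len - warpStart) lanes are active
}

This would be computed once near the top of the kernel and handed to any function that wants to do a warp-level reduction.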

To clarify, I borrowed that line from the linked thread, where in that instance len was 25 (and presumably only one warp was configured to launch). But sure, replace that line with:

const int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < len) {

and assume a typical block round-up strategy, so there is at most one partial warp in a grid of arbitrary size, and it is divided into two contiguous portions: the lower lanes pass the check and the upper lanes fail.

I think, based on what you’ve said, this works if the mask is set properly:

        for (unsigned offset = 16; offset > 0; offset >>= 1) {
            float other = __shfl_down_sync(mask, val, offset);
            unsigned src = threadIdx.x % 32 + offset;
            val += (src < 32 && (mask & (1u << src))) ? other : 0.0f;
        }

where the ternary conditional is doing the “make sure to only direct threads to use values that are in the partial warp configuration/mask” key thing.

I guess then the only worry is whether we absolutely need to pass down a calculated mask (unsigned mask = __ballot_sync(0xFFFFFFFF, i < len); at the beginning of the kernel), or whether __activemask() is sufficient. I’m guessing that so long as there is even one other branch in the entire program besides the initial if-guard, the warp is allowed to be further diverged, so there’s no way to do this “obliviously” and we absolutely must update library function signatures with this mask to do warp reductions inside them.
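Concretely, I’m imagining a signature change like this (a sketch only, reusing the masked loop from above, with the mask computed while the warp is still fully converged):

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input,
                                                           unsigned mask) {
    float val = extremelyInvolvedFunction(input);
    for (unsigned offset = 16; offset > 0; offset >>= 1) {
        float other = __shfl_down_sync(mask, val, offset);
        unsigned src = threadIdx.x % 32 + offset;           // lane the value came from
        val += (src < 32 && (mask & (1u << src))) ? other : 0.0f;
    }
    if (threadIdx.x % 32 == 0) {
        atomicAdd(input.reductionTarget, val);
    }
}

__global__ void big_kernel(BigDataStructure input, size_t len) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Ballot before any divergence, so the mask names exactly the lanes that pass the guard.
    const unsigned mask = __ballot_sync(0xFFFFFFFFu, i < len);
    if (i < len) {
        computeExtremelyInvolvedFunctionThenReduce(input, mask);
    }
}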

Can you use CUDA 11?

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

__device__
int complicatedfunction(int i){
    return i;
}

__global__ void kernel(int* output, int len) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if(id < len){
        int val = complicatedfunction(id);
        // Group of the threads that are converged/active at this point in the code
        auto coalesced = cg::coalesced_threads();
        int reduced = cg::reduce(coalesced, val, cg::plus<int>());
        // The lowest-ranked thread of the coalesced group publishes the group's sum
        if(coalesced.thread_rank() == 0){
            atomicAdd(output, reduced);
        }
    }
}
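Applied to the function in your question, it would look something like this (a sketch only; cg::reduce accepts float with cg::plus<float>() the same way, and this reuses the includes and the cg namespace alias from above):

__device__ void computeExtremelyInvolvedFunctionThenReduce(BigDataStructure input) {
    float val = extremelyInvolvedFunction(input);
    // Group of the threads that are converged/active at this call site
    auto coalesced = cg::coalesced_threads();
    float reduced = cg::reduce(coalesced, val, cg::plus<float>());
    if (coalesced.thread_rank() == 0) {
        atomicAdd(input.reductionTarget, reduced);
    }
}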

I reached a similar conclusion here

And when your usage of shuffle is for a simple warp-aggregated atomic, where the order of operations does not really matter, I think the CG suggestion should work. It doesn’t necessarily guarantee the same behavior as using a precomputed mask, but the result should be “the same”, allowing for variability in the order of the floating-point atomic operations.