Using reduction instead of atomics?

Hello all,

Suppose I have a kernel that, as part of its algorithm, must sum values into the same array index from multiple threads - i.e. it creates a race condition. Currently I use atomicAdd and then continue with the rest of the kernel. Needless to say, the atomicAdd causes a large performance hit that I would like to avoid. Is there a way to employ a CUDA reduction as part of an existing kernel without calling another kernel? I can't count on the use of dynamic parallelism, as the hardware generations employed may vary.

Thank you to anyone willing to help.

A general outline of my kernel follows (please note this is just a brief and rough outline, not exact):

__global__ void myKernel(float *A, float *B, float r, int N) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N) {
        // code to generate a value via a device call
        float tmp = myKernel_A(A, r);

        // accumulate into element idx
        atomicAdd(&B[idx], tmp);

        // continue with rest of kernel operations
        ...
    }
}

You do not need dynamic parallelism for that. Just invoke a second kernel from the host.

Other than that, it is impossible to give advice on the best solution without more information:

  • How many different array indices are there?
  • How many threads want to add to the same array element?
  • Is there a regular pattern of which threads add to which array elements?
  • Is it possible to rearrange threads so that all (or at least many) threads adding to the same element are in the same block?

tera,

Thank you for the reply.

Please excuse my naivety with regard to CUDA; I am still learning. Is there a way to employ a reduction without calling a separate kernel from the host? If so, can you please point me to a link and/or pseudo-code that would accomplish this? I would like to avoid calling another kernel from the host just for the reduction part of the code.

Regards.

I’m not sure why you need an atomic add at all.

  • every thread has a globally unique index (idx)
  • therefore each thread is adding its local tmp value to a unique location in B:
atomicAdd(&B[idx], tmp);

I don’t see any reduction going on here, nor a need for atomics. According to the code you have shown here, you should be able to use:

B[idx] += tmp;

You can do a block-level reduction without calling a separate kernel. CUB can help with primitives for this:

http://nvlabs.github.io/cub/

A device-wide reduction will require some sort of device-wide synchronization scheme. The most convenient is the kernel launch.
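
To illustrate the block-level pattern, something like this is possible with cub::BlockReduce (untested sketch, just to show the idea; the block size of 256, the kernel name, and the in/out pointers are assumptions for the example, not your code):

#include <cub/cub.cuh>

__global__ void blockSumKernel(const float *in, float *out, int N) {
    // one reduction per block; launch with 256 threads per block to match
    typedef cub::BlockReduce<float, 256> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    float val = (idx < N) ? in[idx] : 0.0f;

    // block-wide sum; the result is only valid in thread 0
    float block_sum = BlockReduce(temp_storage).Sum(val);

    // one global atomic per block instead of one per thread
    if (threadIdx.x == 0)
        atomicAdd(out, block_sum);
}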

txbob,

Sorry, my mistake: idx is actually derived from an array of indices passed to the kernel. A more detailed kernel follows and should clear this up. Once again, I apologize for the incorrect kernel I posted.

__global__ void myKernel(float *A, float *B, int *indices, float r, int N) {

    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    if (idx < N) {
        // get index associated with this thread -
        // NOTE: not unique, as many different threads can map to the same index
        int i = indices[idx];

        // code to generate a value via a device call
        float tmp = myKernel_A(A, r);

        // accumulate into element i
        atomicAdd(&B[i], tmp);

        // continue with rest of kernel operations
        ...
    }
}

This is then actually Q reductions going on in parallel, where Q is the number of unique index values (of i) that actually occur in indices.

A simple reduction will not help here.
It will be difficult to solve this on a device-wide basis without using device-wide sync mechanisms, or atomics.

An alternative approach to atomics is outlined here:

http://stackoverflow.com/questions/28555479/reduction-or-atomic-operator-on-unknown-global-array-indices

(Be advised there is no actual worked example there. It might be that the approach is not faster than atomics. Performance comparisons of these two algorithmic approaches, such as here:

http://stackoverflow.com/questions/27894654/cuda-or-thrust-getting-n-maximum-numbers-in-a-grouped-array

will depend significantly on the actual data sets, including the “density” of atomics. Kepler in particular has significantly improved the performance of global atomics.)
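
To illustrate the sort/reduce-by-key idea from the first link, a rough sketch using Thrust might look like the following (my own untested illustration, not code from that answer; d_indices and d_tmp are assumed device pointers holding each thread's index and computed value):

#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// d_indices: device pointer to the per-thread index (indices[idx] in the kernel)
// d_tmp:     device pointer to the per-thread value (tmp in the kernel)
void sumByIndex(int *d_indices, float *d_tmp, int N,
                thrust::device_vector<int> &out_keys,
                thrust::device_vector<float> &out_sums)
{
    thrust::device_ptr<int>   keys(d_indices);
    thrust::device_ptr<float> vals(d_tmp);

    // group identical indices together
    thrust::sort_by_key(keys, keys + N, vals);

    // sum the values that belong to each unique index
    out_keys.resize(N);
    out_sums.resize(N);
    thrust::reduce_by_key(keys, keys + N, vals,
                          out_keys.begin(), out_sums.begin());
    // out_keys/out_sums now hold one (index, partial sum) pair per unique
    // index; these partial sums can then be added into B
}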

You can apply this at the block level if you want (thereby avoiding the need for a device-wide sync), but each block will then need to accumulate its partial results into the global results contained in B (perhaps using atomics). This isn't a trivial problem, and I wouldn't suggest it's appropriate for someone at the beginner level.

txbob,

Thanks for the information.

The block-level reduction sounds promising; applying the atomics at the block level rather than the thread level may be just the performance boost I am looking for. Can you provide a link and/or pseudo-code showing how to apply this block-level reduction?

Thank you again.

I’m not talking about a single block-level reduction. I’m talking about a block-level reduce-by-key operation (i.e. a segmented reduction).

At the device-wide level, CUB has a function:

http://nvlabs.github.io/cub/structcub_1_1_device_reduce.html#a4822e04d8701b10ac3f2d28effb454d3

But I don’t have something equivalent to point you to at the block level.
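
For the device-wide function above, the usual calling pattern looks roughly like this (untested sketch; note that ReduceByKey combines runs of consecutive equal keys, so scattered indices would need to be sorted first, and the pointer names here are placeholders):

#include <cub/cub.cuh>

// d_keys_in / d_values_in are assumed to already be sorted (grouped) by key
void deviceReduceByKey(int *d_keys_in, float *d_values_in,
                       int *d_unique_out, float *d_aggregates_out,
                       int *d_num_runs_out, int num_items)
{
    void   *d_temp_storage = NULL;
    size_t  temp_storage_bytes = 0;

    // first call: query how much temporary storage is needed
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_unique_out,
                                   d_values_in, d_aggregates_out,
                                   d_num_runs_out, cub::Sum(), num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // second call: perform the segmented reduction
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_unique_out,
                                   d_values_in, d_aggregates_out,
                                   d_num_runs_out, cub::Sum(), num_items);

    cudaFree(d_temp_storage);
}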

How scattered are the indices given by

indices[threadIdx.x + blockIdx.x * blockDim.x] ?

If the range is not greater than, say, 2^10 or 2^14, you should consider doing the reductions in shared memory at a block-local level first.
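
Something along these lines (untested sketch; NUM_BINS is an assumed compile-time bound on the index range, and myKernel_A is the device function from the earlier posts):

#define NUM_BINS 1024   // assumed upper bound on the index range

__global__ void myKernel_privatized(float *A, float *B, int *indices, float r, int N) {
    __shared__ float s_B[NUM_BINS];

    // zero the block-private copy of the accumulation array
    for (int j = threadIdx.x; j < NUM_BINS; j += blockDim.x)
        s_B[j] = 0.0f;
    __syncthreads();

    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N) {
        int i = indices[idx];
        float tmp = myKernel_A(A, r);   // device function from the earlier posts
        atomicAdd(&s_B[i], tmp);        // shared-memory atomic: contention stays within the block
    }
    __syncthreads();

    // flush the block's partial sums: at most one global atomic per bin per block
    for (int j = threadIdx.x; j < NUM_BINS; j += blockDim.x)
        if (s_B[j] != 0.0f)
            atomicAdd(&B[j], s_B[j]);
}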

Sorry to interfere, just a question to txbob.

But like this, don't we have a race condition?