Can I avoid using atomicAdd with this kernel?

peterSteele123 · January 24, 2015, 9:48pm

Hello all,

I have written a CUDA kernel that employs an atomicAdd and gives correct results, but I know that atomics should be avoided if possible. However, I am not sure whether they can be avoided entirely in this situation. I have thought about a parallel reduction, but it still seems to come down to potential race conditions.

Can anyone look at the following code (where N = n*n*n) with the atomicAdd and tell me whether it would be possible to execute it without the atomicAdd? Thank you for any help.

__global__ void myKern(float *M, float4 *A, int n, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float4 J = device_function(A[i], n);  // device function ...
        for (int k = 0; k < n; ++k) {
            atomicAdd(&M[k], A[i].w * R(n, J));  // R is a device function ...
        }
    }
}

allanmac · January 24, 2015, 10:21pm

Assuming N is the number of threads in your grid, you’re calling atomicAdd() many many many times: grid_threads * n.

For any decent-sized grid, the threads will form a conga-line on the M[0…n-1] atomicAdd()'s.

I would measure your kernel first with “nvprof” to see your average kernel duration.

Then I would do an inclusive prefix sum scan at the warp level and have the last lane in each warp perform an atomicAdd(). Or use a reduction and choose any lane. This will reduce your number of atomicAdd()'s by a factor of 32. Then measure it again with nvprof.
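A minimal sketch of the scan-then-one-atomic idea, not code from this thread. It assumes CUDA 9 or newer (the `__shfl_up_sync` intrinsic), a block size that is a multiple of 32, and a simplified hypothetical kernel `sum_per_warp` that sums a plain float array into a single output; `demo_sum_per_warp` is a hypothetical host helper added only to make the example runnable.

```cuda
#include <cuda_runtime.h>

// Inclusive prefix sum across the 32 lanes of a warp.
// Assumes all 32 lanes are active (full mask).
__device__ __forceinline__ float warp_inclusive_scan(float v) {
    const int lane = threadIdx.x % warpSize;
    for (int offset = 1; offset < warpSize; offset <<= 1) {
        float up = __shfl_up_sync(0xffffffffu, v, offset);
        if (lane >= offset) v += up;   // lanes below `offset` have no source lane
    }
    return v;
}

// Hypothetical kernel: one atomicAdd per warp instead of one per thread.
__global__ void sum_per_warp(const float *in, float *out, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads contribute 0 so every lane still joins the shuffle.
    float v = (i < count) ? in[i] : 0.0f;
    float scanned = warp_inclusive_scan(v);
    // After an inclusive scan the last lane holds the warp total.
    if (threadIdx.x % warpSize == warpSize - 1)
        atomicAdd(out, scanned);
}

// Hypothetical host helper: sums `count` ones and returns the result.
float demo_sum_per_warp(int count) {
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, count * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < count; ++i) in[i] = 1.0f;
    *out = 0.0f;
    const int threads = 128;                       // multiple of 32 (assumption)
    const int blocks = (count + threads - 1) / threads;
    sum_per_warp<<<blocks, threads>>>(in, out, count);
    cudaDeviceSynchronize();
    const float result = *out;
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

If only the total is needed, a reduction does the same job with slightly less work than a scan; the scan is worth it when the per-lane running sums are also useful.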

After that consider moving from a warp-wide to a full block-wide reduction and atomicAdd(). Measure it again!
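A rough sketch of what the block-wide step might look like, under the same assumptions as before: CUDA 9 or newer for `__shfl_down_sync`, a block size that is a multiple of 32, and a simplified hypothetical kernel (`sum_per_block`) that sums a float array into one output. The host helper `demo_sum_per_block` is likewise made up for illustration.

```cuda
#include <cuda_runtime.h>

// Warp-level sum; after the loop, lane 0 holds the warp total.
__device__ __forceinline__ float warp_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Hypothetical kernel: one atomicAdd per block.
__global__ void sum_per_block(const float *in, float *out, int count) {
    __shared__ float partial[32];   // one slot per warp (at most 32 warps per block)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;

    // Step 1: each warp reduces its own values.
    float v = warp_sum((i < count) ? in[i] : 0.0f);
    if (lane == 0) partial[warp] = v;
    __syncthreads();

    // Step 2: the first warp reduces the per-warp partials.
    if (warp == 0) {
        int num_warps = blockDim.x / warpSize;
        v = (lane < num_warps) ? partial[lane] : 0.0f;
        v = warp_sum(v);
        if (lane == 0) atomicAdd(out, v);
    }
}

// Hypothetical host helper: sums `count` ones and returns the result.
float demo_sum_per_block(int count) {
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, count * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < count; ++i) in[i] = 1.0f;
    *out = 0.0f;
    const int threads = 256;                       // multiple of 32 (assumption)
    const int blocks = (count + threads - 1) / threads;
    sum_per_block<<<blocks, threads>>>(in, out, count);
    cudaDeviceSynchronize();
    const float result = *out;
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

With 256 threads per block this issues one atomicAdd() per 256 inputs rather than one per 32, at the cost of a shared-memory round trip and a __syncthreads().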

You could avoid all atomicAdd()'s by simply writing the results to an (n x grid) sized array and then, after your kernel exits, perform n reductions in parallel… Measure it all again.
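The atomic-free variant could be sketched roughly as follows. Everything here is an assumption for illustration: `contribution(i, k)` is a hypothetical stand-in for the per-thread term, the block size is a power of two, and the partials are laid out as `partial[k * grid + block]`. Unlike the atomicAdd() version, the second pass overwrites M[k] rather than accumulating into it.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-thread term that gets added to M[k].
__device__ float contribution(int i, int k) { return (float)(k + 1); }

// Shared-memory tree reduction; blockDim.x must be a power of two.
// Every thread of the block must call this.
__device__ float block_sum(float v, float *scratch) {
    scratch[threadIdx.x] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
        __syncthreads();
    }
    float total = scratch[0];
    __syncthreads();   // scratch may be reused by the next call
    return total;
}

// Pass 1: each block stores its partial sum for every k. No atomics.
__global__ void partial_sums(float *partial, int n, int N) {
    extern __shared__ float scratch[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < n; ++k) {
        float v = (i < N) ? contribution(i, k) : 0.0f;
        float total = block_sum(v, scratch);
        if (threadIdx.x == 0) partial[k * gridDim.x + blockIdx.x] = total;
    }
}

// Pass 2: block k sums the `grid` partials belonging to M[k].
__global__ void final_sums(const float *partial, float *M, int grid) {
    extern __shared__ float scratch[];
    float v = 0.0f;
    for (int b = threadIdx.x; b < grid; b += blockDim.x)
        v += partial[blockIdx.x * grid + b];
    float total = block_sum(v, scratch);
    if (threadIdx.x == 0) M[blockIdx.x] = total;
}

// Hypothetical host helper: runs both passes and returns M[k].
float demo_two_pass(int n, int N, int k) {
    const int threads = 256;                       // power of two (assumption)
    const int grid = (N + threads - 1) / threads;
    float *partial = nullptr, *M = nullptr;
    cudaMallocManaged(&partial, (size_t)n * grid * sizeof(float));
    cudaMallocManaged(&M, n * sizeof(float));
    partial_sums<<<grid, threads, threads * sizeof(float)>>>(partial, n, N);
    final_sums<<<n, threads, threads * sizeof(float)>>>(partial, M, grid);
    cudaDeviceSynchronize();
    const float result = M[k];
    cudaFree(partial);
    cudaFree(M);
    return result;
}
```

The trade-off is extra memory (n x grid floats) and a second kernel launch in exchange for no contention at all on M.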

Note that you may find even partially reducing your number of atomicAdd()'s will give you good performance without having to write more code and launch more kernels.

Atomic ops are really powerful (and necessary) but can impact your kernel performance if you’re not careful.

Here are links to the session “S3101 - Understanding and Using Atomic Memory Operations”:

View PDF
View Recording

Trivia factoid: don’t forget that atomicAdd(float) will flush subnormal inputs to zero.

Obligatory picture of a conga line:

peterSteele123 · January 24, 2015, 10:37pm

Thank you, allanmac. I am new to CUDA atomics, and you gave some great references.

The warp-level prefix sum scan sounds like a great idea. Please excuse my ignorance about the subject but can you provide some pseudo-code for this algorithm as it applies to the kernel code that I posted - if possible?

Thank you again for your assistance.

allanmac · January 24, 2015, 10:40pm

From a quick skim, the first link in Google looks pretty good and has some code too:

Lecture 4: warp shuffles, and reduction / scan operations.

Assuming you’re on a Kepler or newer device, pages 7 and 8 show how to quickly sum 32 values in about 10 instructions.

peterSteele123 · January 24, 2015, 11:30pm

Thanks again.

I will see how it works.

peterSteele123 · January 25, 2015, 3:17am

allanmac, after reviewing the links you posted I wrote some code up.

I won’t be able to test the results until next week, but can you tell me whether the following code looks okay with regard to the warp-level reduction, so that performance should improve over the original code I posted (i.e. pure atomicAdd()s)?

Thank you again for all your help with this.

__device__  __inline__  float warp_red(float sum) {
    sum += __shufl_xor(sum, 16);
    sum += __shufl_xor(sum, 8);
    sum += __shufl_xor(sum, 4);
    sum += __shufl_xor(sum, 2);
    sum += __shufl_xor(sum, 1);
    return sum;
}

__device__ float4 device_function(float4 A, int n) {

    // Does some stuff, returning a float4 (the kernel stores the result in a float4)
}

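__global__ void myKern(float *M, float4 *A, int n, int N) {
    int laneID = threadIdx.x % warpSize;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 J = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    if (i < N)
        J = device_function(A[i], n);
    for (int k = 0; k < n; ++k) {
        // Reduce only this thread's term; M[k] itself is accumulated by the atomicAdd.
        // Out-of-range threads contribute 0 but still take part in the shuffles.
        float wsum = (i < N) ? A[i].w*R(n, J) : 0.0f;
        wsum = warp_red(wsum);
        if (laneID == 0)
            atomicAdd(&M[k], wsum);
    }
}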

allanmac · January 25, 2015, 4:53pm

Yes, looks good to me. Minor typo: ‘shufl’ → ‘shfl’.

You might be able to hoist R(n,J) outside the for(k) loop?
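One way the hoisting could look, as a sketch only. `device_function` and `R` below are trivial hypothetical stand-ins, since their real bodies are not shown in the thread, and `demo_hoisted` is a made-up host helper. The sketch uses `__shfl_xor_sync` (CUDA 9 or newer) and assumes a block size that is a multiple of 32. It relies on the term not depending on k, as in the posted loop; if the real R depends on k, only the k-independent part can be moved out.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the thread's device functions.
__device__ float4 device_function(float4 a, int n) { return a; }
__device__ float R(int n, float4 J) { return J.x; }

// Butterfly reduction: every lane ends up with the warp total.
__device__ __forceinline__ float warp_red(float sum) {
    sum += __shfl_xor_sync(0xffffffffu, sum, 16);
    sum += __shfl_xor_sync(0xffffffffu, sum, 8);
    sum += __shfl_xor_sync(0xffffffffu, sum, 4);
    sum += __shfl_xor_sync(0xffffffffu, sum, 2);
    sum += __shfl_xor_sync(0xffffffffu, sum, 1);
    return sum;
}

__global__ void myKern(float *M, const float4 *A, int n, int N) {
    int laneID = threadIdx.x % warpSize;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Loop-invariant term: computed once per thread instead of n times.
    float term = 0.0f;
    if (i < N) {
        float4 a = A[i];
        term = a.w * R(n, device_function(a, n));
    }

    // Reduced once per warp, outside the bounds check so all lanes participate.
    float wsum = warp_red(term);

    if (laneID == 0)
        for (int k = 0; k < n; ++k)
            atomicAdd(&M[k], wsum);
}

// Hypothetical host helper: each input contributes 0.5 * 2 = 1 to every M[k].
float demo_hoisted(int n, int N, int k) {
    float4 *A = nullptr;
    float *M = nullptr;
    cudaMallocManaged(&A, N * sizeof(float4));
    cudaMallocManaged(&M, n * sizeof(float));
    for (int i = 0; i < N; ++i) A[i] = make_float4(2.0f, 0.0f, 0.0f, 0.5f);
    for (int j = 0; j < n; ++j) M[j] = 0.0f;
    const int threads = 128;                       // multiple of 32 (assumption)
    myKern<<<(N + threads - 1) / threads, threads>>>(M, A, n, N);
    cudaDeviceSynchronize();
    const float result = M[k];
    cudaFree(A);
    cudaFree(M);
    return result;
}
```

Hoisting removes n - 1 evaluations of R and n - 1 warp reductions per thread, which matters if R is expensive.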

It would be great to see before and after kernel runtimes (via nvprof) to see if atomics were a significant bottleneck. :)

peterSteele123 · January 25, 2015, 6:18pm

Thank you for the help.

I will definitely run a performance comparison of the two versions and see how much of a bottleneck the atomics were. I will post the results as soon as I can.

peterSteele123 · January 26, 2015, 8:51pm

Strangely, it turns out that using plain Kepler atomics is actually somewhat faster than using the warp-level reduction. This is consistent regardless of the number of “points” executed. The number of “points” can roughly be defined as the value of ‘M’ in the above kernel(s).

Below is a quick comparison of the atomics only and with warp-reduction.

With warp-level reduction:
points: 512 Time (ms): 25.0361
points: 1728 Time (ms): 44.1507

Without warp-level reduction (atomics only):
points: 512 Time (ms): 24.1516
points: 1728 Time (ms): 43.3734

allanmac · January 26, 2015, 9:38pm

Welcome to Level 2. :)

Now you get to find where the real bottleneck is hiding in your code. :)
