Can I avoid using AtomicAdd with this kernel ???

Hello all,

I have created a CUDA kernel that employees a atomicAdd and is correct but I know that atomics should be avoided if possible. However I am not sure if it is entirely possible to avoid in this situation. I have thought about parallel reduction but it still comes down to potential race conditions.

Can anyone look at the following code (where N = nnn) with the atomicAdd and tell me if it would be possible to execute without using the atomicAdd ? Thank you for any help.

__global__ void myKern(float *M, float4 *A, int n, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float4 J = device_function(A[i], n);  // device kernel function ...
        for(int k = 0; k < n; ++k) {
            atomicAdd(&M[k], A[i].w*R(n, J);  // R is a device kernel function ...
        }
    }
}

Assuming N is the number of threads in your grid, you’re calling atomicAdd() many many many times: grid_threads * n.

For any decent-sized grid, the threads will form a conga-line on the M[0…n-1] atomicAdd()'s.

I would measure your kernel first with "nvprof " to see your average kernel duration.

Then I would do an inclusive prefix sum scan at the warp level and have the last lane in each warp perform an atomicAdd(). Or use a reduction and choose any lane. This will reduce your number of atomicAdd()'s by 32. Then measure it again with nvprof.

After that consider moving from a warp-wide to a full block-wide reduction and atomicAdd(). Measure it again!

You could avoid all atomicAdd()'s by simply writing the results to an (n x grid) sized array and then, after your kernel exits, perform n reductions in parallel… Measure it all again.

Note that you may find even partially reducing your number of atomicAdd()'s will give you good performance without having to write more code and launch more kernels.

Atomic ops are really powerful (and necessary) but can impact your kernel performance if you’re not careful.

Here are links to the session “S3101 - Understanding and Using Atomic Memory Operations”:

View PDF
View Recording

Trivia factoid: don’t forget that atomicAdd(float) will flush subnormal inputs to zero.

Obligatory picture of a conga line:

Thank you ‘allanmac’ I am new to CUDA atomics and you gave some great references.

The warp-level prefix sum scan sounds like a great idea. Please excuse my ignorance about the subject but can you provide some pseudo-code for this algorithm as it applies to the kernel code that I posted - if possible?

Thank you again for your assistance.

Skimming the first link in Google looks pretty good and has some code too:

Lecture 4: warp shuffles, and reduction / scan operations.

Assuming you’re on a Kepler or newer device, pages 7 and 8 show how to quickly sum 32 values in about 10 instructions.

Thanks again.

I will see how it works.

allanmac, after reviewing the links you posted I wrote some code up.

I won’t be able to test the results until next week but can you tell me if the following code looks okay with regards to using warp-level reduction such that the performance should improve over the original code I posted (i.e. pure atomicAdd(s)) ?

Thank you again for all your help with this.

__device__  __inline__  float warp_red(float sum) {
    sum += __shufl_xor(sum, 16);
    sum += __shufl_xor(sum, 8);
    sum += __shufl_xor(sum, 4);
    sum += __shufl_xor(sum, 2);
    sum += __shufl_xor(sum, 1);
    return sum;
}

__device__ float device_function(float4 A, int n) {

    // Does some stuff, returning float
}

__global__ void myKern(float *M, float4 *A, int n, int N) {
    int laneID = threadIdx.x % warpSize;
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    if (i < N) {
        float4 J = device_function(A[i], n);
        for (int k = 0; k < n; ++k) {
            float wsum = M[k] + A[i].w*R(n, J);
            wsum = warp_red(wsum);
            if (laneID == 0)
                atomicAdd(&M[k], wsum);
        }
    }
}

Yes, looks good to me. Minor typo: ‘shufl’ -> ‘shfl’.

You might be able to hoist R(n,J) outside the for(k) loop?

It would be great to see before and after kernel runtimes (via nvprof) to see if atomics were a significant bottleneck. :)

Thank you for the help.

I will definitely run a performance on the difference and see what kind of bottleneck atomics generated. I will post the results as soon as I can.

Strangely it turns out that the use of Kepler atomics is actually somewhat faster than using the warp-level reductions. This is consistent regardless of the number of “points” executed. The number of “points” can roughly be defined as the value of ‘M’ in the above kernel(s).

Below is a quick comparison of the atomics only and with warp-reduction.

With warp-level reduction:
points: 512 Time (ms): 25.0361
points: 1728 Time(ms): 44.1507

Without warp-level reduction (atomics only):
points: 512 Time(ms): 24.1516
points: 1728 Time (ms): 43.3734

Welcome to Level 2. :)

Now you get to find where the real bottleneck is hiding in your code. :)