AtomicAdd in Shared memory is measured slower than in Global memory. Timing, Shared memory, Atomic o

skchoe · February 21, 2012, 10:09pm

Dear all,

I wrote 2 kernels to see how much faster atomicAdd() to shared memory is than atomicAdd() to global memory.
The kernel is simple: each thread keeps adding i for i=0 … ITER-1, with 16 threads per block across 256/16 blocks.

The result I cannot understand is:
atomicAdd to Shared memory - 140ms
atomicAdd directly to Global memory - 90ms

I would appreciate it if you could drop a line.

SK.

Here’s the simple code:
#define WARP_WIDTH 16
#define W 256
#define ITER 1000000

///////////////AtomicAdd to Shared memory ‘shd’//////////////////
__global__ void kernel_shdatm(int* in, int* out)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    __shared__ int shd[WARP_WIDTH];
    shd[threadIdx.x] = in[j];

    // Each thread adds into its own shared-memory slot.
    int i;
    for(i=0;i<ITER;i++)
        atomicAdd((int*)&(shd[threadIdx.x]), i );

    out[j] = shd[threadIdx.x];
    __syncthreads();
    return;
}

///////////////AtomicAdd to global memory ‘out’//////////////////
__global__ void kernel_glbatm(int* in, int* out)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    // Each thread adds into its own global-memory element.
    int i;
    for(i=0;i<ITER;i++)
        atomicAdd((int*)&(out[j]), i);

    __syncthreads();
    return;
}

////////////////////////////////////////////////////////////////////
// kernel call

// to shared memory->global memory copy
kernel_shdatm<<<W/WARP_WIDTH, WARP_WIDTH>>>(g_in, g_out);

// to global memory directly.
kernel_glbatm<<<W/WARP_WIDTH, WARP_WIDTH>>>(g_ing, g_outg);

The time is measured by surrounding each launch, including memory allocation and copies, with cudaEvent…() calls.
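Because the measurement described here also covers allocation and copies, one-time costs such as context setup can land on whichever kernel runs first. The following is a minimal sketch of timing only the kernel launches, with an untimed warm-up pass for each. It assumes zero-initialized buffers, and the names THREADS, launch, time_kernel_ms and compare_atomics are illustrative, not from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 16      // threads per block (WARP_WIDTH in the post)
#define W 256           // total number of threads
#define ITER 1000000    // atomicAdd calls per thread

// Same work as the posted kernels: each thread updates only its own slot.
__global__ void kernel_shdatm(const int* in, int* out)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    __shared__ int shd[THREADS];
    shd[threadIdx.x] = in[j];
    for (int i = 0; i < ITER; i++)
        atomicAdd(&shd[threadIdx.x], i);
    out[j] = shd[threadIdx.x];
}

__global__ void kernel_glbatm(const int* in, int* out)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    for (int i = 0; i < ITER; i++)
        atomicAdd(&out[j], i);
}

// Hypothetical helper: launches one of the two kernels.
static void launch(bool use_shared, const int* d_in, int* d_out)
{
    if (use_shared)
        kernel_shdatm<<<W / THREADS, THREADS>>>(d_in, d_out);
    else
        kernel_glbatm<<<W / THREADS, THREADS>>>(d_in, d_out);
}

// Hypothetical helper: returns the kernel-only time in milliseconds.
// Pass 0 is a warm-up; only the time of the last pass is returned.
static float time_kernel_ms(bool use_shared, const int* d_in, int* d_out)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;
    for (int pass = 0; pass < 2; pass++) {
        cudaMemset(d_out, 0, W * sizeof(int));   // reset outside the timed region
        cudaEventRecord(start);
        launch(use_shared, d_in, d_out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);              // wait until the kernel has finished
        cudaEventElapsedTime(&ms, start, stop);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Allocates the buffers once, times both kernels, and prints the results.
// Returns 0 on success and 1 if any CUDA call reported an error.
int compare_atomics()
{
    int *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, W * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, W * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemset(d_in, 0, W * sizeof(int));        // assumption: input starts at zero

    float shared_ms = time_kernel_ms(true, d_in, d_out);
    float global_ms = time_kernel_ms(false, d_in, d_out);
    cudaError_t err = cudaGetLastError();

    printf("shared atomicAdd: %.3f ms\n", shared_ms);
    printf("global atomicAdd: %.3f ms\n", global_ms);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling compare_atomics() from main() and compiling with nvcc should print one kernel-only time per variant. The numbers will depend on the GPU architecture, so they need not reproduce the 140ms and 90ms figures.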

Gregory_Diamos · February 21, 2012, 10:40pm

The short answer is don’t use shared memory atomics if you care about performance.

skchoe · February 22, 2012, 1:24am

Thanks for the reply, Gregory.

Yes, it’s a rule of thumb: “Avoid Atomics!”

The original intention of the question was to compare atomics to global memory with atomics to shared memory. I wanted to see whether the overhead of copying between global and shared memory becomes relatively small when the cost of the atomics is so large for each thread. In the code, 10000 atomicAdd() calls should be far costlier than a one-time copy between global and shared memory.

Am I looking at this situation the right way?

Thanks,

SK.

I just wanted to see the theory(?) :

in my example.
