Atomic instructions on global and shared memory

Cudip · July 30, 2011, 2:41pm

Hi gys,

I am trying to use atomicadd instruction on shared memory to speed up my code, but it is having the opposite effect.

Atomic instruction on global memory is as follows:

__global__ void calcCentroidKernel( int *gpu_labels, int *gpu_nRegions, int *gpu_regionOff, int *gpu_regionSize, int *gpu_centroid, int *i, int pitch)

{

	int     x = (blockIdx.x*blockDim.x)+threadIdx.x;

        int     y = (blockIdx.y*blockDim.y)+threadIdx.y;  

        int index = x+y*pitch;

        int     j = 0;

	if( gpu_labels[index] == index)

	{

		atomicAdd( i, 1);						

		atomicAdd( gpu_regionOff + (*i), index);		

	}

	for( j=0; j < *gpu_nRegions; j++)

	{

		if( gpu_labels[index] == gpu_regionOff[j])

			break;

	}

	if( gpu_labels[index] != -1)

	{

		atomicAdd( gpu_centroid+(2*j), x);

		atomicAdd( gpu_centroid+(2*j)+1, y);

		atomicAdd( gpu_regionSize+j, 1);

	}	

}

Atomic instruction on shared memory is as follows:

__global__ void calcCentroidSharedKernel(int *gpu_labels,int *gpu_nRegions,int *gpu_regionOff,int *gpu_regionsSize,int *gpu_centroid,int *i,int pitch)

{

	extern __shared__ int sMem[];

	int     	    x = (blockIdx.x*blockDim.x)+threadIdx.x;

        int     	    y = (blockIdx.y*blockDim.y)+threadIdx.y;  

        int 	        index = x+y*pitch;

        int 		    j = 0; 

        int 	shSize_Offset = 0;

        int shCentroid_Offset = *gpu_nRegions;

if( gpu_labels[index] == index)

        {

		atomicAdd( i, 1);

		atomicAdd( gpu_regionOff + (*i), index);

        }

for( j=0; j < *gpu_nRegions; j++)

	{

			if( gpu_labels[index] == gpu_regionOff[j])

			break;

	}

	if(gpu_labels[index] != -1)

	{

		atomicAdd( sMem+shCentroid_Offset+(2*j), x);

		atomicAdd( sMem+shCentroid_Offset+(2*j)+1, y);

		atomicAdd( sMem+shSize_Offset+j, 1);

	}

	__syncthreads();

	atomicAdd(gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);

	atomicAdd(gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);

	atomicAdd(gpu_regionsSize+j, sMem[shSize_Offset+j]);

	

}

I am unable to understand why the timing of my code increased after using shared memory. Though it looks like shared memory usage is just a over head but considering that shared memory is used >200 times, the usage of shared memory should have provided considerable timing improvement…

tera · July 30, 2011, 4:35pm

Your shared memory version is doing the shared memory atomic adds in addition to those done by the other version as well. So why would you expect it to run faster?

BTW: Did you compare tesults of both versions? The shared memory version lacks initialization of the shared memory, so you’d be extremely lucky if they produced the same results.

Cudip · July 30, 2011, 4:52pm

Shared memory should perform faster than global because of the following reasons…

Shared Memory access is faster than global memory access.
Assuming i have around 1000 additions to be done…
So in the first case all those additions will be done on global memory (assuming 1000 accesses)
Whereas in the second case firstly additions will be done on shared memory and then depending on grid size ((16,12) in this case) there will be much fewer global atomic additions. So theoretically second should be faster

And yes you are right i didnt initialize shared memory (though i only measured timing didnt check the output).

Cudip · July 30, 2011, 5:11pm

I think i messed up after __syncthreads()…

The code after it should be this…

__syncthreads();

if(index == 0)

{

        atomicAdd(gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);

        atomicAdd(gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);

        atomicAdd(gpu_regionsSize+j, sMem[shSize_Offset+j]);

}

I am able to get performance improvement now. Could you point out any optimizations that can be done in this code (The timing i am recieving now is still too much). Is there any other approach that i can follow?

tera · July 30, 2011, 6:10pm

You probably want to make this

__syncthreads();

if((threadIdx.x==0) && (threadIdx.y==0))

{

        atomicAdd(gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);

        atomicAdd(gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);

        atomicAdd(gpu_regionsSize+j, sMem[shSize_Offset+j]);

}

in order to get correct results.

I don’t understand what these kernels are doing, so I can’t suggest improvements. However the scattered atomicAdds are probably the worst case for memory throughput and anything you could do to at least partly coalesce them would certainly help.

Cudip · July 30, 2011, 6:41pm

These kernels are calculating the size and centroid of an arbitrary object shape in a image. The region Offset is the index value of top left pixel of the shape. I know the atomicAdds are all scattered accross the object shape, I couldnt find a way to coalesce them.

After the modifications this is code i came up with:

__global__ void calcCentroidSharedKernel(int *gpu_labels,int *gpu_nRegions,int *gpu_regionOff,int *gpu_regionsSize,int *gpu_centroid,int *i,int pitch)

{

	extern __shared__ int sMem[];

	int     			x = (blockIdx.x*blockDim.x)+threadIdx.x;

        int     			y = (blockIdx.y*blockDim.y)+threadIdx.y;  

        int 			    index = x+y*pitch;

        int 				j = 0; 

        int 	            shSize_Offset = 0;

        int             shCentroid_Offset = *gpu_nRegions-1;         // This line is modified.

	if((index >= 0) && (index < (3*(*gpu_nRegions))))            // This is the added lines of code.

		sMem[index] = 0;

if( gpu_labels[index] == index)

        {

		atomicAdd( i, 1);

		atomicAdd( gpu_regionOff + (*i), index);

        }

for( j=0; j < *gpu_nRegions; j++)

	{

			if( gpu_labels[index] == gpu_regionOff[j])

			break;

	}

	if(gpu_labels[index] != -1)

	{

		atomicAdd( sMem+shCentroid_Offset+(2*j), x);

		atomicAdd( sMem+shCentroid_Offset+(2*j)+1, y);                          /******** Program crashing code line **********/

		atomicAdd( sMem+shSize_Offset+j, 1);

	}

	__syncthreads();

	if((threadIdx.x == 0) && (threadIdx.y == 0))                                    // This is the condition modified

	{

		atomicAdd(gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);

		atomicAdd(gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);       /******** Program crashing code line **********/

		atomicAdd(gpu_regionsSize+j, sMem[shSize_Offset+j]);

	}

}

While I was testing this with a test video, the program abruptly crashed. I was able to find out that if i comment the marked atomic instructions the video runs fine.

EDIT: I forgot to ask can you point out reason for the crash…

jack · July 31, 2011, 2:36am

What data are you passing to the kernel? If *gpu_nRegions == 0, shCentroid_Offset is set to -1, the loop with ‘j’ as the index will never execute (so the value of ‘j’ stays zero). That would cause the first atomicAdd in each of the blocks at the bottom to use an out-of-bounds address (not the second/middle lines as you’ve marked with the comments).

Cudip · July 31, 2011, 4:50am

If *gpu_nRegions == 0 then the kernel is never called. Parameters are as follow:

@ gpu_labels = It is global memory that contains -1 value for each pixel that is background. And some positive integer where there is a shape. All the pixels within that shape have the same value (I had to do connected comppnent labelling before this) and the value is top left pixel of the shape.

@ gpu_nRegions = It contains total number of shapes in the image i.e total number of cirles,sqaures, etc…

@ gpu_regionOff = it is top left pixel of each shape.

@ gpu_regionsSize = It is size of the shape to be found.

@ gpu_centroid = It is the centroid of each shape to be found.

@ i = It is just a counter.

@ pitch = It is the image width.

Cudip · July 31, 2011, 3:16pm

Hi gys,

I came up with some kind of workaround to the crashing problem…

Initially the code used to call the kernel was this:

int *i;

	cudaMalloc( (void **)&i, sizeof(int));

	cudaMemset( i, -1, sizeof(int));

	int shSize = (nRegions)*3*sizeof(int);

	dim3 block( 15, 16, 1);

	dim3 grid( imageW / block.x, imageH / block.y, 1);

	if(shSize != 0)

	{

		calcCentroidSharedKernel<<<grid,block,shSize>>>(gpu_labels,gpu_nRegions,gpu_regionOff,gpu_regionSize,gpu_centroid,i,imageW);

	}

I modified the code to:

int *i;

	cudaMalloc( (void **)&i, sizeof(int));

	cudaMemset( i, -1, sizeof(int));

	int shSize = (nRegions)*4*sizeof(int);         // This is the line modified

	dim3 block( 15, 16, 1);

	dim3 grid( imageW / block.x, imageH / block.y, 1);

	if(shSize != 0)

	{

		calcCentroidSharedKernel<<<grid,block,shSize>>>(gpu_labels,gpu_nRegions,gpu_regionOff,gpu_regionSize,gpu_centroid,i,imageW);

	}

also inside the kernel the shCentroid_Offset should be = *gpu_nRegions; , not *gpu_nRegions-1;

The code now works fine, but the shared memory gets wasted this way. I am just curious why it didnt work previously… (somehow i think the size of shared memory 3*nRegions was not enough )

202476410arsmart · May 27, 2022, 9:44am

Does it really faster than global? Thank you!!
Also, your discussion is very useful for later learner like me!!!

Topic		Replies	Views
Worse atomic performance in shared than global memory CUDA Programming and Performance	7	8911	August 3, 2017
Performance issues on memory transfer CUDA Programming and Performance	13	12981	November 26, 2010
Shared mem atomics Repeat topic CUDA Programming and Performance	47	8781	December 1, 2009
How to speed up AtomicAdd kernel using shared memory CUDA Programming and Performance	9	9180	September 30, 2022
Really simple while loop issues CUDA Programming and Performance	4	3149	October 27, 2014
Strange Performance Problem using atomicAdd CUDA Programming and Performance	4	1604	September 12, 2010
help getting shared memory working CUDA Programming and Performance	11	4289	June 12, 2007
Is there a way to avoid atomicAdd in my situation? CUDA Programming and Performance	3	1392	March 4, 2019
Optimizing CUDA kernel with atomics CUDA Programming and Performance	10	211	August 19, 2024
AtomicAdd in Shared memory is measured slower than in Global memory. Timing, Shared memory, Atomic o CUDA Programming and Performance	2	7994	February 22, 2012

Atomic instructions on global and shared memory

Related topics