Local vs shared memory: execution slows down when using shared memory

I'm having trouble using shared memory.

I collect some data into local memory and, before writing it to global memory, I would like to reorder it through shared memory to get coalesced writes.

Here is pseudo code of the part of the kernel that copies the data from local to shared memory.

short lHMD[DIM_COLUMN];                          // local memory
__shared__ short sHMD[BLOCK_SIZE][DIM_COLUMN];   // shared memory

for (i = 0; i < (nDrawCirle) + 1; i++) {
    /* FILL lHMD IN SOME WAY */
}

__syncthreads();

/* NOW COPY THE DATA FROM LOCAL MEMORY TO SHARED MEMORY */
for (i = 0; i < DIM_COLUMN; i++)
    sHMD[threadIdx.x][i] = lHMD[i];

Copying the data from local memory directly into global memory (in an uncoalesced way) is faster than the code above.

On a GTX 275 the time increased from 0.7 s to 2 s. Note that the code above doesn't even write any data to global memory!!! :blink:

Any suggestions?

I might be able to tell you something if you pasted both versions of your code.

The question is: how can collision-free access to shared memory possibly turn out slower than uncoalesced access to global memory?

In the code, each thread has its own row of the shared matrix, and each thread writes its DIM_COLUMN locations exclusively.

The writes to shared memory are almost 3 times slower…

Obviously the code that writes only to shared memory doesn't produce any results (but it still gives worse times); in the other version we write the results back to global memory (from lHMD) with a stride of 1520, so that access shouldn't be coalesced at all.

By the way, this is the code that writes back to global memory from the lHMD array:

#define DIM_STRIDE 1520

	int vala = (DIM_STRIDE * (threadIdx.x + (blockDim.x * blockIdx.y))) + (DIM_COLUMN * blockIdx.x);

	for (i = 0; i < DIM_COLUMN; i++)
		HMD[vala + i] = lHMD[i];

HMD is a global array.
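For reference, a rough back-of-the-envelope on that access pattern (assuming houghD_t is a 2-byte type like the short lHMD buffer; that is my assumption, it is not stated in the thread):

	// In each iteration i, neighbouring threads of a half-warp write:
	//   thread t     -> &HMD[DIM_STRIDE *  t      + ... + i]  =  base +  t      * 1520 * 2 bytes
	//   thread t + 1 -> &HMD[DIM_STRIDE * (t + 1) + ... + i]  =  base + (t + 1) * 3040 bytes
	// The 16 stores of a half-warp therefore land in 16 different memory
	// segments, so compute 1.x hardware issues 16 separate transactions
	// instead of one or two coalesced ones.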

It could be (just a rough guess) that in the case with shared mem your array is so huge that the scheduler can't put more than a few blocks onto an MP, so you get just a few blocks per MP stuck reading from local mem. In the case without shared mem you might get very good MP occupancy, and those uncoalesced writes get hidden well. You could try running the CUDA profiler on your code.
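To put rough numbers on that guess, here is a minimal host-side sketch (my own, not from the thread) that queries the per-block shared memory limit with the standard cudaGetDeviceProperties call and divides it by the 64 * 64 * sizeof(short) tile the kernel declares:

	#include <cstdio>
	#include <cuda_runtime.h>

	int main()
	{
	    cudaDeviceProp prop;
	    cudaGetDeviceProperties(&prop, 0);

	    // Shared memory the kernel asks for: one 64 x 64 tile of shorts.
	    const size_t smemPerBlock = 64 * 64 * sizeof(short);   // 8192 bytes

	    printf("shared mem per block   : %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
	    printf("registers per block    : %d\n", prop.regsPerBlock);
	    printf("blocks limited by smem : %lu\n",
	           (unsigned long)(prop.sharedMemPerBlock / smemPerBlock));
	    return 0;
	}

On a GTX 275 sharedMemPerBlock is 16 KB (which is also the whole per-multiprocessor budget on compute 1.x), so at most two such blocks fit per MP from the shared-memory side alone, before registers or the kernel parameters are counted.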

Thanks for the reply, sergeyn.

Today we'll try to run the CUDA profiler again and we'll post the results here.

In the meantime I've done some tests to understand the problem better, but the results are rather astonishing…

This code:

#define DIM_STRIDE    1520   // SIDEX
#define DIM_PLINTH    64     // SIDEY / blockDim.x; DIM_PLINTH * blockDim.x must give 1520
#define MAX_BLOCK_DIM 64

	int i;
	short lHMD[DIM_PLINTH];

	for (i = 0; i < DIM_PLINTH; i++)
		lHMD[i] = 0;

	for (i = 0; i < (nDrawCirle) + 1; i++)
		peakLoop(lut_main, &d_peakCalc_array[i], lHMD, nbindb);   // do some computation and store data in lHMD[]

	int vala = (DIM_STRIDE * (threadIdx.x + (blockDim.x * blockIdx.y))) + (DIM_PLINTH * blockIdx.x);

	for (i = 0; i < DIM_PLINTH; i++)
		HMD[vala + i] = lHMD[i];   // copy to global

takes 804 ms.

This one instead:

#define DIM_STRIDE    1520
#define DIM_PLINTH    64
#define MAX_BLOCK_DIM 64

__global__
void mainKernel(t_peakCalc *d_peakCalc_array, float *lut_main, houghD_t *HMD,
                int nbindb, int nDrawCirle)
{
	int i;
	__shared__ short sHMD[MAX_BLOCK_DIM][DIM_PLINTH];

	for (i = 0; i < (nDrawCirle) + 1; i++)
		peakLoop(lut_main, &d_peakCalc_array[i], &sHMD[threadIdx.x][0], nbindb);   // do some computation and store data in sHMD[threadIdx.x][]

	__syncthreads();

	int vala = (DIM_STRIDE * (threadIdx.x + (blockDim.x * blockIdx.y))) + (DIM_PLINTH * blockIdx.x);

	for (i = 0; i < DIM_PLINTH; i++)
		HMD[vala + i] = sHMD[threadIdx.x][i];   // copy to global
}

takes 2457 ms.

Cutting off the final loop that copies the results to global memory, we still get 2507 ms.
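For what it's worth, once the tile is sitting in shared memory the final loop could also be turned around so that, for each row, consecutive threads write consecutive global addresses. A rough, untested sketch of that idea (mine, not from the thread), assuming blockDim.x == MAX_BLOCK_DIM == DIM_PLINTH == 64 as the defines suggest:

	__syncthreads();   // every thread's row of sHMD must be complete

	// Drain the tile one row per iteration: thread t writes column t of the
	// current row, so a half-warp's 16 stores hit 16 consecutive addresses.
	for (i = 0; i < MAX_BLOCK_DIM; i++) {
		int row = i + blockDim.x * blockIdx.y;             // tile row -> global row
		int col = DIM_PLINTH * blockIdx.x + threadIdx.x;   // this thread's global column
		HMD[DIM_STRIDE * row + col] = sHMD[i][threadIdx.x];
	}

This writes exactly the same set of elements as the original loop, only with a different thread-to-element assignment; whether it actually helps here also depends on the occupancy issue discussed below.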

Hi,

Two things are immediately visible.

for (i = 0; i < (nDrawCirle) + 1; i++)
	peakLoop(lut_main, &d_peakCalc_array[i], lHMD, nbindb);

1. Your gmem access here is not coalesced, which is a big performance penalty: all threads read the same addresses over and over again (see the sketch below for one possible workaround).

2. As sergeyn suggested, the shared memory allocation is 64 * 64 * sizeof(short) = 8192 bytes, which means only 2 blocks can run at a time, decreasing your occupancy (also check the register usage to see how much occupancy you are really getting).
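To illustrate point 1, one possible shape for that loop (untested, and assuming t_peakCalc is a small plain struct that peakLoop only reads and that peakLoop can be handed a pointer into shared memory) is to let a single thread fetch the current element once per block:

	__shared__ t_peakCalc curPeak;   // one copy per block instead of one global read per thread

	for (i = 0; i < (nDrawCirle) + 1; i++) {
		if (threadIdx.x == 0)
			curPeak = d_peakCalc_array[i];   // single global read per iteration
		__syncthreads();                     // make curPeak visible to all threads

		peakLoop(lut_main, &curPeak, lHMD, nbindb);

		__syncthreads();                     // don't overwrite curPeak while others still use it
	}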

hope that helps

eyal

Don't forget the at least 16 bytes of shared mem used for the kernel parameters, so at best you have 1 block active, and the other 8176 bytes go unused.