I am having trouble using shared memory.
I collect some data in local memory and, before writing it to global memory, I would like to sort it using shared memory to get coalesced writes.
Here is pseudocode for a kernel that copies the data from local to shared memory:
short lHMD[DIM_COLUMN];                        // local memory
__shared__ short sHMD[BLOCK_SIZE][DIM_COLUMN]; // shared memory

for (i = 0; i < (nDrawCirle) + 1; i++) {
    /* FILL lHMD IN SOME WAY */
}

__syncthreads();

/* NOW COPY THE DATA FROM THE LOCAL MEMORY TO THE SHARED MEMORY */
for (i = 0; i < DIM_COLUMN; i++)
    sHMD[threadIdx.x][i] = lHMD[i];
Copying the data from local memory directly into global memory (in an uncoalesced way) is faster than the code above.
On a GTX275 the run time increased from 0.7s to 2s. Note that the code above doesn’t write any data to global memory!!!
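For reference, here is a minimal sketch of the complete pattern I am aiming for (stage in shared memory, then write to global memory with consecutive threads hitting consecutive addresses). The sizes, the PAD column, the fill values, and the names stageAndWrite and runStageAndWrite are all assumptions made up for this example; they are not from my real kernel. The padding is only a guess at something that may reduce shared memory bank conflicts, and I have not confirmed that bank conflicts are what slows the code above.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumed sizes for illustration only; the real values are not shown in the post.
#define BLOCK_SIZE 64
#define DIM_COLUMN 16
#define PAD 2  // assumption: extra shorts per row, so the row stride is 9 32-bit words (odd)

// Hypothetical kernel: each thread fills one row of DIM_COLUMN shorts, the block
// stages all rows in shared memory, then the block writes the tile out together.
__global__ void stageAndWrite(short *out)
{
    __shared__ short sHMD[BLOCK_SIZE][DIM_COLUMN + PAD];
    short lHMD[DIM_COLUMN];

    int row = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    // Placeholder for the real computation: each value encodes (row, column).
    for (int i = 0; i < DIM_COLUMN; i++)
        lHMD[i] = (short)(row * DIM_COLUMN + i);

    // Stage this thread's row in shared memory.
    for (int i = 0; i < DIM_COLUMN; i++)
        sHMD[threadIdx.x][i] = lHMD[i];

    // Every row must be staged before any thread reads another thread's row.
    __syncthreads();

    // Cooperative write: on each pass, thread t writes element pass * BLOCK_SIZE + t,
    // so neighbouring threads write neighbouring global addresses.
    short *blockOut = out + blockIdx.x * BLOCK_SIZE * DIM_COLUMN;
    for (int k = threadIdx.x; k < BLOCK_SIZE * DIM_COLUMN; k += BLOCK_SIZE)
        blockOut[k] = sHMD[k / DIM_COLUMN][k % DIM_COLUMN];
}

// Hypothetical host helper: launches numBlocks blocks and returns the output array.
std::vector<short> runStageAndWrite(int numBlocks)
{
    assert(numBlocks > 0);
    size_t n = (size_t)numBlocks * BLOCK_SIZE * DIM_COLUMN;
    std::vector<short> host(n);
    short *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(short));
    stageAndWrite<<<numBlocks, BLOCK_SIZE>>>(dev);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling runStageAndWrite(4) should give an array in which every element equals its own index, which is an easy way to check that the staging step did not scramble the rows. Note that the __syncthreads() sits after the copy into shared memory, not before it as in my pseudocode.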