Shared memory vs global memory

I call my kernel with 2 blocks of 64 threads each. Each of those 64 threads needs read access to the data of the other 63 threads in its block, but the blocks are independent of each other.

I thought the right way was to copy the data from global memory to shared memory, do the computation, and copy the result back?

All this because shared memory is the fastest, isn’t it?

As long as I use 1 block, everything works well, but when I use 2 or more, I get more or less random data.

Any hint?

__global__ void testKernel( float* g_idata, float* g_odata)
{
    extern __shared__ SQUARE sdata[];

    const unsigned int tid = threadIdx.x;
    const unsigned int gid = blockIdx.x;

    SQUARE* pBoard = (SQUARE*) &g_idata[gid*64];
    for ( int n =0; n< 64; n++ )
        sdata[n+gid*64].data = pBoard[n].data;

    SQUARE* psBoard = (SQUARE*) &sdata[gid*64];

    ExecuteGPUFunc( psBoard, tid );

    SQUARE* pBoardOut = (SQUARE*)&g_odata[gid*64];
    for ( int n =0; n< 64; n++ )
        pBoardOut[n].data = psBoard[n].data;
}

dim3 grid( 2, 1 , 1);
dim3 threads( 64, 1, 1);
testKernel<<< grid, threads, mem_size >>>( d_idata, d_odataTemp);

Chuck out the for loops, substitute tid for n, and it might be happier. Is the hardware complaining about being asked to write the same location 64 times?

But when I use “tid” for “n”, only one element gets copied. I need all 64 within the block.

You have 64 threads for 64 elements, so each thread has to read only one element. You do not need a for-loop.

And you do not need the offset of 64*gid when accessing shared memory. Each block has its own shared memory.

So replace

for ( int n =0; n< 64; n++ )
    sdata[n+gid*64].data = pBoard[n].data;

with

sdata[tid].data = pBoard[tid].data;

All 64 elements should be copied this way.
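Putting the advice above together, the whole kernel could look roughly like the sketch below. It is only an illustration under assumptions: the thread never shows the definition of SQUARE or ExecuteGPUFunc, so `Square` (a struct with a single float `data` member), `sumBoard` (a stand-in for the per-thread computation, which here just sums all 64 squares of the block's board), and the host helper `runBoards` are hypothetical names. The `__syncthreads()` call is not mentioned in the thread; it is added because each thread reads elements that other threads of the block wrote to shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BOARD_SIZE 64  // one thread per square, as in the question

// Hypothetical stand-in for the SQUARE type, whose definition is not shown.
struct Square { float data; };

// Hypothetical stand-in for ExecuteGPUFunc: each thread reads all 64
// squares of its block's board and returns their sum.
__device__ float sumBoard(const Square* board)
{
    float sum = 0.0f;
    for (int n = 0; n < BOARD_SIZE; n++)
        sum += board[n].data;
    return sum;
}

__global__ void testKernel(const Square* g_idata, Square* g_odata)
{
    // Shared memory is private to each block, so it is indexed by tid only.
    extern __shared__ Square sdata[];

    const unsigned int tid = threadIdx.x;
    const unsigned int gid = blockIdx.x;

    // Global memory holds all boards, so the block offset is needed here.
    sdata[tid].data = g_idata[gid * BOARD_SIZE + tid].data;

    // Wait until all 64 threads of the block have stored their element.
    __syncthreads();

    g_odata[gid * BOARD_SIZE + tid].data = sumBoard(sdata);
}

// Hypothetical host helper: runs nBoards blocks of 64 threads each.
// Square holds exactly one float, so a float array can be copied as-is.
std::vector<float> runBoards(const std::vector<float>& in, int nBoards)
{
    const size_t bytes = nBoards * BOARD_SIZE * sizeof(Square);
    Square* d_in = nullptr;
    Square* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    // The third launch argument is the shared memory size per block,
    // not for the whole grid.
    const size_t mem_size = BOARD_SIZE * sizeof(Square);
    testKernel<<<nBoards, BOARD_SIZE, mem_size>>>(d_in, d_out);

    std::vector<float> out(nBoards * BOARD_SIZE);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `runBoards(input, 2)` reproduces the two-block launch from the question. Because no block ever touches another block's shared memory, the result for each board depends only on that board's 64 input values.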

It took a while, but I understand it now. Thanks a lot.

It works perfectly, but only for one chunk of 64 threads.

Perhaps my configuration is wrong.

dim3 grid( nBoards, 1 , 1);
dim3 threads( num_threads, 1, 1);

// execute the kernel
testKernel<<< grid, threads, mem_size >>>( d_idata, d_odataTemp);

In the docs I read that the third argument (“mem_size”) is the size of shared memory. But in some samples the third argument is omitted even though shared memory is used in the kernel.

The third argument allocates the shared memory that is bound to shared variables declared as extern. If you put the shared array size directly into the code, you don’t need this argument. Example: this does not need the argument

__shared__ int A[100];

this does

extern __shared__ int A[];

The advantage of the latter is that you can control it at runtime. Read the manual for how to get the address correct in the kernel.
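As a rough illustration of getting the address right: all `extern __shared__` arrays in a kernel start at the same address, so a kernel that needs more than one dynamically sized array usually declares a single extern buffer and carves it up with manual offsets. The sketch below assumes that pattern; the kernel name `twoArrays`, the helper `runTwoArrays`, and the values stored in the arrays are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void twoArrays(int* g_out, int n)
{
    // One dynamically sized buffer; its size in bytes comes from the
    // third launch argument.
    extern __shared__ float smem[];

    // Carve the buffer into two arrays by hand: n floats, then n ints.
    float* a = smem;
    int*   b = reinterpret_cast<int*>(smem + n);

    const int tid = threadIdx.x;
    a[tid] = 0.5f * tid;
    b[tid] = tid;

    // Make every thread's writes visible to the rest of the block.
    __syncthreads();

    // Read values written by a different thread of the same block.
    const int other = n - 1 - tid;
    g_out[blockIdx.x * n + tid] = b[other] + static_cast<int>(2.0f * a[other]);
}

// Hypothetical host helper: nBlocks blocks of n threads each.
std::vector<int> runTwoArrays(int nBlocks, int n)
{
    const size_t bytes = nBlocks * n * sizeof(int);
    int* d_out = nullptr;
    cudaMalloc(&d_out, bytes);

    // Shared memory size per block, chosen at runtime.
    const size_t mem_size = n * sizeof(float) + n * sizeof(int);
    twoArrays<<<nBlocks, n, mem_size>>>(d_out, n);

    std::vector<int> out(nBlocks * n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

With a fixed-size declaration such as `__shared__ int A[100];` none of this offset bookkeeping is needed, which is the trade-off for being able to choose the size at launch time.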