Shared Memory Problem: is memory shared only within blocks?

From what I understand from the CUDA programming guide, each block has its own bank of 16 KB of shared memory. Threads in block 1 do not have access to the shared memory in block 2, and so on. I wrote some code that relies on this extensively, yet when I run it, it seems as if all the threads communicate through the same shared memory, regardless of which block they reside in.

Let me see if I can construct a stripped-down example of what I'm trying to say…

// somewhere else, in the calling code
dim3 grid(16, 1, 1);
dim3 block(8, 1, 1);
f<<< grid, block, some_extra_memory >>>( d_d, d_o );

// ...

__global__ void f( some_global_data *d, some_output_data *o ) {
    __shared__ float4 localdata[8];
    __shared__ float4 calculated_data[8];

    localdata[threadIdx.x] = d->data[ some_calculated_offset ];
    __syncthreads();

    // perform some calculation on that shared little bit of memory
    // and place the result back in shared memory (calculated_data)
    __syncthreads();

    o->data[ another_calculated_offset ] = calculated_data[threadIdx.x];
}

When I run this in emulation mode, the data I get along the way seems to compound, keeping the previous data (which leads me to believe the threads are communicating across shared memory boundaries). Where has my thinking gone wrong?

Thanks!

Emulation mode is different from running on the device, so that may explain what you are seeing. I have read posts from people whose code worked in emulation mode but failed on the device; it usually turned out they were communicating across blocks through shared memory, which only appears to work in emulation.

I understand that could be an issue; what I'm wondering is whether I'm thinking about shared memory incorrectly.

No, you are completely spot on: shared memory is only visible to the threads that belong to a single block.
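For instance, here is a small self-contained sketch (the kernel and names are mine, not your code) where every block sums its own eight input elements through shared memory and writes exactly one result per block; no block ever sees another block's partial sums:

// launched e.g. as per_block_sum<<< 16, 8 >>>( d_in, d_sums );
__global__ void per_block_sum( const float *in, float *block_sums )
{
    __shared__ float partial[8];                 // one private copy per block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = in[gid];
    __syncthreads();

    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)     // blockDim.x == 8 here
            s += partial[i];
        block_sums[blockIdx.x] = s;              // one output element per block
    }
}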

You should initialize shared memory at the beginning of the kernel.
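Something along these lines, sketched against your kernel (the names are only placeholders mirroring your stripped-down example):

__global__ void f( some_global_data *d, some_output_data *o ) {
    __shared__ float4 calculated_data[8];

    // each thread clears its own element before anything reads it
    calculated_data[threadIdx.x] = make_float4( 0.0f, 0.0f, 0.0f, 0.0f );
    __syncthreads();

    // ... load, compute, and write out as before ...
}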

In emulation mode, it seems from your test that each block is run sequentially on the same shared memory area. On the device this won't be the case, but after one block completes, another block may be launched that reuses the same shared memory as the one that just finished.
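To illustrate with a hypothetical kernel (not your code): something like the one below shows exactly the compounding you describe in emulation mode, because nothing ever clears accum before it is read, so each emulated block starts from whatever the previously emulated block left in the same buffer. On the device the initial contents are simply undefined.

__global__ void compounding_pitfall( float *out )
{
    __shared__ float accum[8];

    // read-modify-write of shared memory that was never initialized
    accum[threadIdx.x] += 1.0f;
    __syncthreads();

    out[blockIdx.x * blockDim.x + threadIdx.x] = accum[threadIdx.x];
}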