From what I understand from the CUDA programming guide, each block has its own unique bank of 16 KB of shared memory. Threads in block 1 do not have access to the shared memory in block 2, and so on. I wrote some code that uses this concept extensively, yet when I run it, it seems as if all the threads communicate through the same shared memory, regardless of which block they reside in.
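Just to spell out the model I have in mind, here is a toy kernel (made-up names, not my real code) written the way I would expect per-block shared memory to behave: each block stages its own 8 elements into its own copy of the __shared__ array, and the reversal at the end only ever touches that block's copy.

__global__ void per_block_tile( const float *in, float *out ) {
    __shared__ float tile[8];                        // one private copy per block (or so I thought)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];                     // stage this block's slice of the input
    __syncthreads();                                 // syncs only the threads within this block
    out[gid] = tile[blockDim.x - 1 - threadIdx.x];   // reverse within the block; assumes blockDim.x == 8
}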
Now let me see if I can construct a stripped-down example of the code I'm actually running…
// somewhere else, in the calling code
dim3 grid(16,1,1);
dim3 block(8,1,1);
f <<< grid, block, some_extra_memory>>>( d_d, d_o);
//...
__global__ void f( some_global_data *d, some_output_data *o ) {
    __shared__ float4 localdata[8];        // one copy per block, as I understood it
    __shared__ float4 calculated_data[8];

    // each thread loads its element into this block's shared memory
    localdata[threadIdx.x] = d->data[ some_calculated_offset ];
    __syncthreads();

    // perform some calculation on that shared little bit of memory
    // and place the result back in shared memory (calculated_data)
    __syncthreads();

    // write this block's results back out to global memory
    o->data[ another_calculated_offset ] = calculated_data[threadIdx.x];
}
When I run this in emulation mode, the data I see along the way seems to compound: values from earlier blocks are still there, which leads me to believe threads are communicating across the shared memory boundaries. Where has my thinking gone wrong?
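In case it helps, here is the kind of probe I was thinking of running to check this (again, made-up names, just a sketch): each thread copies out whatever is already sitting in its shared slot before overwriting it, then tags the slot with its block index, so on the host I can see whether a later block observes values left behind by an earlier one. Reading uninitialized shared memory obviously isn't meaningful for correctness; it's only there to see what, if anything, persists between blocks.

__global__ void probe_shared( float *leftovers, float *results ) {
    __shared__ float slot[8];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    leftovers[gid] = slot[threadIdx.x];      // whatever was in this slot before we wrote it
    __syncthreads();
    slot[threadIdx.x] = (float)blockIdx.x;   // tag the slot with this block's index
    __syncthreads();
    results[gid] = slot[threadIdx.x];        // should always come back as blockIdx.x
}

// launched the same way as above, e.g. probe_shared<<<16, 8>>>( d_leftovers, d_results );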
Thanks!