Hi,
I’m running a simple kernel to reverse arrays using shared memory, à la "Supercomputing for the Masses: Part 3". Below is my kernel:
__global__ void reveseArrayUsingSharedMem(float *d_in, float *d_out)
{
    // Dynamically sized shared memory; the size is set at kernel launch.
    extern __shared__ float s[];
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadIdx.x % 10 == 0)
        printf("I am threadId %d in blockId %d\n", threadIdx.x, blockIdx.x);
    s[threadIdx.x] = d_in[tid];
    //__syncthreads();
    // Mirror the global index so the whole array ends up reversed.
    int rtid = blockDim.x * gridDim.x - blockDim.x * blockIdx.x - threadIdx.x - 1;
    d_out[rtid] = s[threadIdx.x];
}
I’ve run this with over 5 million elements (512 threads per block, 10240 blocks) and I always get the correct answer whether or not I include the blocking call to __syncthreads(). How come? Shouldn’t I see incorrect results if the writes to shared memory are not synchronized? I’ve even added the print statement above for certain threads to slow down their writes to shared memory, but I don’t see any difference.
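For comparison, here is a rough sketch of a variant where I would assume the barrier is needed, because each thread reads a shared-memory slot written by a different thread (in my kernel above, each thread only reads back the slot it wrote itself). The names reverseArrayMirrored and reverseAndCheck are hypothetical, made up for this illustration, and the host helper just fills the input with 0, 1, 2, ... and checks the output on the CPU.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant: each thread writes the mirrored slot of shared memory,
// then reads its own slot, which was written by a different thread of the block.
__global__ void reverseArrayMirrored(const float *d_in, float *d_out)
{
    extern __shared__ float s[];
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    s[blockDim.x - 1 - threadIdx.x] = d_in[in];  // write the mirrored slot
    __syncthreads();                             // wait for every write in the block
    // Blocks are placed in reverse order; the in-block reversal came from shared memory.
    int out = blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x;
    d_out[out] = s[threadIdx.x];                 // read a slot another thread wrote
}

// Hypothetical host helper: runs the kernel and verifies the result on the CPU.
bool reverseAndCheck(int blocks, int threads)
{
    const int n = blocks * threads;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    // Third launch parameter: bytes of dynamic shared memory per block.
    reverseArrayMirrored<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);
    // This copy waits for the kernel to finish and reports any launch error.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[n - 1 - i]) return false;
    return true;
}
```

Calling reverseAndCheck(10240, 512) would match the launch configuration I described above. I have not confirmed that commenting out the barrier in this variant gives visibly wrong output on any particular GPU; my understanding is only that the result would no longer be guaranteed.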
Thanks