Call to __syncthreads() not needed?

I’m running a simple kernel to reverse arrays using shared memory, à la "Supercomputing for the Masses: Part 3". Below is my kernel:

__global__ void reveseArrayUsingSharedMem(float *d_in, float *d_out)
{
    extern __shared__ float s[];
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadIdx.x % 10 == 0)
        printf("I am threadId %d in blockId %d\n", threadIdx.x, blockIdx.x);

    // Stage this thread's element in shared memory.
    s[threadIdx.x] = d_in[tid];

    // Mirror the global index across the whole array.
    int rtid = blockDim.x*gridDim.x - blockDim.x*blockIdx.x - threadIdx.x - 1;
    d_out[rtid] = s[threadIdx.x];
}


I’ve run this with over 5 million elements (threads = 512, blocks = 10240) and I always get the correct answer whether or not I include the blocking call to __syncthreads(). How come? Shouldn’t I see failures if the writes to shared memory are not synced? I’ve even added the print statement above for certain threads to ‘slow down’ their writes to shared memory, but I don’t see any difference.


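Your kernel requires neither thread coordination nor resource sharing, so there is no need for explicit synchronization. Each thread reads back only the shared-memory element s[threadIdx.x] that it wrote itself, so no thread ever depends on another thread's write.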
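For contrast, here is a minimal sketch of a variant where the barrier does matter. It is not the kernel from the article or from your post: the names reverseWithinBlock and reverseOnDevice are made up for illustration, the input length is assumed to be an exact multiple of the block size, and error checking is left out for brevity. Each thread writes s[threadIdx.x] but then reads an element that a different thread wrote.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every block reverses its own tile in shared memory.
// Thread i writes s[i] but reads s[blockDim.x - 1 - i], which another thread
// wrote, so a barrier is needed between the two steps.
__global__ void reverseWithinBlock(const float *d_in, float *d_out)
{
    extern __shared__ float s[];
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    s[threadIdx.x] = d_in[tid];
    __syncthreads();  // all writes to s must finish before any cross-thread read

    // Blocks are written out in reverse order; the tile is reversed via s.
    int outBase = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    d_out[outBase + threadIdx.x] = s[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical host helper. Assumes h_in.size() is a multiple of `threads`.
std::vector<float> reverseOnDevice(const std::vector<float> &h_in, int threads)
{
    const size_t n = h_in.size();
    const size_t bytes = n * sizeof(float);
    const int blocks = static_cast<int>(n / threads);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory, one float per thread.
    reverseWithinBlock<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

In this version, removing __syncthreads() creates a data race: a thread may read its mirror slot before the owning thread has written it. The output can still look correct on some runs, which is why a test passing without the barrier does not show that the barrier is unnecessary.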

Well, that makes a lot of sense :) Thanks for having a look.