Why is barrier synchronisation not needed here? (Best Practices Guide, page 29)

From the Best Practices Guide, page 29:

__global__ void coalescedMultiply(float *a, float *b, float *c,
                                  int N)
{
  __shared__ float aTile[TILE_DIM][TILE_DIM];

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;
  aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
  for (int i = 0; i < TILE_DIM; i++) {
    sum += aTile[threadIdx.y][i] * b[i*N+col];
  }
  c[row*N+col] = sum;
}

Why is a __syncthreads() barrier after aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x]; NOT necessary in the above kernel?

So… it must have something to do with the value of TILE_DIM and the number of threads per block spawned…

OK. blockDim.x, blockDim.y, and TILE_DIM are all 16, so one block consists of 256 threads.
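For concreteness, a minimal launch sketch matching those numbers (the device pointers d_a, d_b, d_c and the assumption that N is a multiple of TILE_DIM are mine, not from the guide):

#define TILE_DIM 16

// A 16x16 block = 256 threads = 8 warps of 32, i.e. 16 half-warps of 16.
dim3 block(TILE_DIM, TILE_DIM);
// Hypothetical grid: one block per output tile, assuming N % TILE_DIM == 0.
dim3 grid(N / TILE_DIM, N / TILE_DIM);
coalescedMultiply<<<grid, block>>>(d_a, d_b, d_c, N);

Each row of the tile, aTile[threadIdx.y][...], is therefore loaded by exactly one half-warp of 16 threads.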

Now let's say warp 0 is running and executing the following:

aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];

The above transfer will take some time. Now, if we don't put a barrier and start the following instructions

for (int i = 0; i < TILE_DIM; i++) {
  sum += aTile[threadIdx.y][i] * b[i*N+col];
}

while the data copy is still in progress, don't you think I will get wrong values in sum? For example, the value aTile[14][0] might be used in the sum while it has not yet been loaded into aTile… (or the load is still in progress).

16 is half a warp size, so threads 0-15 always execute together on the hardware! Since they all work on the same row, which they themselves loaded, there is no need to sync between them.

A thread with index (threadIdx.y, threadIdx.x) performing the sum uses only the values aTile[threadIdx.y][0] through aTile[threadIdx.y][15]. These values are loaded by the threads with indices (threadIdx.y, 0) through (threadIdx.y, 15), in other words, by threads in the same half-warp as the thread doing the sum. Because a half-warp executes in lockstep, this guarantees that the loads have been done (otherwise the half-warp would still be performing the load and not the sum). It doesn't matter how far the other threads in the block have got, because they are loading (and then summing) values of aTile[y] where y != threadIdx.y.
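For contrast, here is a sketch of the case where the barrier IS needed, modelled on the guide's follow-up example (sharedABMultiply), in which b is tiled as well: each thread then reads bTile[i][threadIdx.x] for every i, i.e. values loaded by threads in other rows and hence in other half-warps, so a __syncthreads() must separate the loads from the sums:

__global__ void sharedABMultiply(float *a, float *b, float *c,
                                 int N)
{
  __shared__ float aTile[TILE_DIM][TILE_DIM], bTile[TILE_DIM][TILE_DIM];

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;
  aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
  bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y*N+col];
  __syncthreads();  // bTile[i][threadIdx.x] is written by row i's half-warp,
                    // so every thread must wait for the whole block's loads
  for (int i = 0; i < TILE_DIM; i++) {
    sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
  }
  c[row*N+col] = sum;
}

(Worth noting: later revisions of the guide add a __syncwarp() after the aTile load even in the original kernel, since implicit warp-synchronous execution is no longer guaranteed on recent architectures.)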

Thanks for the nice explanation :-)