Why is barrier synchronisation not needed here? (Best Practices Guide, page 29)

From the Best Practices Guide, page 29:

__global__ void coalescedMultiply(float *a, float *b, float *c,
                                  int N)
{
  __shared__ float aTile[TILE_DIM][TILE_DIM];

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;
  aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
  for (int i = 0; i < TILE_DIM; i++) {
    sum += aTile[threadIdx.y][i] * b[i*N+col];
  }
  c[row*N+col] = sum;
}

Why is a __syncthreads() barrier after aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x]; NOT necessary in the above kernel?

So… it must have something to do with the value of TILE_DIM and the number of threads per block spawned…

OK. blockDim.x, blockDim.y, and TILE_DIM are all 16, so one block consists of 256 threads.
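For concreteness, a minimal launch sketch matching those numbers (the device pointers d_a, d_b, d_c and the assumption that N is a multiple of TILE_DIM are mine, not from the guide):

#define TILE_DIM 16

// A 16x16 block = 256 threads = 8 warps of 32, i.e. 16 half-warps of 16.
dim3 block(TILE_DIM, TILE_DIM);
// Hypothetical grid: one block per output tile, assuming N % TILE_DIM == 0.
dim3 grid(N / TILE_DIM, N / TILE_DIM);
coalescedMultiply<<<grid, block>>>(d_a, d_b, d_c, N);

Each row of the tile, aTile[threadIdx.y][...], is therefore loaded by exactly one half-warp of 16 threads.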

Now let's say warp 0 is running and executing the following:

aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];

The above transfer will take some time. Now, if we don't put a barrier and start the following instructions

for (int i = 0; i < TILE_DIM; i++) {
  sum += aTile[threadIdx.y][i] * b[i*N+col];
}

while the data copy is still in progress, don't you think I will get wrong values in sum? For example, the value aTile[14][0] might be used in the sum while it has not yet been loaded into aTile… (or the load is still in progress).

16 is half a warp size, so threads 0-15 always execute together on the hardware! Since they all work on the same row, which they themselves loaded, there is no need to sync between them.

A thread with index (threadIdx.y, threadIdx.x) performing the sum uses only the values aTile[threadIdx.y][0] through aTile[threadIdx.y][15]. These values are loaded by the threads with indices (threadIdx.y, 0) through (threadIdx.y, 15), in other words, by threads in the same half-warp as the thread doing the sum. Because a half-warp executes in lockstep, this guarantees that the loads have been done (otherwise the half-warp would still be performing the load and not the sum). It doesn't matter how far the other threads in the block have got, because they are loading (and then summing) values of aTile[y] where y != threadIdx.y.
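For contrast, here is a sketch of the case where the barrier IS needed, modelled on the guide's follow-up example (sharedABMultiply), in which b is tiled as well: each thread then reads bTile[i][threadIdx.x] for every i, i.e. values loaded by threads in other rows and hence in other half-warps, so a __syncthreads() must separate the loads from the sums:

__global__ void sharedABMultiply(float *a, float *b, float *c,
                                 int N)
{
  __shared__ float aTile[TILE_DIM][TILE_DIM], bTile[TILE_DIM][TILE_DIM];

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;
  aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
  bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y*N+col];
  __syncthreads();  // bTile[i][threadIdx.x] is written by row i's half-warp,
                    // so every thread must wait for the whole block's loads
  for (int i = 0; i < TILE_DIM; i++) {
    sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
  }
  c[row*N+col] = sum;
}

(Worth noting: later revisions of the guide add a __syncwarp() after the aTile load even in the original kernel, since implicit warp-synchronous execution is no longer guaranteed on recent architectures.)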

Thanks for the nice explanation :-)