From the CUDA Best Practices Guide, page 29:
__global__ void coalescedMultiply(float *a, float *b, float *c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
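For reference, a minimal host-side Python sketch of the arithmetic the kernel performs (hypothetical, assuming TILE_DIM = 16 as in the discussion below; a is N x TILE_DIM, b is TILE_DIM x N, and c is N x N, all stored row-major as flat lists):

```python
TILE_DIM = 16  # assumed tile width, matching the half-warp discussion below

def coalesced_multiply_sim(a, b, N):
    """Simulate coalescedMultiply on the host: c[row][col] is the dot
    product of row `row` of a (staged through aTile) with column `col` of b."""
    c = [[0.0] * N for _ in range(N)]
    for row in range(N):
        # One row of aTile, as loaded by threads threadIdx.x = 0..TILE_DIM-1
        a_tile_row = [a[row * TILE_DIM + tx] for tx in range(TILE_DIM)]
        for col in range(N):
            c[row][col] = sum(a_tile_row[i] * b[i * N + col]
                              for i in range(TILE_DIM))
    return c
```

This is only an illustration of the indexing; it does not model the parallel execution or the shared-memory timing that the question is about.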
Why is a __syncthreads() barrier synchronisation after aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x]; NOT necessary in the above kernel?
The load above will take some time. If we don't put a barrier after it and start executing the following instructions

for (int i = 0; i < TILE_DIM; i++) {
    sum += aTile[threadIdx.y][i] * b[i*N+col];
}

while the data copy is still in progress, won't I get wrong values in sum? For example, the value aTile[14][0] could be used in the sum while it has not yet been loaded into aTile (or the load is still in progress).
16 is half a warp size, so threads 0-15 always execute together on the hardware! There is no need to sync between them, since each sums over the same row of the tile that it and its half-warp neighbours loaded.
A thread with index (threadIdx.y, threadIdx.x) performing the sum uses only the values aTile[threadIdx.y][0] through aTile[threadIdx.y][15]. These values are loaded by the threads with indices (threadIdx.y, 0) through (threadIdx.y, 15); in other words, by threads in the same half-warp as the thread doing the sum. This guarantees that the loads have been done (or the half-warp would still be performing the load and not the sum). It doesn't matter how far the other threads in the block have got, because they are loading (and then summing) values of aTile[y] where y != threadIdx.y.
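The argument above can be sketched as a small check (hypothetical, assuming TILE_DIM = 16 and the classic half-warp size of 16 threads): for every thread in a 16x16 block, each aTile element it reads in the sum loop was written by a thread in the same half-warp, so half-warp lockstep execution alone guarantees the load happened first.

```python
TILE_DIM = 16   # assumed tile width
HALF_WARP = 16  # half of the 32-thread warp on the hardware discussed

def half_warp_of(ty, tx):
    """Half-warp index of thread (ty, tx): linear thread id // half-warp size."""
    return (ty * TILE_DIM + tx) // HALF_WARP

def reads_stay_in_half_warp():
    """True if every aTile[ty][i] read by thread (ty, tx) during the sum
    was written by a thread, (ty, i), in the reader's own half-warp."""
    for ty in range(TILE_DIM):
        for tx in range(TILE_DIM):
            reader = half_warp_of(ty, tx)
            for i in range(TILE_DIM):
                if half_warp_of(ty, i) != reader:  # writer of aTile[ty][i]
                    return False
    return True
```

The check passes precisely because one tile row maps onto one half-warp; with a different TILE_DIM, or on hardware where warp-synchronous execution is not guaranteed, this reasoning would no longer hold and __syncthreads() would be required.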