Better performance with smaller block size & bank conflicts !!!

I’m trying to do matrix multiplication A=W*B using shared memory. My code is below:

__global__ void forward(float* B, float* A, float* W, int Wcols, int Wrows, int Bcols){
	__shared__ float _W[MAXBLOCKSIZE][MAXBLOCKSIZE];
	__shared__ float _B[MAXBLOCKSIZE][MAXBLOCKSIZE];

	int row = blockDim.y*blockIdx.y + threadIdx.y;
	int col = blockDim.x*blockIdx.x + threadIdx.x;
	int i, j;

	float c = 0;
	for (i = 0; i < (MAXBLOCKSIZE + Wcols - 1)/MAXBLOCKSIZE; i++){
		// Load one tile of W, zero-padding out-of-range elements
		if (i*MAXBLOCKSIZE + threadIdx.x < Wcols && row < Wrows)
			_W[threadIdx.y][threadIdx.x] = W[row*Wcols + i*MAXBLOCKSIZE + threadIdx.x];
		else
			_W[threadIdx.y][threadIdx.x] = 0.0;
		// Load one tile of B, zero-padding out-of-range elements
		if (i*MAXBLOCKSIZE + threadIdx.y < Wcols && col < Bcols)
			_B[threadIdx.y][threadIdx.x] = B[(i*MAXBLOCKSIZE + threadIdx.y)*Bcols + col];
		else
			_B[threadIdx.y][threadIdx.x] = 0.0;
		__syncthreads();

		// Zero-padded tiles make the accumulation safe for every thread
		for (j = 0; j < MAXBLOCKSIZE; j++)
			c += _W[threadIdx.y][j] * _B[j][threadIdx.x];
		__syncthreads();
	}
	if (row < Wrows && col < Bcols)
		A[row*Bcols + col] = c;
}

The problem is that my code runs faster with 16x16 blocks and slower with 32x32 blocks. Why does this happen, given that 16x16 blocks should also create 16-way bank conflicts when accessing _W[threadIdx.y][j]?

It might be occupancy. That can’t be confirmed without compilable code.

Serious efforts at performance analysis can’t be made without compilable code, IMO.

You might also want to look into the concept of padding to minimize shared-memory bank conflicts.