Why do the block sizes have to be set the same in the matrix multiply?

I am learning matrix multiplication with CUDA, and I wrote a kernel to calculate the product of matrix A and matrix B. The height of A is M (4096 * 3) and the width of A is K (4096 * 4). The height of B is K (4096 * 4) and the width of B is N (4096 * 2). The result of the multiplication is C with M * N elements. I use two shared memory tiles: matrix As with size BLOCK_SIZE_M * BLOCK_SIZE_K and matrix Bs with size BLOCK_SIZE_K * BLOCK_SIZE_N. When I set BLOCK_SIZE_M, BLOCK_SIZE_K and BLOCK_SIZE_N to the same value, I get the correct result, but if the BLOCK_SIZEs are not all the same, I get wrong numbers. I set the grid and block dimensions as follows,

dim3 dimBlock(BLOCK_SIZE_N, BLOCK_SIZE_M);
dim3 dimGrid((B.width + BLOCK_SIZE_N - 1) / BLOCK_SIZE_N,
                 (A.height + BLOCK_SIZE_M - 1) / BLOCK_SIZE_M);

Here is my kernel:

typedef struct{
  int width;
  int height;
  int stride;
  float* elements;
}Matrix;

__global__ void MatMulKernel(const Matrix A, const Matrix B, Matrix C){

  int row = blockDim.y * blockIdx.y + threadIdx.y;
  int col = blockDim.x * blockIdx.x + threadIdx.x;

  int ty = threadIdx.y;
  int tx = threadIdx.x;

  __shared__ float As[BLOCK_SIZE_M][BLOCK_SIZE_K];
  __shared__ float Bs[BLOCK_SIZE_K][BLOCK_SIZE_N];

  float CValue = 0;
  for(int m = 0; m < A.width / BLOCK_SIZE_K; ++m){

    As[ty][tx] = A.elements[row * A.width + m * BLOCK_SIZE_K + tx];
    Bs[ty][tx] = B.elements[(m * BLOCK_SIZE_K + ty) * B.width + col];

    __syncthreads();

    #pragma unroll
    for(int k = 0; k < BLOCK_SIZE_K; ++k){
      CValue += As[ty][k] * Bs[k][tx];
    }

    __syncthreads();

  }
  C.elements[row * B.width + col] = CValue;
}

For example, if all of the elements of matrix A are 1 and all of the elements of matrix B are 2, every element of C should be K * 1 * 2 = 16384 * 2 = 32768. With BLOCK_SIZE_M = 32, BLOCK_SIZE_K = 32, BLOCK_SIZE_N = 32 I get the correct result, 32768. But with BLOCK_SIZE_M = 64, BLOCK_SIZE_K = 32, BLOCK_SIZE_N = 32 I get wrong results. So, what is the problem?
Am I missing something?

Unless your block dimensions match the shared memory dimensions for each array:

__shared__ float As[BLOCK_SIZE_M][BLOCK_SIZE_K];
__shared__ float Bs[BLOCK_SIZE_K][BLOCK_SIZE_N];

you won’t get a proper load of shared memory here:

As[ty][tx] = A.elements[row * A.width + m * BLOCK_SIZE_K + tx];
Bs[ty][tx] = B.elements[(m * BLOCK_SIZE_K + ty) * B.width + col];

Each thread loads one element of As and one element of Bs. To make sure that all elements of As and all elements of Bs are loaded, the dimensions of As must match the block dimensions exactly, and the dimensions of Bs must match the block dimensions exactly. In your failing case (BLOCK_SIZE_M = 64, BLOCK_SIZE_K = 32, BLOCK_SIZE_N = 32) the block is 32 x 64 threads, so threadIdx.y runs up to 63 while Bs only has 32 rows; the Bs load then writes past the end of the shared array.

Unless BLOCK_SIZE_K is the same as both BLOCK_SIZE_M and BLOCK_SIZE_N, you won’t have satisfied this requirement.
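
If you want to keep rectangular tiles instead of forcing all three BLOCK_SIZEs to be equal, one option is to let each thread load more than one element, striding across the tile so that all of As and all of Bs get filled no matter what shape the thread block is. Below is a minimal sketch of that idea, reusing the Matrix struct and the dimBlock(BLOCK_SIZE_N, BLOCK_SIZE_M) launch from the question, and assuming (as in your example sizes) that M, N and K are exact multiples of the corresponding BLOCK_SIZEs; the kernel name and the loading loops are mine and untested, so treat it as a starting point rather than a verified drop-in replacement.

__global__ void MatMulKernelRect(const Matrix A, const Matrix B, Matrix C){

  int row = blockDim.y * blockIdx.y + threadIdx.y;  // global row in C, 0..M-1
  int col = blockDim.x * blockIdx.x + threadIdx.x;  // global column in C, 0..N-1

  int ty = threadIdx.y;  // 0..BLOCK_SIZE_M-1
  int tx = threadIdx.x;  // 0..BLOCK_SIZE_N-1

  __shared__ float As[BLOCK_SIZE_M][BLOCK_SIZE_K];
  __shared__ float Bs[BLOCK_SIZE_K][BLOCK_SIZE_N];

  float CValue = 0;
  for(int m = 0; m < A.width / BLOCK_SIZE_K; ++m){

    // As has BLOCK_SIZE_K columns but only BLOCK_SIZE_N threads per row,
    // so each thread strides across the columns of its row.
    for(int i = tx; i < BLOCK_SIZE_K; i += BLOCK_SIZE_N)
      As[ty][i] = A.elements[row * A.width + m * BLOCK_SIZE_K + i];

    // Bs has BLOCK_SIZE_K rows but BLOCK_SIZE_M threads per column,
    // so each thread strides across the rows of its column and never
    // writes past row BLOCK_SIZE_K - 1.
    for(int i = ty; i < BLOCK_SIZE_K; i += BLOCK_SIZE_M)
      Bs[i][tx] = B.elements[(m * BLOCK_SIZE_K + i) * B.width + col];

    __syncthreads();

    #pragma unroll
    for(int k = 0; k < BLOCK_SIZE_K; ++k)
      CValue += As[ty][k] * Bs[k][tx];

    __syncthreads();
  }

  C.elements[row * C.width + col] = CValue;
}

The point of the strided loops is only to decouple the tile shape from the block shape: every element of both tiles is written by exactly one thread, and a thread whose starting index is already at or beyond BLOCK_SIZE_K simply skips that load, which is why unequal BLOCK_SIZEs no longer corrupt shared memory.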