I am learning matrix multiplication with CUDA, and I wrote a kernel to calculate the product of matrix A and matrix B. The height of A is M (4096 * 3) and the width of A is K (4096 * 4); the height of B is K (4096 * 4) and the width of B is N (4096 * 2). The result of the multiplication is C with M * N elements. I use two shared-memory tiles: matrix As with size BLOCK_SIZE_M * BLOCK_SIZE_K, and matrix Bs with size BLOCK_SIZE_K * BLOCK_SIZE_N. When I set BLOCK_SIZE_M, BLOCK_SIZE_K, and BLOCK_SIZE_N to the same value, I get the correct result. But if the BLOCK_SIZEs are not all the same, I get wrong numbers. I set the grid and block dimensions as follows:

```
dim3 dimBlock(BLOCK_SIZE_N, BLOCK_SIZE_M);
dim3 dimGrid((B.width + BLOCK_SIZE_N - 1) / BLOCK_SIZE_N,
             (A.height + BLOCK_SIZE_M - 1) / BLOCK_SIZE_M);
```

Here is my kernel:

```
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

__global__ void MatMulKernel(const Matrix A, const Matrix B, Matrix C) {
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int ty = threadIdx.y;
    int tx = threadIdx.x;
    __shared__ float As[BLOCK_SIZE_M][BLOCK_SIZE_K];
    __shared__ float Bs[BLOCK_SIZE_K][BLOCK_SIZE_N];
    float CValue = 0;
    // Walk over the K dimension one tile at a time.
    for (int m = 0; m < A.width / BLOCK_SIZE_K; ++m) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[ty][tx] = A.elements[row * A.width + m * BLOCK_SIZE_K + tx];
        Bs[ty][tx] = B.elements[(m * BLOCK_SIZE_K + ty) * B.width + col];
        __syncthreads();
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE_K; ++k) {
            CValue += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();
    }
    C.elements[row * B.width + col] = CValue;
}
```

For example, when every element of matrix A is 1 and every element of matrix B is 2, with BLOCK_SIZE_M = 32, BLOCK_SIZE_K = 32, and BLOCK_SIZE_N = 32 I get the correct result: every element of C is 32768 (which is 2 * K). But with BLOCK_SIZE_M = 64, BLOCK_SIZE_K = 32, and BLOCK_SIZE_N = 32 I get wrong results. So, what is the problem?

Am I missing something?