I am implementing blocked matrix multiplication in CUDA without using any shared memory, so the blocking is done directly on global memory. Inside the kernel I declare a small per-thread array, `regs`, indexed as a `block_factor` × `block_factor` tile, for temporary storage of results. The kernel code is as follows:

```
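__global__ void Mat_mul_kernel_blocked(float *mat_a, float *mat_b,
                                       float *mat_res, int size)
{
    int tx = threadIdx.x, ty = threadIdx.y, i, j, k, j1;
    // Top-left corner of the block_factor x block_factor output tile
    // handled by this thread.
    int row = ty * block_factor + blockIdx.y * block_factor * TILE;
    int col = tx * block_factor + blockIdx.x * block_factor * TILE;
    // Per-thread accumulators for the output tile, stored row-major.
    float regs[block_factor * block_factor];
    float temp;

    // Zero the accumulators.
    for (i = 0; i < block_factor; i++) {
        for (j = 0; j < block_factor; j++) {
            regs[i * block_factor + j] = 0;
        }
    }

    // Walk the inner dimension in chunks of block_factor.
    for (i = 0; i < (size / block_factor); i++) {
        for (j = 0; j < block_factor; j++) {
            for (j1 = 0; j1 < block_factor; j1++) {
                for (k = 0; k < block_factor; k++) {
                    temp += mat_a[(row + j) * size + (i * block_factor) + k] *
                            mat_b[(i * block_factor + k) * size + j1];
                }
                regs[j * block_factor + j1] += temp;
                temp = 0;
            }
        }
    }

    // Write the accumulated tile back to global memory.
    for (i = 0; i < block_factor; i++) {
        for (j = 0; j < block_factor; j++) {
            mat_res[(row + i) * size + (col + j)] = regs[i * block_factor + j];
        }
    }
}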
```
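To make "correct" and "wrong" concrete, the kernel output can be compared against a plain CPU product on a small matrix. The following is only a minimal host-side sketch: `matmul_reference` and `max_abs_diff` are hypothetical helper names that are not part of the code above, and the matrices are assumed to be square and stored row-major, as the kernel's indexing suggests.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: plain row-major CPU product res = a * b,
// where a, b and res are all n x n.
std::vector<float> matmul_reference(const std::vector<float>& a,
                                    const std::vector<float>& b, int n)
{
    const std::size_t N = static_cast<std::size_t>(n);
    std::vector<float> res(N * N, 0.0f);
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t k = 0; k < N; ++k) {
            const float a_rk = a[r * N + k];
            for (std::size_t c = 0; c < N; ++c) {
                res[r * N + c] += a_rk * b[k * N + c];
            }
        }
    }
    return res;
}

// Hypothetical helper: largest absolute element-wise difference
// over the common length of x and y.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y)
{
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
        worst = std::fmax(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}
```

A CPU reference at `size` = 16384 would take a very long time, so a check like this is probably only practical at a much smaller size that is still a multiple of `block_factor*TILE`. Filling the inputs with non-uniform values (for example random numbers) rather than a constant makes indexing mistakes much harder to miss.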

`block_factor` and the thread-block dimension `TILE` are defined as:

```
#define block_factor 4
#define TILE 32
```

The matrices are 16384 × 16384 (`size` = 16384), and the grid and block configuration is as follows:

```
dim3 gridDim_1(size/(block_factor*TILE),size/(block_factor*TILE),1);
dim3 blockDim(TILE, TILE, 1);
```
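The numbers implied by this configuration can be tabulated for each `block_factor`. The sketch below is plain host-side arithmetic; `LaunchShape`, `launch_shape` and `print_launch_shapes` are hypothetical names introduced only for illustration. For context, and as far as I understand the compute capability 3.5 limits that apply to the K40, a thread block may have at most 1024 threads and at most 65536 32-bit registers. Whether `regs` actually ends up in registers or in local memory is the compiler's decision, so the accumulator counts are only a rough guide.

```cpp
#include <cstdio>

// Hypothetical summary of one launch configuration.
struct LaunchShape {
    int grid_dim;           // blocks along x (and along y)
    int threads_per_block;  // TILE * TILE
    int accums_per_thread;  // floats in regs[]: block_factor * block_factor
    long accums_per_block;  // accumulators summed over the whole thread block
};

LaunchShape launch_shape(int size, int block_factor, int tile)
{
    LaunchShape s;
    s.grid_dim = size / (block_factor * tile);  // integer division, as in gridDim_1
    s.threads_per_block = tile * tile;
    s.accums_per_thread = block_factor * block_factor;
    s.accums_per_block =
        static_cast<long>(s.threads_per_block) * s.accums_per_thread;
    return s;
}

// Print one line per power-of-two block_factor from 2 to 16.
void print_launch_shapes(int size, int tile)
{
    for (int bf = 2; bf <= 16; bf *= 2) {
        const LaunchShape s = launch_shape(size, bf, tile);
        std::printf("block_factor=%2d grid=%3dx%-3d threads/block=%d "
                    "accums/thread=%3d accums/block=%ld\n",
                    bf, s.grid_dim, s.grid_dim, s.threads_per_block,
                    s.accums_per_thread, s.accums_per_block);
    }
}
```

With `size` = 16384 and `TILE` = 32, `block_factor` = 4 gives a 128 × 128 grid with 16 accumulators per thread, while `block_factor` = 8 gives a 64 × 64 grid with 64 accumulators per thread, which is 65536 accumulators per block before any index or loop variables are counted. Checking the value returned by `cudaGetLastError()` right after the launch would show whether the kernel ran at all for the larger settings.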

I am facing a very unusual problem. I set `block_factor` to a power of 2 (i.e. 2, 4, 8, 16, …). When `block_factor <= 4`, I get a correct result, but when I increase it so that `block_factor > 4`, I get a wrong result. I have no idea why this is happening. I am using a Tesla K40. I would really appreciate it if someone could help me with this issue. Thank you very much for your support.