2D Array variable within CUDA kernel

I am doing blocked matrix multiplication in CUDA without using any shared memory, so I am performing the blocking in global memory. Inside my kernel code I have declared a 2D array for temporary storage of results. The kernel code is as follows:

__global__ void Mat_mul_kernel_blocked(float *mat_a, float *mat_b,
                                       float *mat_res, int size)
{
    int tx = threadIdx.x, ty = threadIdx.y, i, j, k, j1;

    int row = ty*block_factor + blockIdx.y*block_factor*TILE;
    int col = tx*block_factor + blockIdx.x*block_factor*TILE;

    float regs[block_factor*block_factor];
    float temp = 0;

    // zero the per-thread result tile
    for (i = 0; i < block_factor; i++)
        for (j = 0; j < block_factor; j++)
            regs[i*block_factor+j] = 0;

    // accumulate over the k-dimension, one block_factor-wide tile at a time
    for (i = 0; i < size/block_factor; i++)
        for (j = 0; j < block_factor; j++)
            for (j1 = 0; j1 < block_factor; j1++)
            {
                for (k = 0; k < block_factor; k++)
                    temp += mat_a[(row+j)*size + (i*block_factor) + k]
                          * mat_b[(i*block_factor+k)*size + (col+j1)];
                regs[j*block_factor+j1] += temp;
                temp = 0;
            }

    // write the result tile back to global memory
    for (i = 0; i < block_factor; i++)
        for (j = 0; j < block_factor; j++)
            mat_res[(row+i)*size + (col+j)] = regs[i*block_factor+j];
}

I have declared the block_factor and thread block size as

#define block_factor 4
#define TILE 32

My matrix size is 16384, and my grid and block configuration is as follows:

dim3 gridDim_1(size/(block_factor*TILE),size/(block_factor*TILE),1);
dim3 blockDim(TILE, TILE, 1);

I am facing a very unusual problem. I am setting the block_factor to powers of 2 (i.e. 2, 4, 8, 16, …). When my

block_factor <= 4

I am getting a correct result. But when I increase my block_factor beyond that, I am getting a wrong result. I have no clue why this is happening. I am using a Tesla K40. I would really appreciate it if someone could help me out with this issue. Thank you very much for everyone’s support.

What is the value of ‘size’ in your kernel code?

I would hope that it is not 16384, as I fail to convince myself that it can be that figure.


Thank you for your reply. The ‘size’ in the figure is indeed 16384, which is the matrix size in one dimension. Thank you.

As you increase the “block_factor” in this code, it is going to increase the per-thread and aggregate local/stack memory usage. It’s quite possible that as you make it larger, your kernels are failing to launch. Are you doing proper cuda error checking?
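For reference, a minimal error-checking pattern around the launch would look something like this (a sketch; the `CUDA_CHECK` macro name and the device pointer names `d_a`, `d_b`, `d_res` are my own):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if any CUDA call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After the launch:
Mat_mul_kernel_blocked<<<gridDim_1, blockDim>>>(d_a, d_b, d_res, size);
CUDA_CHECK(cudaGetLastError());       // catches launch failures, e.g. too many resources requested
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors that occur during kernel execution
```

A kernel that silently fails to launch leaves the output buffer holding stale data, which can look exactly like a "wrong result".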

Have you run your code with cuda-memcheck?
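Running under cuda-memcheck is a one-liner (the binary name here is assumed; on CUDA 11+ toolkits the tool has been superseded by compute-sanitizer):

```shell
cuda-memcheck ./blocked_matmul     # reports out-of-bounds and misaligned accesses
# on newer toolkits:
compute-sanitizer ./blocked_matmul
```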

You were given the above advice in your cross-posting here:


but you haven’t responded to it.