I have two kernels performing SGEMV for a matrix A (M×N) and a vector B (N×1), producing C (M×1).

2D grid-block:

```
__global__
void version2D(float *A, float *B, float *C, int M, int N, float alpha, float beta){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    float tmp = 0.0f;
    if(i < N && j < M){
        for(int k = 0; k < N; k++){
            tmp += A[j*N + k] * B[k];
        }
        C[j] = alpha * tmp + beta * C[j];
    }
}
```

called by

```
NUMTHREADS = 32;
BLOCKSIZE.x = NUMTHREADS, BLOCKSIZE.y = NUMTHREADS;
GRIDSIZE.x = CEIL_DIV(N, NUMTHREADS), GRIDSIZE.y = CEIL_DIV(M, NUMTHREADS);
printf("Grid : {%d, %d, %d} blocks. Blocks : {%d, %d, %d} threads.\n",
GRIDSIZE.x, GRIDSIZE.y, GRIDSIZE.z, BLOCKSIZE.x, BLOCKSIZE.y, BLOCKSIZE.z);
version2D<<<GRIDSIZE, BLOCKSIZE>>>(A, B, C, M, N, alpha, beta);
```

and another version with a 1D grid-block:

```
__global__
void version1D(float *A, float *B, float *C, int M, int N, float alpha, float beta){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = 0.0f;
    if(i < M){
        for(int k = 0; k < N; k++){
            tmp += A[i*N + k] * B[k];
        }
        C[i] = alpha * tmp + beta * C[i];
    }
}
```

called by

```
NUMTHREADS = 32;
BLOCKSIZE.x = NUMTHREADS;
GRIDSIZE.x = CEIL_DIV(M, NUMTHREADS);
printf("Calling coalesced access 1D kernel\n");
version1D<<<GRIDSIZE, BLOCKSIZE>>>(A, B, C, M, N, alpha, beta);
```

The 2D grid-block version achieves 0.09% of cuBLAS performance, while the 1D grid-block version achieves 33.7%. The 1D version still does not truly coalesce its reads of A, since each thread handles one entire row: at a fixed k, the lanes of a warp read A[i*N + k] with a stride of N elements between them. The reads of B, however, are uniform across the warp (every lane reads the same B[k]), which the hardware can serve as a broadcast. Where else does the performance gain of the 1D version come from?
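For context on what fully coalesced reads of A would require: a common pattern assigns one warp per row, so that consecutive lanes read consecutive elements. A sketch of that variant (my own naming, untested against this benchmark):

```cuda
// Hypothetical warp-per-row SGEMV: lane l of a warp reads
// A[row*N + l], A[row*N + l + 32], ... so each warp-wide load
// touches 32 consecutive floats (coalesced). Assumes blockDim.x
// is a multiple of 32.
__global__
void versionWarpPerRow(const float *A, const float *B, float *C,
                       int M, int N, float alpha, float beta){
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if(row < M){
        float tmp = 0.0f;
        for(int k = lane; k < N; k += 32){
            tmp += A[row*N + k] * B[k];
        }
        // Tree reduction of the 32 partial sums within the warp.
        for(int offset = 16; offset > 0; offset /= 2){
            tmp += __shfl_down_sync(0xffffffff, tmp, offset);
        }
        if(lane == 0){
            C[row] = alpha * tmp + beta * C[row];
        }
    }
}
```

Launched with, e.g., BLOCKSIZE.x = 128 and GRIDSIZE.x = CEIL_DIV(M, 128/32), each lane strides through its row by 32, the warp's loads at each step fall on consecutive addresses, and lane 0 writes the reduced result.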