SGEMV - Coalesced access in 2D grid-block vs 1D grid-block

I have two kernels performing SGEMV for a matrix A (MxN) and a vector B (Nx1), producing C (Mx1).

2D grid-block version:

__global__
void version2D(float *A, float *B, float *C, int M, int N, float alpha, float beta){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    float tmp = 0.;
    if(i < N and j < M){
        for(int k = 0; k < N; k++){
            tmp += A[j*N + k] * B[k];
        }
        C[j] = alpha * tmp + beta * C[j];
    }
}

called by

        NUMTHREADS = 32;                          
        BLOCKSIZE.x = NUMTHREADS, BLOCKSIZE.y = NUMTHREADS;
        GRIDSIZE.x = CEIL_DIV(N, NUMTHREADS), GRIDSIZE.y = CEIL_DIV(M, NUMTHREADS);

        printf("Grid : {%d, %d, %d} blocks. Blocks : {%d, %d, %d} threads.\n",
        GRIDSIZE.x, GRIDSIZE.y, GRIDSIZE.z, BLOCKSIZE.x, BLOCKSIZE.y, BLOCKSIZE.z);
        version2D<<<GRIDSIZE, BLOCKSIZE>>>(A, B, C, M, N, alpha, beta);

and another version with a 1D grid-block:

__global__
void version1D(float *A, float *B, float *C, int M, int N, float alpha, float beta){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = 0.;
    if(i < M){
        for(int k = 0; k < N; k++){
            tmp += A[i*N + k] * B[k];
        }
        C[i] = alpha * tmp + beta * C[i];        
    }
}

called by

        NUMTHREADS = 32;
        BLOCKSIZE.x = NUMTHREADS;
        GRIDSIZE.x = CEIL_DIV(M, NUMTHREADS);
        printf("Calling coalesced access 1D kernel\n");
        version1D<<<GRIDSIZE, BLOCKSIZE>>>(A, B, C, M, N, alpha, beta);

The 2D grid-block version achieves 0.09% of cuBLAS performance, while the 1D grid-block version achieves 33.7%. The 1D version still does not really coalesce its accesses to A, since each thread exclusively handles one entire row of the matrix. Accesses to B, however, are effectively shared in the 1D version: every thread in a warp reads the same B[k] at each iteration. Where else does the performance gain come from for the 1D version?
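
To make that access pattern concrete, something like the following sketch (with N = 1024 chosen purely for illustration) prints which elements one warp of version1D touches at a single loop iteration; the A indices are N apart across the warp while the B index is identical:

#include <cstdio>

__global__ void showWarpAccess(int N){
    int i = threadIdx.x;   // lanes 0..31 of the first warp
    int k = 0;             // a single iteration of the inner loop
    // lane i reads A[i*N + k]: the 32 loads are N floats apart (uncoalesced),
    // while every lane reads the same B[k] (a uniform/broadcast load)
    printf("lane %2d reads A[%d] and B[%d]\n", i, i * N + k, k);
}

int main(){
    showWarpAccess<<<1, 32>>>(1024);   // one warp, N = 1024 for illustration
    cudaDeviceSynchronize();
    return 0;
}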

Your 2D version isn't a sensible use of resources. Apart from the bounds check, the entire kernel design is independent of your i variable. What does that mean? It means that every thread along the x dimension of your threadblock is doing identical work. Since you have 32 threads in the x dimension of each block, you have 32 threads computing the same result for a given row, and each of them tries to write to C[j] in the final step. That is roughly 32x redundant arithmetic and memory traffic, plus conflicting writes to C[j].

Yes, that is going to be inefficient. It’s not a recommended kernel design strategy.
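
If you want the loads from A to actually coalesce, one common approach is to assign one warp per row and combine the partial sums with warp shuffles. This is only a minimal sketch of that idea (it assumes blockDim.x is a multiple of 32 and an architecture that supports __shfl_down_sync), not a tuned replacement for cuBLAS:

__global__
void versionWarpPerRow(const float *A, const float *B, float *C,
                       int M, int N, float alpha, float beta){
    int warpsPerBlock = blockDim.x / 32;
    int row  = blockIdx.x * warpsPerBlock + threadIdx.x / 32;   // one warp per row
    int lane = threadIdx.x % 32;
    if(row < M){
        float tmp = 0.f;
        // consecutive lanes read consecutive elements of A and B -> coalesced loads
        for(int k = lane; k < N; k += 32){
            tmp += A[row*N + k] * B[k];
        }
        // combine the 32 partial sums within the warp
        for(int offset = 16; offset > 0; offset /= 2){
            tmp += __shfl_down_sync(0xffffffff, tmp, offset);
        }
        if(lane == 0){
            C[row] = alpha * tmp + beta * C[row];
        }
    }
}

called by, for example,

        BLOCKSIZE.x = 128;                 // 4 warps per block
        GRIDSIZE.x = CEIL_DIV(M, 4);       // one row per warp
        versionWarpPerRow<<<GRIDSIZE, BLOCKSIZE>>>(A, B, C, M, N, alpha, beta);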