Low Shared Memory Efficiency when all threads in a warp read the same shared memory location

Hi everybody,

I have a problem with one of my kernels. Please don't be surprised by the magic numbers you'll see: the kernel is generated by a script.

#include "gemmgen_aux.h"

// Batched GEMM: each threadIdx.y slice of a block handles one batch element,
// computing a column-major C (56x9) = A (56x9) * B (9x9) product.
__global__ void kernel_gemm(const float * A, const float * B, float * C, unsigned NumElements) {
  if ((threadIdx.y + blockDim.y * blockIdx.x) < NumElements) {
    const float* MatA = &A[(threadIdx.y + blockDim.y * blockIdx.x) * 504];
    const float* GlobMatB = &B[(threadIdx.y + blockDim.y * blockIdx.x) * 81];
    float* MatC = &C[(threadIdx.y + blockDim.y * blockIdx.x) * 504];

    // One 81-float slice of scratch per batch element (blockDim.y == 4).
    __shared__ float Scratch[324];
    float* ShrMatB = &Scratch[threadIdx.y * 81];

    // Stage B into shared memory: 56 threads load 56 + 25 = 81 floats.
    ShrMatB[threadIdx.x] = GlobMatB[threadIdx.x];
    if (threadIdx.x < 25) {
      ShrMatB[threadIdx.x + 56] = GlobMatB[threadIdx.x + 56];
    }
    __syncthreads();

    float Results[9] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    float Value;

    for (int k = 0; k < 9; ++k) {
      Value = MatA[threadIdx.x + 56 * k];

      // All 56 threads of a slice read the same ShrMatB[k + 9 * n] here,
      // which I expect to be served as a broadcast.
      #pragma unroll
      for (int n = 0; n < 9; ++n) {
        Results[n] += Value * ShrMatB[k + 9 * n];
      }
    }

    #pragma unroll
    for (int n = 0; n < 9; ++n) {
      MatC[threadIdx.x + 56 * n] = Results[n];
    }
  }
}

void gemm(const float * A, const float * B, float * C, unsigned NumElements) {
  dim3 Block(56, 4, 1);                       // 4 batch elements per block
  dim3 Grid((NumElements + 4 - 1) / 4, 1, 1); // round up to cover NumElements
  kernel_gemm<<<Grid,Block>>>(A, B, C, NumElements);
  CHECK_ERR;
}

As you can see (in the innermost loop), all threads in a warp access the same shared memory location, so I would expect a broadcast: one shared memory transaction instead of a serialized access caused by a bank conflict. At least, that is what the documentation and some online tutorials say. However, nvvp gives me a "Low Shared Memory Efficiency" warning: "kernel accounting for 100% of compute have low efficiency (5.3% avg)". Could anybody tell me what exactly is wrong with my shared memory access? Is it because I'm using 56 threads in x, so warps are only partially filled?
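To make the partial-warp question concrete: as far as I understand, warps are carved out of the linearized thread index (tid = threadIdx.x + blockDim.x * threadIdx.y), so my (56, 4) block has 224 threads forming exactly 7 warps, and every warp after the first straddles two threadIdx.y values. A tiny host-side sketch of that mapping (`warp_of` is just an illustrative helper):

```c
#include <assert.h>

/* Which warp a (threadIdx.x, threadIdx.y) thread of a (56, 4) block lands in,
   using the linearized-thread-index rule: warp = tid / 32. */
static int warp_of(int x, int y) {
    int tid = x + 56 * y;  /* blockDim.x == 56 */
    return tid / 32;
}
```

So, for example, threads (55, 0) and (0, 1) end up in the same warp but read from different 81-float slices of `Scratch`, which means the lanes of that warp do not all hit one address, and the "broadcast" assumption may not hold warp-wide.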

My hardware is a GeForce GTX 1050, and I have CUDA 10.1 installed.

Thanks in advance!