Matrix multiplication performance is reduced by "Stall Long Scoreboard"


I am doing matrix multiplication using the tiling method. The Warp State Statistics section of the Nsight profiler reports the following:

[Warning] On average, each warp of this kernel spends 6.5 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. This represents about 36.4% of the total average of 17.9 cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality or by changing the cache configuration, and consider moving frequently used data to shared memory

I looked at the source counters, and they show that the copy from global memory to shared memory is the reason for the Stall Long Scoreboard. Can anyone help me copy from global to shared memory faster, or suggest another way to make the matrix multiplication faster? My code is given below.

#define TILE_WIDTH 16

__global__ void matrixMultiplyShared(
      const float * __restrict__ A,
      const float * __restrict__ B,
      float * C, int numARows,
      int numAColumns,
      int numBRows,
      int numBColumns,
      int numCRows,
      int numCColumns)
{
    __shared__ float sharedA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sharedB[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int Row = by*TILE_WIDTH + ty;
    int Col = bx*TILE_WIDTH + tx;

    float Cvalue = 0.0f;
    if (numAColumns != numBRows)
        return;
    // First and last index of the band of A rows handled by this block
    int aBegin = numAColumns * TILE_WIDTH * by;
    int aEnd = aBegin + numAColumns - 1;
    int aStep = TILE_WIDTH;
    // First index of the band of B columns handled by this block
    int bBegin = TILE_WIDTH * bx;
    int bStep = TILE_WIDTH * numBColumns;

    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {
        // Each thread copies one element of A and one of B into shared memory.
        // There is no bounds check here, so the matrix dimensions must be
        // multiples of TILE_WIDTH.
        sharedA[ty][tx] = A[a + numAColumns * ty + tx];
        sharedB[ty][tx] = B[b + numBColumns * ty + tx];
        __syncthreads();   // wait until the whole tile has been loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Cvalue += sharedA[ty][k] * sharedB[k][tx];
        __syncthreads();   // wait before the tile is overwritten by the next load
    }

    if (Row < numCRows && Col < numCColumns) {
        int c = numBColumns * TILE_WIDTH * by + TILE_WIDTH * bx;
        C[c + numBColumns * ty + tx] = Cvalue;
    }
}

I am using a grid of 16, 1, 1 blocks with 16, 16, 1 threads per block.

Global memory accesses take time; there is no way around it. How about using an existing library for matrix multiplication, such as cuBLAS? See "cuBLAS :: CUDA Toolkit Documentation".
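To make the cuBLAS suggestion concrete, here is a minimal sketch of a single-precision multiply through cublasSgemm. The helper name multiplyWithCublas and the host-side std::vector interface are my own assumptions for illustration, not something from the question. The one non-obvious part is that cuBLAS expects column-major storage, while the kernel above uses row-major, so the sketch swaps the operands and computes C^T = B^T * A^T, which leaves a row-major C in memory.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper (not from the thread): C = A * B for row-major host
// matrices, where A is M x K, B is K x N and C is M x N.
// Returns an empty vector if any CUDA or cuBLAS call fails.
std::vector<float> multiplyWithCublas(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      int M, int N, int K)
{
    std::vector<float> C;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cublasHandle_t handle = nullptr;

    bool ok = cudaMalloc(&dA, sizeof(float) * M * K) == cudaSuccess
           && cudaMalloc(&dB, sizeof(float) * K * N) == cudaSuccess
           && cudaMalloc(&dC, sizeof(float) * M * N) == cudaSuccess
           && cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS;

    if (ok) {
        ok = cudaMemcpy(dA, A.data(), sizeof(float) * M * K,
                        cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(dB, B.data(), sizeof(float) * K * N,
                        cudaMemcpyHostToDevice) == cudaSuccess;
    }

    if (ok) {
        const float alpha = 1.0f, beta = 0.0f;
        // cuBLAS is column-major. A row-major M x N buffer has the same memory
        // layout as a column-major N x M matrix (its transpose), so we ask
        // cuBLAS for C^T = B^T * A^T: B is passed first with leading dimension
        // N, then A with leading dimension K, and the result has leading
        // dimension N.
        ok = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         N, M, K,
                         &alpha, dB, N, dA, K,
                         &beta, dC, N) == CUBLAS_STATUS_SUCCESS;
    }

    if (ok) {
        C.resize(static_cast<size_t>(M) * N);
        // This blocking copy also waits for the GEMM to finish.
        if (cudaMemcpy(C.data(), dC, sizeof(float) * M * N,
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            C.clear();
    }

    if (handle) cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

The file needs to be linked against cuBLAS, for example with nvcc and -lcublas. In a real application you would create the handle once and keep the matrices on the device between calls; the allocation and copies are inside the helper only to keep the sketch self-contained.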

If you want to write your own kernel, there are many resources on that topic. In general, a fast matrix multiplication splits the matrices into tiles held in shared memory and, within those, smaller tiles held in registers, so that each value loaded from memory is reused more often. You may find this post interesting: "CUTLASS: Fast Linear Algebra in CUDA C++" (NVIDIA Developer Blog).
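As a rough illustration of the shared memory plus registers idea, here is a sketch in which each thread keeps four output values in registers instead of one. It is not the CUTLASS implementation and I have not profiled it against the kernel in the question. The names matmulRegTiled and multiplyRegTiled, the 32 x 32 tile, and the 4 outputs per thread are all assumptions chosen for the example. Each value read from the B tile is reused for four multiply-adds, and each thread has four independent global loads in flight per tile, which may give the scheduler more work to cover the load latency.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;                            // one block computes a TILE x TILE tile of C
constexpr int ROWS_PER_THREAD = 4;                  // outputs kept in registers per thread
constexpr int BLOCK_ROWS = TILE / ROWS_PER_THREAD;  // blockDim.y

// Sketch only: C (M x N) = A (M x K) * B (K x N), all row-major.
// Launch with blockDim = (TILE, BLOCK_ROWS).
__global__ void matmulRegTiled(const float* __restrict__ A,
                               const float* __restrict__ B,
                               float* __restrict__ C,
                               int M, int N, int K)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    const int tx = threadIdx.x;                     // 0 .. TILE-1
    const int ty = threadIdx.y;                     // 0 .. BLOCK_ROWS-1
    const int col = blockIdx.x * TILE + tx;
    float acc[ROWS_PER_THREAD] = {0.0f};

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Each thread loads 4 elements of A and 4 of B; out-of-range
        // elements are replaced by zero so any matrix size works.
        #pragma unroll
        for (int i = 0; i < ROWS_PER_THREAD; ++i) {
            const int r = ty + i * BLOCK_ROWS;      // row inside the tile
            const int aRow = blockIdx.y * TILE + r;
            const int aCol = t * TILE + tx;
            const int bRow = t * TILE + r;
            sA[r][tx] = (aRow < M && aCol < K) ? A[aRow * K + aCol] : 0.0f;
            sB[r][tx] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        }
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k) {
            const float b = sB[k][tx];              // one shared load, reused 4 times
            #pragma unroll
            for (int i = 0; i < ROWS_PER_THREAD; ++i)
                acc[i] += sA[ty + i * BLOCK_ROWS][k] * b;
        }
        __syncthreads();
    }

    #pragma unroll
    for (int i = 0; i < ROWS_PER_THREAD; ++i) {
        const int row = blockIdx.y * TILE + ty + i * BLOCK_ROWS;
        if (row < M && col < N) C[row * N + col] = acc[i];
    }
}

// Hypothetical host wrapper; returns an empty vector on failure.
std::vector<float> multiplyRegTiled(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int M, int N, int K)
{
    std::vector<float> C;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, sizeof(float) * M * K) == cudaSuccess
           && cudaMalloc(&dB, sizeof(float) * K * N) == cudaSuccess
           && cudaMalloc(&dC, sizeof(float) * M * N) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);
        dim3 block(TILE, BLOCK_ROWS);
        dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
        matmulRegTiled<<<grid, block>>>(dA, dB, dC, M, N, K);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok) {
        C.resize(static_cast<size_t>(M) * N);
        if (cudaMemcpy(C.data(), dC, sizeof(float) * M * N,
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            C.clear();
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Whether this actually reduces the long scoreboard stall depends on the GPU and the matrix sizes, so it is worth profiling both kernels in Nsight before settling on a tile size. Libraries such as CUTLASS and cuBLAS take the same idea much further, with larger register tiles and vectorized loads.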