Trying to understand why Sectors/Req in wmma_example is 8 Sec/Req

I have a kernel whose structure for matrix multiplication is similar to that of wmma_example from code-samples/posts/tensor-cores/simpleTensorCoreGEMM.cu at master · NVIDIA-developer-blog/code-samples · GitHub, and I’m trying to understand why NVIDIA Nsight Compute is complaining about excessive loads.

#include <mma.h>
using namespace nvcuda;

// The only tile dimensions currently supported by WMMA for this case
const int WMMA_M = 16;
const int WMMA_N = 16;
const int WMMA_K = 16;

__global__ void wmma_example(half *a, half *b, float *c, int M, int N, int K, float alpha, float beta) {
   // Leading dimensions. Packed with no transpositions.
   int lda = M;
   int ldb = K;
   int ldc = M;

   // Tile using a 2D grid
   int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
   wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;

   wmma::fill_fragment(acc_frag, 0.0f);

   // Loop over k
   for (int i = 0; i < K; i += WMMA_K) {
      int aRow = warpM * WMMA_M;
      int aCol = i;

      int bRow = i;
      int bCol = warpN * WMMA_N;

      // Bounds checking
      if (aRow < M && aCol < K && bRow < K && bCol < N) {
         // Load the inputs
         wmma::load_matrix_sync(a_frag, a + aRow + aCol * lda, lda);
         wmma::load_matrix_sync(b_frag, b + bRow + bCol * ldb, ldb);

         // Perform the matrix multiplication
         wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

      }
   }

   // Load in the current value of c, scale it by beta, and add our result scaled by alpha
   int cRow = warpM * WMMA_M;
   int cCol = warpN * WMMA_N;

   if (cRow < M && cCol < N) {
      wmma::load_matrix_sync(c_frag, c + cRow + cCol * ldc, ldc, wmma::mem_col_major);

#pragma unroll
      for(int i=0; i < c_frag.num_elements; i++) {
         c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
      }

      // Store the output
      wmma::store_matrix_sync(c + cRow + cCol * ldc, c_frag, ldc, wmma::mem_col_major);
   }
}

More importantly, I want to look at the first loop, where the wmma::load_matrix_sync calls seem to be what is generating the excessive loads. From my understanding, a warp should be bringing 16x16 half-float elements into the register file. And since the warp steps to the immediately consecutive column, the access should be coalesced. So why does Nsight complain about only using 16 out of 32 bytes? Isn’t the memory access coalesced throughout the kernel?

A 16x16 fragment of half quantities occupies 32 bytes by 16 rows. Given the information in the example you linked (overall dimensions divisible by 16), we can be certain that each of these rows aligns to one sector, but the sector corresponding to one row is not adjacent to the sector corresponding to the next row if the overall matrix is larger than 16x16. If we imagine that this is a 16x16 tile selected out of a larger matrix (which is the case in the example you linked), then there is a stride equal to the leading dimension of the full matrix from one row of the tile to the next (i.e. adjacent rows of the tile are not adjacent in memory).
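
To make the arithmetic concrete, here is a small host-side sketch (the leading dimension of 256 is just an assumed value for illustration, not something from your run) that prints where each contiguous 32-byte row of one 16x16 half tile lands in memory, and which sector it maps to:

#include <cstdio>

int main() {
   // Assumed values for illustration only:
   const long lda  = 256;   // leading dimension of the full col-major matrix
   const long elem = 2;     // sizeof(half) in bytes
   const int  tile = 16;    // WMMA tile dimension

   // Each contiguous run ("row" of the fragment) is 16 halves = 32 bytes,
   // exactly one sector. Consecutive runs are lda elements apart.
   for (int r = 0; r < tile; r++) {
      long byteOffset = r * lda * elem;
      printf("fragment row %2d: byte offset %5ld, sector %4ld\n",
             r, byteOffset, byteOffset / 32);
   }
   // With lda = 256 the sectors come out as 0, 16, 32, ...: each row is
   // sector-aligned, but the sectors are not adjacent. A warp-wide request
   // covering 8 such rows touches 8 distinct sectors: 8 sectors/request.
   return 0;
}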

Nsight Compute seems to be saying that there are 8 sectors per request. This presumably means that each request (instruction) is loading 8 rows of the fragment. That works out to 8 × 32 = 256 bytes per request, and since a wavefront services at most 128 bytes, I would generally expect that to be broken into at least 2 wavefronts per request.

None of that seems surprising to me.

The low-level machine behavior here (including load patterns) varies depending on what GPU architecture you compile for. So a precise analysis, i.e. looking at the SASS, cannot be done without knowing what type of GPU you are running on. For example, when I compile for cc8.9, I do not witness 256-byte-wide loads per warp (LDG.64); I see mostly 128-byte-wide loads per warp. If I compile for cc7.0, I see a mix of 256-byte-wide loads (LDG.64) and 512-byte-wide loads (LDG.128). So I’m fairly confident the exact patterns and ratios you see in Nsight Compute will depend on the GPU.
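
If you want to check what your own build generates (a sketch; the sm_89 target and file name are placeholders for your actual GPU and source file), you can dump the SASS like this:

nvcc -arch=sm_89 -cubin -o simpleTensorCoreGEMM.cubin simpleTensorCoreGEMM.cu
cuobjdump -sass simpleTensorCoreGEMM.cubin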

But multiple sectors per request isn’t surprising. The only way to avoid that would be to load only 32 bytes per instruction, which would generally be inefficient on a GPU.

I acknowledge you have mentioned things like “complaining about excessive loads” and “only using 16 out of 32 bytes”, but you haven’t actually shown the output from Nsight Compute that indicates that. It may require more information than what you have shown here, and some Nsight Compute questions might need the experts on the Nsight Compute forum to explain.

I also note that the developers of that sample code took pains to point out that the wmma usage may not be efficient:

  1. They put a print-out in the code stating as much
  2. They compare the performance to cublas, which on my GPU runs ~10x faster than the wmma kernel
  3. They explicitly state that, for performance, you should refer to another sample code.

For people who are interested in linear algebra, I generally don’t recommend writing your own GEMM, even if you are using TC ops. Instead, I would generally recommend cuBLAS or CUTLASS.
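
For reference, here is a minimal sketch of the cuBLAS call corresponding to this kernel’s math (half inputs, float accumulate/output, col-major, C = alpha*A*B + beta*C); the handle, device allocations, and error checking are assumed to be handled elsewhere:

#include <cublas_v2.h>
#include <cuda_fp16.h>

// A is MxK, B is KxN, C is MxN, all column-major as in the sample.
// dA, dB are device half arrays; dC is a device float array.
void gemm_fp16_fp32(cublasHandle_t handle,
                    const half *dA, const half *dB, float *dC,
                    int M, int N, int K, float alpha, float beta)
{
   // FP16 inputs with FP32 compute: eligible for tensor core acceleration.
   cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha,
                dA, CUDA_R_16F, M,   // lda = M
                dB, CUDA_R_16F, K,   // ldb = K
                &beta,
                dC, CUDA_R_32F, M,   // ldc = M
                CUBLAS_COMPUTE_32F,
                CUBLAS_GEMM_DEFAULT);
}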