Why is tensor pipe utilization low?

Hi, I am using CUDA 11.6 and NCU 2022.1.1, and I am looking at the metric “sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active”.

The code is compiled with “nvcc wmma.cu --expt-relaxed-constexpr -gencode=arch=compute_75,code="sm_75,compute_75" -o wmma” and run on an RTX 2080 Ti. The kernel is launched with <<<10000, 256>>>.
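
For reference, I collect the metric with an invocation along these lines (a sketch; the actual command may carry additional options):

  ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active ./wmma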

I got 24.6% when the b_frag load is inside the loop:

#pragma unroll
  for (int i = 0; i < 200; i++) {
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
  }

I got 50% when the load is hoisted out of the loop:

  wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
  wmma::load_matrix_sync(b_frag, B, 16);
#pragma unroll
  for (int i = 0; i < 200; i++) {
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
  }