Hi, I am using CUDA 11.6 and Nsight Compute (NCU) 2022.1.1.
I am looking at the metric "sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active".
I compiled with "nvcc wmma.cu --expt-relaxed-constexpr -gencode=arch=compute_75,code="sm_75,compute_75" -o wmma" and ran it on a 2080 Ti. The launch configuration is <<<10000, 256>>>.
I got 24.6% when the B fragment is loaded inside the loop:
    #pragma unroll
    for (int i = 0; i < 200; i++) {
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
I got 50% when the B fragment is loaded once, before the loop:
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::load_matrix_sync(b_frag, B, 16);
    #pragma unroll
    for (int i = 0; i < 200; i++) {
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
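
For anyone who wants to reproduce this, here is a minimal sketch of how the two loops might sit inside complete kernels. The snippets above do not show how a_frag, acc_frag, A, and B are set up, so everything outside the loops is an assumption and not the original wmma.cu: the kernel names, the row-major A fragment, the float accumulator, the zero-filled inputs, and the per-warp output tile are all hypothetical choices made only so that the code compiles and the result is stored.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
#include <cstdio>

using namespace nvcuda;

// Assumed fragment types; the post only shows the matrix_b fragment.
using AFrag   = wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major>;
using BFrag   = wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major>;
using AccFrag = wmma::fragment<wmma::accumulator, 16, 16, 16, float>;

// Each warp writes its own 16x16 tile so the stores do not race.
__device__ float* warp_tile(float* C) {
    size_t warp = (size_t(blockIdx.x) * blockDim.x + threadIdx.x) / 32;
    return C + warp * 256;
}

// Variant 1 (24.6% in the post): B fragment loaded in every iteration.
__global__ void mma_reload_b(const __half* A, const __half* B, float* C) {
    AFrag a_frag;
    AccFrag acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);
    #pragma unroll
    for (int i = 0; i < 200; i++) {
        BFrag b_frag;
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(warp_tile(C), acc_frag, 16, wmma::mem_row_major);
}

// Variant 2 (50% in the post): B fragment loaded once, before the loop.
__global__ void mma_hoisted_b(const __half* A, const __half* B, float* C) {
    AFrag a_frag;
    BFrag b_frag;
    AccFrag acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    #pragma unroll
    for (int i = 0; i < 200; i++) {
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(warp_tile(C), acc_frag, 16, wmma::mem_row_major);
}

int main() {
    const int blocks = 10000, threads = 256;           // launch configuration from the post
    const size_t warps = size_t(blocks) * threads / 32;

    __half *A, *B;
    float* C;
    cudaMalloc(&A, 256 * sizeof(__half));
    cudaMalloc(&B, 256 * sizeof(__half));
    cudaMalloc(&C, warps * 256 * sizeof(float));
    cudaMemset(A, 0, 256 * sizeof(__half));            // zero inputs: values do not matter here
    cudaMemset(B, 0, 256 * sizeof(__half));

    mma_reload_b<<<blocks, threads>>>(A, B, C);
    mma_hoisted_b<<<blocks, threads>>>(A, B, C);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return err == cudaSuccess ? 0 : 1;
}
```

Built with the nvcc command above, the two kernels can then be profiled separately, for example with "ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active ./wmma". The percentages may differ from the ones reported here if the original kernel used different fragment types or input data.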