How to use WMMA efficiently

Hello,
I am attempting to use the tensor cores efficiently in a custom DL inference kernel, but I am getting very poor performance. In the hope that someone here can help me understand what I am doing wrong, I will post a small repro case.

__global__ void test_wmma( __half* d_A, __half* d_B, __half* d_C )
{
	// Note: d_B is never used here; B is read uninitialized from shared memory.
	__shared__ __half B[16*16];
	wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
	wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b_frag;
	wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;
	wmma::fill_fragment( acc_frag, 0.0f );
	wmma::load_matrix_sync( a_frag, d_A, 16 );
	wmma::load_matrix_sync( b_frag, B, 16 );
	wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
	wmma::store_matrix_sync( d_C, acc_frag, 16, wmma::mem_row_major );
}

void TestWMMA()
{
	vector<dtype> A( 16 * 16 );
	vector<dtype> B( 16 * 16 );
	vector<dtype> C( 16 * 16 );
	for ( auto& a : A )	a = float( rand() ) / RAND_MAX;
	for ( auto& b : B )	b = float( rand() ) / RAND_MAX;
	dtype *d_A, *d_B, *d_C;
	cudaMalloc( &d_A, 16 * 16 * sizeof( dtype ) );
	cudaMalloc( &d_B, 16 * 16 * sizeof( dtype ) );
	cudaMalloc( &d_C, 16 * 16 * sizeof( dtype ) );
	cudaMemcpy( d_A, A.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
	cudaMemcpy( d_B, B.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
	cudaMemcpy( d_C, C.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
	int threads = 256;
	int blocks = 10000;
	test_wmma<<<blocks, threads>>>( d_A, d_B, d_C );
}

So the kernel simply multiplies two matrices, one from global memory and the other from shared memory, and stores the result in shared memory. (For simpler timing, no results are written out, but I have verified that the SASS code is correct and that nothing is optimized away.)

I would have expected this to run very quickly, as the global memory access is always cached (which I have verified in Nsight: 99.98% L1 hit rate). However, when profiling the code in Nsight, it takes 145 µs (very approximately 2 TFLOPS, if I didn’t mess up my calculations).

Looking at the “Scheduler Statistics” in Nsight, I find that 6 warps are active, but only 0.45 are eligible and 0.18 are issued. The largest stall reason by far is “Stall LG Throttle” (~18 cycles/instruction), and the source view shows that it stalls mainly on the “load_matrix_sync(…d_A…)” instruction. If I switch so that the A matrix is also in shared memory, it stalls in the same place but reports “MIO Throttle” as the main offender. By choosing different combinations of row/col major for the matrices, I can shuffle the stall reasons around a bit, but I cannot get issued warps over 0.2.

What is wrong here? How do I come anywhere near the 100 TFLOPS (or whatever) that cuBLAS manages? Any tips or thoughts are very welcome!

(This is on an RTX 2070 (sm 7.5) mobile card in a Razer Blade 15.)

The construction of your kernel represents a fairly naive usage of the WMMA functionality. In particular, your paradigm is load-compute-unload. It does not matter how much of this you do; it will not be efficient. For much of your kernel’s execution time, the tensor core units across the device are idle. To get anything approaching full rated performance, you need to keep these units continuously busy with multiple “back-to-back” operations. That means overlapping the loading/unloading operations with compute operations, at the kernel level.
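To make the idea concrete, here is a minimal sketch of that pattern, accumulating several back-to-back mma_sync operations into one accumulator so the loads of one iteration can overlap the math of the previous one. This is not the poster's kernel; the tile pointers, leading dimensions, and the `k_tiles` loop bound are placeholder assumptions:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch only: one accumulator, many mma ops per store. The per-iteration
// loads can overlap the previous iteration's tensor core math, so the units
// stay fed instead of idling in a load -> one mma -> store sequence.
__global__ void back_to_back( const __half* A, const __half* B, float* C,
                              int lda, int ldb, int ldc, int k_tiles )
{
	wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
	wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
	wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
	wmma::fill_fragment( acc, 0.0f );

	// Walk along the K dimension, accumulating into the same fragment.
	for ( int k = 0; k < k_tiles; ++k )
	{
		wmma::load_matrix_sync( a_frag, A + k * 16, lda );
		wmma::load_matrix_sync( b_frag, B + k * 16, ldb );
		wmma::mma_sync( acc, a_frag, b_frag, acc );
	}

	// One store amortized over k_tiles mma operations.
	wmma::store_matrix_sync( C, acc, ldc, wmma::mem_row_major );
}
```

Real high-throughput kernels go further (double-buffered shared memory staging, software pipelining), which is what the CUTLASS material below covers.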

You can get a sense of why this is necessary by reading the CUTLASS documentation;

scroll down to the “Loop Unrolling” section and read it.

The necessary programming techniques here are pretty involved, so the recommendation would be to use CUTLASS if possible. Not only does it provide a lot of flexible building blocks, it probably also has the best documentation on how to do this:

Thank you for your reply! I realize I oversimplified my test case a bit, and unintentionally left a global memory write at the end.

In my case, CUTLASS does not help much, as I actually need to do standalone 16x16 matrix multiplications (not as part of a larger GEMM) within my kernel. I have gone through the documentation and code, but I fail to see the important difference in their approach.

I would still love some insight into why the code below performs so poorly (the issue rate is still at 0.15). Here, I only load a single matrix from shared memory in each iteration, but my threads still stall almost entirely on MIO Throttle and Short Scoreboard.

__global__ void test_wmma()
{
	// Warps within a block read from 256 byte aligned strided addresses to avoid
	// bank conflicts (makes no difference).
	__shared__ __half smem[1024 * 8];
	__half* A = smem + threadIdx.y * 1024 + threadIdx.y * 16;
	__half* B = smem + threadIdx.y * 1024 + threadIdx.y * 16 + 256;
	__half* C = smem + threadIdx.y * 1024 + threadIdx.y * 16 + 512;

	// Matrix A is read once, and accumulator is filled once. 
	wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
	wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;
	wmma::fill_fragment( acc_frag, 0.0f );
	wmma::load_matrix_sync( a_frag, A, 16 );

#pragma unroll
	for ( int i = 0; i < 20; i++ )
	{
		wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
		wmma::load_matrix_sync( b_frag, B, 16 );
		wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
	}

	wmma::store_matrix_sync( C, acc_frag, 16, wmma::mem_col_major );
}

Very grateful for any help.

When I run the code you have shown now, on a Tesla V100:

$ cat t18.cu
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;
using namespace std;
__global__ void test_wmma()
{
        // Warps within a block read from 256 byte aligned strided adresses to avoid
        // bank conflicts (makes no difference).
        __shared__ __half smem[1024 * 8];
        __half* A = smem + threadIdx.y * 1024 + threadIdx.y * 16;
        __half* B = smem + threadIdx.y * 1024 + threadIdx.y * 16 + 256;
        __half* C = smem + threadIdx.y * 1024 + threadIdx.y * 16 + 512;

        // Matrix A is read once, and accumulator is filled once.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;
        wmma::fill_fragment( acc_frag, 0.0f );
        wmma::load_matrix_sync( a_frag, A, 16 );

#pragma unroll
        for ( int i = 0; i < 20; i++ )
        {
                wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
                wmma::load_matrix_sync( b_frag, B, 16 );
                wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
        }

        wmma::store_matrix_sync( C, acc_frag, 16, wmma::mem_col_major );
}
void TestWMMA()
{
        int threads = 256;
        int blocks = 10000;
        test_wmma<<<blocks, threads>>>();
}
int main(){

        TestWMMA();
        cudaDeviceSynchronize();
        TestWMMA();
        cudaDeviceSynchronize();
}
$ nvcc -o t18 t18.cu -arch=sm_70
$ cuda-memcheck ./t18
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t18
==49596== NVPROF is profiling process 49596, command: ./t18
==49596== Profiling application: ./t18
==49596== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  264.31us         2  132.16us  131.87us  132.44us  test_wmma(void)
      API calls:   95.91%  438.94ms         2  219.47ms  14.358us  438.92ms  cudaLaunchKernel
                    2.13%  9.7555ms         8  1.2194ms  1.1207ms  1.3072ms  cuDeviceTotalMem
                    1.73%  7.8957ms       808  9.7710us     263ns  498.41us  cuDeviceGetAttribute
                    0.16%  733.96us         8  91.744us  86.852us  114.31us  cuDeviceGetName
                    0.06%  266.37us         2  133.18us  131.21us  135.16us  cudaDeviceSynchronize
                    0.01%  30.125us         8  3.7650us  1.8050us  10.934us  cuDeviceGetPCIBusId
                    0.00%  8.4330us        16     527ns     361ns  1.0820us  cuDeviceGet
                    0.00%  4.5090us         8     563ns     396ns     808ns  cuDeviceGetUuid
                    0.00%  2.8980us         3     966ns     404ns  2.0260us  cuDeviceGetCount
$

I get a kernel execution time of ~130 µs.

If I do a flops/s calculation based on that, I believe it should be:

16 × 16 × 16 × 2 (FLOPs per mma) × 20 (iterations) × 8 (warps/block) × 10000 (blocks) / 0.000130 s ≈ 1.008 × 10^14, which is ~100 TFLOPS. That looks pretty good to me.

Thank you very much for checking this. When running the code in isolation, I too get ~100 TFLOPS on a V100, and about 18 TFLOPS on my mobile RTX 2070. I still get a very low issue rate, but I guess that makes sense because the tensor ops now dominate the executed instructions.

When running the exact same code in my application, I get 1 TFLOPS, so I still have a problem, but at least I don’t feel like I’m going insane anymore :).