Hello,
I am attempting to use the tensor cores efficiently in a custom DL inference kernel, but I am getting very poor performance. In the hope that someone here can help me understand what I am doing wrong, here is a small repro case.
#include <mma.h>
#include <cuda_fp16.h>
#include <vector>
#include <cstdlib>
using namespace nvcuda;
using dtype = __half;   // element type used on the host side

__global__ void test_wmma( __half* d_A, __half* d_B, __half* d_C )
{
    // Tile for B in shared memory (its contents are irrelevant for this timing test).
    __shared__ __half B[16 * 16];
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;
    wmma::fill_fragment( acc_frag, 0.0f );
    wmma::load_matrix_sync( a_frag, d_A, 16 );  // A is read from global memory
    wmma::load_matrix_sync( b_frag, B, 16 );    // B is read from shared memory
    wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
    wmma::store_matrix_sync( d_C, acc_frag, 16, wmma::mem_row_major );
}
void TestWMMA()
{
    std::vector<dtype> A( 16 * 16 );
    std::vector<dtype> B( 16 * 16 );
    std::vector<dtype> C( 16 * 16 );
    for ( auto& a : A ) a = __float2half( float( rand() ) / RAND_MAX );
    for ( auto& b : B ) b = __float2half( float( rand() ) / RAND_MAX );
    dtype *d_A, *d_B, *d_C;
    cudaMalloc( &d_A, 16 * 16 * sizeof( dtype ) );
    cudaMalloc( &d_B, 16 * 16 * sizeof( dtype ) );
    cudaMalloc( &d_C, 16 * 16 * sizeof( dtype ) );
    cudaMemcpy( d_A, A.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, B.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
    cudaMemcpy( d_C, C.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
    int threads = 256;      // 8 warps per block, each issuing one 16x16x16 MMA
    int blocks  = 10000;
    test_wmma<<<blocks, threads>>>( d_A, d_B, d_C );
    cudaDeviceSynchronize();  // make sure the kernel has finished before returning
    cudaFree( d_A ); cudaFree( d_B ); cudaFree( d_C );
}
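The timing below comes from Nsight, but if you want to reproduce it without the profiler, bracketing the launch with CUDA events should give roughly the same number. A minimal sketch (replace the plain launch above with this; printf needs <cstdio>):
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );
cudaEventRecord( start );
test_wmma<<<blocks, threads>>>( d_A, d_B, d_C );
cudaEventRecord( stop );
cudaEventSynchronize( stop );
float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );   // elapsed time in milliseconds
printf( "kernel time: %.3f ms\n", ms );
cudaEventDestroy( start );
cudaEventDestroy( stop );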
So the kernel simply multiplies two matrices, one read from global memory and the other from shared memory, and stores the result back to global memory. (For simpler timing the results are never read back or checked, but I have verified that the SASS code is correct and nothing is optimized away.)
I would have expected this to run very quickly, as the global memory access is always cached (which I have verified in Nsight: 99.98% L1 hit rate). However, when profiling the code in Nsight, the kernel takes 145 us (very approximately 2 TFLOP/s, if I didn't mess up my calculations).
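My arithmetic, in case I did mess it up (assuming all 8 warps in each of the 10000 blocks issue one 16x16x16 MMA, and counting a fused multiply-add as one op):
10000 blocks * 8 warps * 16*16*16 MACs  ~= 3.3e8 MACs
3.3e8 MACs / 145 us                     ~= 2.3e12 MACs/s, i.e. roughly 2 "TFlops"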
Looking at the "Scheduler Statistics" in Nsight, I find that 6 warps are active, but only 0.45 are eligible and 0.18 are issued. The largest stall reason by far is "Stall LG Throttle" (~18 cycles/instruction), and looking at the source view I find that it stalls mainly on the "load_matrix_sync(...d_A...)" instruction. If I switch so that the A matrix is also in shared memory (see the sketch below), it stalls in the same place, but reports "MIO Throttle" as the main offender. By choosing different combinations of row/col major for the matrices, I can shift the stall reasons around a bit, but I cannot get issued warps above 0.2.
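For clarity, the shared-A variant I mean is essentially the same kernel with only the source of a_frag changed (a sketch; the kernel name is just for illustration):
__global__ void test_wmma_shared_A( __half* d_A, __half* d_B, __half* d_C )
{
    __shared__ __half A[16 * 16];   // A now also lives in shared memory
    __shared__ __half B[16 * 16];
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;
    wmma::fill_fragment( acc_frag, 0.0f );
    wmma::load_matrix_sync( a_frag, A, 16 );    // shared instead of global
    wmma::load_matrix_sync( b_frag, B, 16 );
    wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
    wmma::store_matrix_sync( d_C, acc_frag, 16, wmma::mem_row_major );
}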
What is wrong here? How do I come anywhere near the ~100 TFLOP/s (or whatever it is) that cuBLAS manages? Any tips or thoughts are very welcome!
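For reference, the cuBLAS baseline I have in mind is just an ordinary half-precision GEMM with tensor-op math enabled, roughly like this (a sketch; the matrix sizes and pointers are placeholders, not code from my project):
#include <cublas_v2.h>
#include <cuda_fp16.h>

void cublas_reference( const __half* d_A, const __half* d_B, __half* d_C,
                       int M, int N, int K )
{
    cublasHandle_t handle;
    cublasCreate( &handle );
    cublasSetMathMode( handle, CUBLAS_TENSOR_OP_MATH );   // allow tensor cores
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMM: C = alpha * A * B + beta * C, fp16 in/out, fp32 accumulate
    cublasGemmEx( handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  M, N, K,
                  &alpha,
                  d_A, CUDA_R_16F, M,
                  d_B, CUDA_R_16F, K,
                  &beta,
                  d_C, CUDA_R_16F, M,
                  CUDA_R_32F,
                  CUBLAS_GEMM_DEFAULT_TENSOR_OP );
    cublasDestroy( handle );
}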
(This is on an RTX 2070 mobile card (sm 7.5) in a Razer Blade 15.)