I am using Tensor Cores in CUDA through the WMMA API. I load tiles from memory into fragments and then multiply-accumulate them:
wmma::load_matrix_sync(a_frag, a + a_col + a_row * K, K);
wmma::load_matrix_sync(b_frag, b + b_col + b_row * N, N);
wmma::mma_sync(ab_frag, a_frag, b_frag, ab_frag);
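
For context, here is a minimal sketch of how these calls might fit into a complete program. The 16x16x16 tile shape, the half inputs with float accumulation, the row-major layouts, the kernel name wmma_tile, and the host-side setup are all assumptions for illustration, not my exact code. It shows that the accumulator fragment is declared like a local variable, which the WMMA documentation describes as distributed across all threads of the warp, and that it only reaches addressable memory through store_matrix_sync.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Assumed tile shape: 16x16x16, half inputs, float accumulator.
constexpr int WM = 16, WN = 16, WK = 16;

// One warp per block computes one 16x16 tile of C = A * B.
// A is MxK, B is KxN, C is MxN, all row-major (assumed layouts).
__global__ void wmma_tile(const half* a, const half* b, float* c, int N, int K) {
    wmma::fragment<wmma::matrix_a, WM, WN, WK, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WM, WN, WK, half, wmma::row_major> b_frag;
    // The accumulator is a per-thread object; each thread of the warp
    // holds ab_frag.num_elements values of the tile.
    wmma::fragment<wmma::accumulator, WM, WN, WK, float> ab_frag;
    wmma::fill_fragment(ab_frag, 0.0f);

    const int c_row = blockIdx.y * WM;  // first row of this warp's output tile
    const int c_col = blockIdx.x * WN;  // first column of this warp's output tile

    for (int k = 0; k < K; k += WK) {
        wmma::load_matrix_sync(a_frag, a + k + c_row * K, K);
        wmma::load_matrix_sync(b_frag, b + c_col + k * N, N);
        wmma::mma_sync(ab_frag, a_frag, b_frag, ab_frag);
    }

    // The accumulated tile is written to global memory only at this point.
    wmma::store_matrix_sync(c + c_col + c_row * N, ab_frag, N, wmma::mem_row_major);
}

int main() {
    const int M = 16, N = 16, K = 16;  // sizes must be multiples of the tile shape
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, M * K * sizeof(half));
    cudaMallocManaged(&b, K * N * sizeof(half));
    cudaMallocManaged(&c, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) a[i] = __float2half(1.0f);
    for (int i = 0; i < K * N; ++i) b[i] = __float2half(1.0f);

    dim3 grid(N / WN, M / WM);
    wmma_tile<<<grid, 32>>>(a, b, c, N, K);  // exactly one warp per block
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With all-ones inputs, every element of C should equal K.
    std::printf("c[0] = %f\n", c[0]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

On the RTX 3080 this should build with something like nvcc -arch=sm_86 wmma_tile.cu, where the file name is hypothetical.
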
This is my first time using Tensor Cores, and I would like to know where the tiled matrix is accumulated. I can’t tell which memory holds the accumulator tile (ab_frag above) that the Tensor Cores work on. Also, how can I profile the memory used by a wmma::accumulator fragment?
Environment:
OS: Ubuntu 18.04
GPU: RTX 3080
Language: C++
CUDA version: 12.0