Hello,
I want to explore Tensor Cores and use them to multiply some matrices generated in a kernel. I came across this passage in the docs:
C++ warp matrix operations leverage Tensor Cores to accelerate matrix problems of the form D=AxB+C. These operations are supported on mixed-precision floating point data for devices of compute capability 7.0 or higher. This requires cooperation from all threads in a warp. In addition, these operations are allowed in conditional code only if the condition evaluates identically across the entire warp, otherwise the code execution is likely to hang.
This is probably very basic but I am still confused. What exactly does it mean? When I define the matrices for the multiplication in a CUDA kernel:
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> matA;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> matB;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> matAcc;
wmma::fill_fragment(matAcc, 0.0f);
//fill matA and matB with data
...
wmma::mma_sync(matAcc, matA, matB, matAcc);
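For context, here is the only complete pattern I could piece together from the docs, as a minimal sketch rather than something I know to be the intended usage. It assumes one warp (32 threads) handles one 16x16x16 tile, the inputs live in global memory and are loaded with wmma::load_matrix_sync using a leading dimension of 16, and the code is compiled for compute capability 7.0 or higher (for example with nvcc -arch=sm_70). The kernel name wmma_tile and the all-ones test data are placeholders of my own.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: warp 0 of the block computes one 16x16 tile D = A * B.
// a is read as column-major, b as row-major, d is written row-major.
__global__ void wmma_tile(const half* a, const half* b, float* d) {
    // The condition depends only on the warp index, so it evaluates
    // identically for all 32 threads of a warp (the requirement quoted above).
    int warpId = threadIdx.x / warpSize;
    if (warpId != 0) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> matA;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> matB;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> matAcc;

    wmma::fill_fragment(matAcc, 0.0f);

    // Every thread of the warp makes the same calls with the same pointers.
    wmma::load_matrix_sync(matA, a, 16);
    wmma::load_matrix_sync(matB, b, 16);
    wmma::mma_sync(matAcc, matA, matB, matAcc);
    wmma::store_matrix_sync(d, matAcc, 16, wmma::mem_row_major);
}

int main() {
    const int n = 16 * 16;
    half* a = nullptr;
    half* b = nullptr;
    float* d = nullptr;
    cudaMallocManaged(&a, n * sizeof(half));
    cudaMallocManaged(&b, n * sizeof(half));
    cudaMallocManaged(&d, n * sizeof(float));

    // All-ones inputs: mathematically every element of A * B is then 16.
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
        d[i] = 0.0f;
    }

    // Two warps are launched; only warp 0 passes the warp-uniform guard.
    wmma_tile<<<1, 64>>>(a, b, d);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("d[0] = %f, d[%d] = %f\n", d[0], n - 1, d[n - 1]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(d);
    return 0;
}
```

In this sketch no thread fills a matrix on its own: all threads of the warp issue the load, multiply, and store calls together, which is what prompts the questions below.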
My main questions are:

Are these matrices shared across the whole warp, similarly to the principles of warp-level primitives?

Where are the data of the matrices stored anyway? Shared memory? Thread registers (would be too many)?

I wanted to do this simple thing: each thread fills the matrices and then calls the multiplication. Will I get a unique result for each thread? Is this good practice? From what I understood, the multiplications will happen sequentially within the warp, so in 32 steps. Is that right?
Please excuse my lack of understanding, I haven't found many sources or examples that would clearly answer these questions. Thank you for your time!