Hello, I read that each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A * B + C.
I have a use case where this seems perfect, but I need to multiply thousands of 4 by 4 matrices and accumulate the results.
Yet I see that CUDA seems to support only 16 by 16 matrix fragments. Can I use Tensor Cores for 4 by 4 matrices in some way other than simply adding padding?
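
To make the use case concrete, here is a rough CPU reference of the computation I would like to offload. It is only a sketch: the names `Mat4`, `mma4x4` and `accumulate_products` are made up for illustration, and I am assuming row-major `float` storage and a single running accumulator over all the products.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// A 4x4 matrix stored row-major (assumed layout, for illustration only).
using Mat4 = std::array<float, 16>;

// Computes D = A * B + C for a single 4x4 problem.
Mat4 mma4x4(const Mat4& a, const Mat4& b, const Mat4& c) {
    Mat4 d = c;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                d[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
    return d;
}

// Accumulates the sum of as[n] * bs[n] over the whole batch
// (assumption: one shared accumulator for all products).
Mat4 accumulate_products(const std::vector<Mat4>& as,
                         const std::vector<Mat4>& bs) {
    Mat4 acc{};  // zero-initialized accumulator
    const std::size_t n = as.size() < bs.size() ? as.size() : bs.size();
    for (std::size_t i = 0; i < n; ++i)
        acc = mma4x4(as[i], bs[i], acc);
    return acc;
}
```

Each call to `mma4x4` is exactly one D = A * B + C step, so in principle it maps onto the 4x4x4 operation described above; the batch has thousands of such steps.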