Suppose we divide a matrix multiplication into two parts and create two CUDA streams, one running its part on Tensor Cores and the other running its part on CUDA Cores. Can Tensor Cores and CUDA Cores compute simultaneously? If so, how can we prove that they are computing simultaneously?
I have read this paper, and it achieves simultaneous computation through hardware. However, is there any way to get the same effect without changing the hardware?
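For reference, here is a minimal sketch of the setup I have in mind, not a definitive implementation. The names `gemm_wmma`, `gemm_simt`, `run`, and the size `N` are hypothetical and chosen only for illustration. It assumes a GPU with Tensor Cores (compute capability 7.0 or newer, built with something like `nvcc -arch=sm_70`) and that `N` is a multiple of 16. One kernel computes the top half of C with WMMA, the other computes the bottom half with ordinary FP32 multiply-adds, and the host times both kernels on one stream versus on two streams.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
using namespace nvcuda;

constexpr int N = 1024;          // square matrices; assumed to be a multiple of 16
constexpr int HALF_ROWS = N / 2; // each kernel computes half of the rows of C

// Tensor Core path: each block is one warp and computes one 16x16 tile of C.
__global__ void gemm_wmma(const half* A, const half* B, float* C) {
    int tileRow = blockIdx.y * 16, tileCol = blockIdx.x * 16;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    for (int k = 0; k < N; k += 16) {
        wmma::load_matrix_sync(a, A + tileRow * N + k, N);
        wmma::load_matrix_sync(b, B + k * N + tileCol, N);
        wmma::mma_sync(acc, a, b, acc);
    }
    wmma::store_matrix_sync(C + tileRow * N + tileCol, acc, N, wmma::mem_row_major);
}

// CUDA Core path: each thread computes one element of C with FP32 multiply-adds.
__global__ void gemm_simt(const half* A, const half* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += __half2float(A[row * N + k]) * __half2float(B[k * N + col]);
    C[row * N + col] = sum;
}

// Launch both halves and return the wall-clock time in milliseconds.
// Passing the same stream twice forces the two kernels to run back to back.
double run(cudaStream_t sTensor, cudaStream_t sSimt,
           const half* A, const half* B, float* C) {
    dim3 gridT(N / 16, HALF_ROWS / 16), blockT(32);
    dim3 gridS(N / 16, HALF_ROWS / 16), blockS(16, 16);
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    gemm_wmma<<<gridT, blockT, 0, sTensor>>>(A, B, C);  // top half of C
    gemm_simt<<<gridS, blockS, 0, sSimt>>>(A + HALF_ROWS * N, B,
                                           C + HALF_ROWS * N);  // bottom half
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    size_t n = size_t(N) * N;
    std::vector<half> hA(n, __float2half(1.0f)), hB(n, __float2half(1.0f));
    half *A, *B;
    float* C;
    cudaMalloc(&A, n * sizeof(half));
    cudaMalloc(&B, n * sizeof(half));
    cudaMalloc(&C, n * sizeof(float));
    cudaMemcpy(A, hA.data(), n * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(B, hB.data(), n * sizeof(half), cudaMemcpyHostToDevice);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    run(s0, s1, A, B, C);                       // warm-up, result discarded
    double serial = run(s0, s0, A, B, C);       // one stream: no overlap possible
    double concurrent = run(s0, s1, A, B, C);   // two streams: overlap allowed

    std::vector<float> hC(n);
    cudaMemcpy(hC.data(), C, n * sizeof(float), cudaMemcpyDeviceToHost);
    // With all-ones inputs, every element of C should equal N.
    printf("C[0] = %.0f, C[last] = %.0f (expected %d)\n", hC[0], hC[n - 1], N);
    printf("one stream: %.3f ms, two streams: %.3f ms\n", serial, concurrent);
    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
```

A shorter two-stream time would be consistent with the kernels overlapping, but as far as I can tell it would not by itself prove that Tensor Cores and CUDA Cores are busy in the same cycles on the same SM. Whether the kernels overlap at all also depends on how many SM resources each launch occupies, so this timing comparison is only a first check, and a profiler timeline would be needed to look further.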