Overlapping CUDA Cores and Tensor Cores

It is possible to interleave CUDA Core (ALU/FMA) instructions with Tensor Core (MMA) instructions within a single warp. However, it is usually easier to split the work across warps on each SM sub-partition (warp scheduler): one matrix-multiply warp issues the Tensor Core instructions while the other warps issue CUDA Core instructions. A single warp per sub-partition can be designed to reach 100% of Tensor Core SOL (speed of light, i.e. peak throughput). See the CUTLASS documentation on warp specialization to understand this design pattern.
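
As a rough illustration of the warp-split approach, here is a minimal sketch using the WMMA API from `mma.h`. It is not the CUTLASS implementation and is not tuned to reach peak Tensor Core throughput. The names `warp_specialized`, `run_demo`, `kMmaWarps`, and `kWarpsPerBlock` are made up for this example. It also assumes that consecutive warp IDs are spread across the four sub-partitions, so that warps 0-3 give one MMA warp per scheduler; that mapping is an assumption, not something the API guarantees. Requires compute capability 7.0 or later (for example `nvcc -arch=sm_70`).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

constexpr int kTile = 16;          // WMMA tile is 16x16x16
constexpr int kMmaWarps = 4;       // assumed: one MMA warp per sub-partition
constexpr int kWarpsPerBlock = 8;  // the remaining 4 warps do CUDA Core work

__global__ void warp_specialized(const half* a, const half* b, float* c,
                                 float* v, int n, int iters)
{
    const int warp_id = threadIdx.x / warpSize;
    const int lane = threadIdx.x % warpSize;

    if (warp_id < kMmaWarps) {
        // Tensor Core role: this warp only issues MMA instructions in its loop.
        wmma::fragment<wmma::matrix_a, kTile, kTile, kTile, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, kTile, kTile, kTile, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, kTile, kTile, kTile, float> acc;
        wmma::fill_fragment(acc, 0.0f);
        wmma::load_matrix_sync(fa, a, kTile);
        wmma::load_matrix_sync(fb, b, kTile);
        for (int i = 0; i < iters; ++i) {
            wmma::mma_sync(acc, fa, fb, acc);  // acc += A * B
        }
        wmma::store_matrix_sync(c + warp_id * kTile * kTile, acc, kTile,
                                wmma::mem_row_major);
    } else {
        // CUDA Core role: these warps only issue FMA instructions in their loop.
        const int t = (warp_id - kMmaWarps) * warpSize + lane;
        const int stride = (kWarpsPerBlock - kMmaWarps) * warpSize;
        for (int i = t; i < n; i += stride) {
            float x = v[i];
            for (int k = 0; k < iters; ++k) {
                x = fmaf(x, 0.5f, 1.0f);
            }
            v[i] = x;
        }
    }
}

// Hypothetical host helper: launches one block and checks both roles' results.
bool run_demo(int iters)
{
    const int n = 1024;
    const int tile_elems = kTile * kTile;
    half *a = nullptr, *b = nullptr;
    float *c = nullptr, *v = nullptr;
    cudaMallocManaged(&a, tile_elems * sizeof(half));
    cudaMallocManaged(&b, tile_elems * sizeof(half));
    cudaMallocManaged(&c, kMmaWarps * tile_elems * sizeof(float));
    cudaMallocManaged(&v, n * sizeof(float));
    if (!a || !b || !c || !v) return false;

    for (int i = 0; i < tile_elems; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
    }
    for (int i = 0; i < n; ++i) v[i] = 2.0f;  // 2 is a fixed point of x*0.5+1

    warp_specialized<<<1, kWarpsPerBlock * 32>>>(a, b, c, v, n, iters);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    if (ok) {
        // All-ones 16x16 inputs add 16 to every accumulator element per MMA.
        for (int i = 0; i < kMmaWarps * tile_elems; ++i) ok = ok && (c[i] == 16.0f * iters);
        for (int i = 0; i < n; ++i) ok = ok && (v[i] == 2.0f);
    }
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(v);
    return ok;
}
```

Calling `run_demo(4)` from a `main` function exercises both roles in one launch. Because the branch is uniform within each warp, neither role diverges, and each scheduler can pick between its MMA warp and its FMA warp every cycle. A production kernel would additionally stage data through shared memory and synchronize producer and consumer warps, which this sketch leaves out.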