Overlapping CUDA Cores and Tensor Cores

Hello, I am new to CUDA and I have a question about overlapping work on the CUDA Cores (ALU/FMA) and the Tensor Cores.

I have seen posts suggesting that these two kinds of operations can be overlapped because they use different execution/hardware units. I am writing a CUDA program and want to pipeline the operations so that they overlap. Can you guide me on how to determine whether they are actually overlapping? Any help would be greatly appreciated. Thank you.

It is possible to interleave CUDA Core (ALU/FMA) instructions with Tensor Core (MMA) instructions within a single warp; however, it is easier to have some warps on each SM sub-partition (warp scheduler) issue the CUDA Core instructions while a dedicated matrix-multiply warp issues the Tensor Core instructions. A single MMA warp per sub-partition can be written to reach 100% speed-of-light (SOL) on the Tensor Cores. See the CUTLASS documentation on warp specialization to understand this design pattern.
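To make the split concrete, here is a minimal sketch of that pattern for a single 16×16×16 WMMA tile (the kernel name `specialized`, the toy `scale` preprocessing, and the 3-producer/1-consumer warp split are illustrative, not taken from CUTLASS). Three warps do CUDA Core work into shared memory while a fourth warp issues the Tensor Core MMA; it should compile with `nvcc -arch=sm_80`:

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

constexpr int M = 16, N = 16, K = 16;  // one WMMA tile

// Launch with 1 block of 128 threads (4 warps): specialized<<<1, 128>>>(...)
__global__ void specialized(const half* A, const half* B, float* C, float scale)
{
    __shared__ half sA[M * K];
    __shared__ half sB[K * N];

    const int warp = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;

    // Producer warps 0-2: CUDA Core (FMA) preprocessing into shared memory.
    if (warp < 3) {
        for (int i = warp * 32 + lane; i < M * K; i += 3 * 32) {
            sA[i] = __hmul(A[i], __float2half(scale));  // toy preprocessing
            sB[i] = B[i];
        }
    }
    __syncthreads();  // hand the finished tile to the MMA warp

    // Consumer warp 3: Tensor Core MMA via the WMMA API.
    if (warp == 3) {
        wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, M, N, K, float> acc;
        wmma::fill_fragment(acc, 0.0f);
        wmma::load_matrix_sync(a, sA, K);
        wmma::load_matrix_sync(b, sB, K);
        wmma::mma_sync(acc, a, b, acc);
        wmma::store_matrix_sync(C, acc, N, wmma::mem_row_major);
    }
}
```

Note that with a single shared-memory buffer and one `__syncthreads()`, the two stages serialize within an iteration; the actual overlap comes from pipelining across tiles, e.g. double-buffering the shared tiles (with `cp.async` on Ampere) so the producer warps fill tile i+1 while the MMA warp consumes tile i.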

Thank you so much for the response! A few quick follow-up questions.

My use case first applies some preprocessing (on the CUDA Cores) before sending the data to the Tensor Cores. The workloads of the two stages can differ (e.g., the preprocessing is generally cheaper). Would warp specialization still be a good strategy in this case?

More importantly, I'm primarily targeting Ampere GPUs (at a minimum, I want the code to be compatible with Ampere). AFAIK, warp specialization is Hopper-only?