Concurrent execution of CUDA and Tensor cores

Generally no, that is not correct. There are at least two factors to consider:

  1. All instructions are pipelined and have latency. For example, a multiply instruction issued in cycle 0 may not produce its result until, say, cycle 4; the same is true of a tensor core op (e.g. wmma). So if an ordinary multiply is issued in cycle 0 and produces its result in cycle 4, while a tensor core op is issued in cycle 1 and produces its result in cycle 5 (for the sake of discussion), then during cycles 2 and 3 the SM's functional units are actively processing both ops at the same time.

  2. Many modern SMs are broken into sub-partitions. Each sub-partition has a warp scheduler. So in the exact same cycle it is possible for a warp scheduler in sub-partition 0 to issue a tensor core op, while a warp scheduler in sub-partition 1 issues an ordinary multiply. These instructions would target separate functional units, of course.
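Both factors can be sketched in a single kernel. The example below is illustrative only (pointer names, the 16x16x16 tile shape, and the launch geometry are assumptions, not from the original discussion): even-numbered warps drive the tensor cores via `wmma`, odd-numbered warps issue ordinary FP32 multiplies, so schedulers in different sub-partitions can issue to both unit types in the same cycle (factor 2); and within the tensor-core warps an independent FP32 multiply sits next to the `mma`, so both can be in flight in overlapping pipeline stages (factor 1). It requires compute capability 7.0+ (compile with e.g. `-arch=sm_70`).

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Sketch only: all even warps compute the same 16x16 tile here for brevity.
__global__ void mixed_units(const half *a, const half *b, float *c,
                            const float *u, float *v)
{
    int lane    = threadIdx.x % 32;
    int warp_id = threadIdx.x / 32;

    if ((warp_id & 1) == 0) {
        // Even warps: tensor core path (HMMA instructions).
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

        wmma::fill_fragment(fc, 0.0f);
        wmma::load_matrix_sync(fa, a, 16);
        wmma::load_matrix_sync(fb, b, 16);

        // Factor 1: this FP32 multiply has no dependence on the mma below,
        // so the two ops can occupy their pipelines in overlapping cycles
        // within the same sub-partition.
        float t = u[lane] * u[lane];

        wmma::mma_sync(fc, fa, fb, fc);          // tensor core op
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
        v[lane] = t;                             // keep the FP32 result live
    } else {
        // Factor 2: odd warps typically land in different sub-partitions
        // than even warps, so their schedulers can issue FMUL in the same
        // cycle that the even warps issue HMMA.
        int i = threadIdx.x;
        v[32 + i] = u[i] * u[i];                 // ordinary FP32 (CUDA core) work
    }
}
```

Launched with, say, `mixed_units<<<1, 64>>>(...)`, warp 0 exercises the tensor cores while warp 1 exercises the FP32 units; a profiler such as Nsight Compute can show both pipelines active.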