Concurrent execution of CUDA and Tensor cores

I am working on some DL optimization and want to pipeline tiles of matmul output from the Tensor cores into subsequent vector ops on the CUDA cores. Can I run the Tensor cores and CUDA cores concurrently on two different tiles of data in the register file/shared memory?

A tensor core instruction is an instruction like any other SASS instruction. A tensor core unit is a functional unit like any other functional unit in the GPU SM. A warp scheduler can theoretically issue a tensor core instruction in one cycle and any other instruction (or even another tensor core instruction) in the next cycle.

There are no inherent scheduling restrictions between tensor core activity and other types of GPU SM activity.

At the SASS level, tensor core operands, like most other instruction operands, come from the register file. It is up to the compiler to determine how it will use registers, and how many. It is also up to the compiler to determine how it will order and schedule instructions in the instruction stream.

You have essentially no control over any of this at the CUDA C++ or PTX level, and no tools whatsoever to perform source code programming at the SASS level.
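
As an illustration (a minimal single-tile sketch using the wmma API; the dimensions, strides, and the epilogue are placeholders, not tuned code), a kernel can freely mix tensor core math and ordinary FP32 math in source. How the resulting HMMA and FFMA instructions end up interleaved in SASS is entirely up to the compiler:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16x16 tile D = A*B, then applies an
// elementwise scale to the accumulator -- the "matmul followed by
// vector op" pattern from the question.
__global__ void matmul_with_epilogue(const half *A, const half *B,
                                     float *D, float scale)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    // Tensor core work: compiles to HMMA instructions in SASS.
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // "CUDA core" work: ordinary FP32 math on the accumulator, which
    // compiles to FMUL/FFMA instructions. The compiler decides how to
    // order these relative to the tensor core instructions.
    for (int i = 0; i < c_frag.num_elements; ++i)
        c_frag.x[i] *= scale;

    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}
```

Nothing you write at this level dictates per-cycle issue; you express the work, and the compiler and warp scheduler take it from there.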

To get a baseline understanding of how GPUs schedule work, I would suggest this series, particularly unit 3 (or units 1 through 4).

Thanks for such a great explanation. Follow up question:

Does it mean that technically nothing prohibits a tensor core instruction and a CUDA core instruction from being issued back to back, as long as the register file can keep up, but in practice the compiler does not generally schedule them this way? Is there a way to verify whether the compiler can schedule instructions such that the tensor cores and CUDA cores operate simultaneously?

Yes, that is what it means: technically, nothing prohibits a tensor core instruction and an ordinary instruction from being issued back to back.

As for the idea that "in practice the compiler does not generally schedule this way": I don't know where you got that. If the compiler did not schedule tensor core instructions along with other instructions, what else would it be doing? NOPs? Empty space? Maybe you are mixing up what the compiler does with what the warp scheduler does. The warp scheduler can indeed end up with "empty space"; we call that a stall. (Technically, it is a warp that stalls. But if all the warps assigned to a warp scheduler are stalled, then that SMSP (SM sub-partition) is effectively "stalled", i.e. unable to issue.)

The CUDA binary utilities allow you to inspect exactly what the compiler did.
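
For example (file names and the architecture flag below are only illustrative), you could compile the kernel to a cubin, dump the SASS, and look for the tensor core (HMMA) instructions interleaved with ordinary FP32 (FFMA/FMUL) instructions:

```
nvcc -arch=sm_80 -cubin matmul.cu -o matmul.cubin
cuobjdump -sass matmul.cubin
# or filter for the relevant opcodes:
cuobjdump -sass matmul.cubin | grep -E "HMMA|FFMA|FMUL"
```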

So if a tensor core instruction and a CUDA core instruction are issued back to back (1 cycle apart), does it mean the tensor core and the CUDA core are executing their instructions one cycle apart and not in parallel? In other words, can the SM execute either a tensor core instruction or a CUDA core instruction in any given cycle, but not both in the same cycle? Is that a correct understanding?

Generally no, that is not correct. There are at least two factors to consider:

  1. All instructions are pipelined and have latency. For example, a multiply instruction issued in cycle 0 may not produce its result until, say, cycle 4. The same is true of a tensor core op (e.g. wmma). So if an ordinary multiply is issued in cycle 0, producing its result in cycle 4, and a tensor core op is issued in cycle 1, producing its result in cycle 5 (for the sake of discussion), then during cycles 2 and 3 the SM's functional units are actively processing both ops at the same time.

  2. Many modern SMs are divided into sub-partitions, and each sub-partition has its own warp scheduler. So in the exact same cycle it is possible for the warp scheduler in sub-partition 0 to issue a tensor core op while the warp scheduler in sub-partition 1 issues an ordinary multiply. These instructions would target separate functional units, of course.
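
To make the second point concrete, here is a sketch (assumptions: a block with several warps, placeholder pointers and sizes, no attempt at a realistic tiling) where even warps do tensor core work and odd warps do ordinary FP32 work. The hardware distributes these warps across the SM's sub-partitions, so in a given cycle one sub-partition's scheduler may issue an HMMA while another issues an FFMA:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void split_warps(const half *A, const half *B, float *D,
                            float *v, float scale)
{
    int warp_id = threadIdx.x / 32;

    if (warp_id % 2 == 0) {
        // Even warps: tensor core path (HMMA). Each warp writes its
        // own 16x16 output tile to avoid overlapping stores.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
        wmma::fill_fragment(c, 0.0f);
        wmma::load_matrix_sync(a, A, 16);
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(c, a, b, c);
        wmma::store_matrix_sync(D + (warp_id / 2) * 256, c, 16,
                                wmma::mem_row_major);
    } else {
        // Odd warps: ordinary FP32 path (FFMA on the "CUDA cores").
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        v[i] = v[i] * scale + 1.0f;
    }
}
```

Whether both actually issue in the same cycle depends on which eligible warps the schedulers pick from moment to moment, but nothing in the hardware prevents it.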


On the second point, can you give me an example? Thank you!

Refer to page 22 here. That is a picture of a single A100 SM; note that there are 4 warp schedulers.

Ok, thank you. I think I understand what you said.