I need help understanding how the concurrency of CUDA cores and Tensor Cores works, and whether it differs between Turing and Ampere/Ada.

There isn’t much difference between Turing, Ampere and Ada in this area.

This question comes up in various forms from time to time; here is a recent thread. It's also necessary to have a basic understanding of how instructions are issued and how work is scheduled in CUDA GPUs; unit 3 of this online training series covers some of that.

I would also offer a caveat: when we get down into a detailed cycle-by-cycle investigation, such as in the linked thread above, no CUDA programmer has control over the cycle-by-cycle issuance of machine instructions. The machine handles that.

Finally, it's useful to keep in mind that the tensor core (TC) units are execution units in a GPU SM that, at a high level, behave similarly to other execution units (e.g. the LD/ST, FP32, FP64, and ALU pipes). A TC instruction gets issued warp-wide to a TC unit, just like an FMUL, FFMA, or FADD instruction gets issued warp-wide to an FP32 unit, otherwise known as a "CUDA core".
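To make the analogy concrete, here is a minimal sketch of mine (not from the thread) in which one warp issues both kinds of work. The fragment shapes and calls are from the documented nvcuda::wmma API (sm_70 and newer); the kernel name, matrix contents, and launch shape are arbitrary.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with a single warp (32 threads): wmma operations are warp-wide.
__global__ void tc_vs_fp32(const half *a, const half *b, float *c,
                           float x, float y, float *out)
{
    // All 32 threads cooperate: mma_sync compiles to HMMA instructions,
    // which the warp scheduler issues to the SMSP's TC unit.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, 0.0f);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);

    // By contrast, this fused multiply-add compiles to an FFMA instruction,
    // issued warp-wide to the FP32 pipe ("CUDA cores"): same issue model,
    // different execution unit.
    out[threadIdx.x] = fmaf(x, y, out[threadIdx.x]);
}
```

Compiled for sm_70 or newer, the mma_sync lowers to HMMA (TC unit) and the fmaf to FFMA (FP32 pipe); both are issued warp-wide by the SMSP's scheduler.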

Every SMSP (SM sub-partition) has access to a roughly similar set of resources. If SMSP0 has a TC unit that its warp scheduler can issue to, then almost certainly SMSP1 has a TC unit it can issue to as well.

There is no necessary relation between what type of instruction gets issued in a particular clock cycle on SMSP0 and what gets issued in that clock cycle on any other SMSP in the SM. One SMSP can issue a TC instruction while another issues an FMUL, an LD, an ST, another TC instruction, or what have you.

With respect to "FP16 code but on the tensor cores": an instruction either targets a TC unit or it doesn't. A HADD (16-bit floating-point add) or HMUL (16-bit floating-point multiply) will never be sent to a TC unit, ever. To issue work to a TC unit, the instruction must be a TC instruction, such as HMMA. But in principle, each SMSP (i.e., its warp scheduler) could issue an HMMA instruction in the same clock cycle, because the TC units are separate/dedicated per SMSP.
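As an illustration (a sketch of mine, not from the original post), plain __half arithmetic compiles to HADD/HMUL on the FP16 pipe, no matter how the data is arranged:

```cuda
#include <cuda_fp16.h>

__global__ void fp16_math(const __half *a, const __half *b, __half *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // __hmul/__hadd -> HMUL/HADD instructions: FP16 pipe, not tensor cores.
    c[i] = __hadd(__hmul(a[i], b[i]), b[i]);
    // Only a matrix-multiply intrinsic (e.g. wmma::mma_sync, as sketched
    // earlier) compiles to HMMA and therefore targets the TC unit.
}
```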

I think I have answered this already. TC units are dedicated/specific to their SMSP, just like most other functional units in a GPU SM.

I don't really understand this one. FP32/INT32 instructions, and FP16 instructions apart from TC instructions, do not get issued to TC units. There are no instructions that manipulate FP32 or INT32 data on a TC unit, except as a byproduct or derivative. (For example, FP16 data passed to a TC unit can accumulate into an FP32 datum, and INT8 data passed for an integer matrix-matrix multiply accumulates into an INT32 datum.)
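Here is a sketch of mine (requires Turing, sm_72, or newer) of that "byproduct" case: INT8 operands with an INT32 accumulator, using the documented integer wmma shapes.

```cuda
#include <mma.h>
using namespace nvcuda;

// Launch with a single warp (32 threads).
__global__ void imma_demo(const signed char *a, const signed char *b, int *c)
{
    // s8 x s8 inputs, s32 accumulate: INT32 data exists on the TC unit only
    // as the accumulator of the integer matrix-multiply, not because any
    // general INT32 instruction runs there.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> fc;
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, 0);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```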

You could issue an INT32 or FP32 instruction in one SMSP and, in the same clock cycle, issue a TC instruction in another SMSP.
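One way that situation can arise naturally, sketched below (my example, with arbitrary names and sizes): a kernel whose warps do different kinds of work. Launched with 128 threads per block, the four warps are distributed across the SM's sub-partitions, so one scheduler may issue HMMA in the same cycle that another issues FFMA. Note there is no way to force this from source code; the hardware schedulers decide.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with 128 threads per block (4 warps).
__global__ void mixed_warps(const half *a, const half *b, float *c, float *f)
{
    int warp = threadIdx.x / 32;
    if (warp == 0) {
        // Warp 0: TC instructions (HMMA) on its SMSP's TC unit.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::load_matrix_sync(fa, a, 16);
        wmma::load_matrix_sync(fb, b, 16);
        wmma::fill_fragment(fc, 0.0f);
        wmma::mma_sync(fc, fa, fb, fc);
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    } else {
        // Warps 1-3: FP32 FMAs (FFMA) on their SMSPs' FP32 pipes.
        float v = f[threadIdx.x];
        for (int i = 0; i < 64; ++i)
            v = fmaf(v, 1.0001f, 0.5f);
        f[threadIdx.x] = v;
    }
}
```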

"At the same time" requires unpacking. All functional units on a GPU SM are pipelined. That means you might issue an instruction in clock cycle 0, but the results are not ready until clock cycle 6. During each of those 6 clock cycles, the pipelined nature of instruction execution means the instruction issued in cycle 0 is being worked on, in stages. For a single SMSP, in cycle 0 that is all you can do: issue a single instruction. On other SMSPs, in cycle 0, other instructions could be issued. Now suppose SMSP0 issues a TC instruction in cycle 0, and then in cycle 1 an FMUL (32-bit floating-point multiply). Do we say those two instructions are being handled "at the same time"? You decide.