I need help understanding how concurrency of CUDA Cores and Tensor Cores works, and whether it differs between Turing and Ampere/Ada.

On Ampere/Ada generation GPUs, the SMs are subdivided into 4 SMSPs, each with its own allocated register file, tensor cores, warp scheduler, etc. When it comes to issuing instructions within an SM, if I’m not mistaken you can have SMSP0 and SMSP1 issue and execute FP32/INT32 code; but can you have SMSP2 and SMSP3 issue and execute FP16 code, but on the tensor cores?

And can it all be issued at the same time by their warp schedulers, or does it happen in a cyclic fashion, where SMSP0 goes first and, for example, completes by cycle 4, SMSP1 starts at the second cycle and completes at cycle 5, and so on in a sequential fashion?

If they can all be issued at the same time due to having individual SMSPs within the whole SM, I have a follow-up question: can you issue FP32/INT32 for vector workloads at the same time as FP16, also for vector workloads, on the Tensor Cores? Or is this level of concurrency limited to scalar workloads?

Would the CUDA cores work at the same time as the Tensor Cores within an SM, executing FP32/INT32 and FP16/INT8 respectively, for both vector and scalar workloads? And if there are differences in how Turing and Ampere/Ada handle this behavior, what are they?

There isn’t much difference between Turing, Ampere and Ada in this area.

This question comes up from time to time in various forms; here is a recent thread. It’s also necessary to have a basic understanding of how instructions are issued and how work is scheduled in CUDA GPUs; unit 3 of this online training series covers some of that.

I would also like to offer a caveat: when we get down to a detailed cycle-by-cycle investigation, such as in the linked thread above, no CUDA programmer has control over the detailed cycle-by-cycle issuance of machine instructions. The machine handles that.

Finally, it’s useful to keep in mind that the tensor core (TC) units are execution units in a GPU SM that, at a high level, behave similarly to other execution units (e.g. LD/ST, FP32 pipe, FP64 pipe, ALU pipe, etc.). A TC instruction gets issued warp-wide to a TC unit, just like an FMUL, FMA, or FADD instruction gets issued warp-wide to an FP32 unit, otherwise known as a “CUDA core”.

Every SMSP has access to roughly a similar set of resources. If SMSP0 has a TC unit that it can issue to (that its warp scheduler can issue to) then almost certainly SMSP1 has a TC unit it can issue to, also.

There is no necessary relation between what type of instruction gets issued in a particular clock cycle on SMSP0 vs. what gets issued in that clock cycle on any other SMSP in that SM. One SMSP can issue a TC instruction while another issues an FMUL, an LD, an ST, another TC instruction, or what have you.

With respect to “FP16 code but on the tensor cores”: an instruction is either going to target a TC unit, or it isn’t. A HADD (16-bit floating-point add) or HMUL (16-bit floating-point multiply) will never get sent to a TC unit, ever. To issue work to a TC unit, it must be a TC instruction, such as HMMA. But theoretically, each SMSP (i.e. its warp scheduler) could issue a HMMA instruction in the same clock cycle, because they have separate/dedicated TC units per SMSP.
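To make that distinction concrete, here is a minimal CUDA C++ sketch of my own (illustrative only, not from any of the linked material; it assumes sm_70 or newer and a launch with at least one full warp). Ordinary __half arithmetic such as __hadd compiles to HADD/HFMA-type instructions on the FP16 pipe, while wmma::mma_sync compiles to HMMA instructions, which are what actually get issued warp-wide to the TC units.

```cpp
// Illustrative sketch only (assumes sm_70+ and at least one full warp per
// block). __hadd is ordinary FP16 arithmetic: it becomes HADD-type
// instructions on the FP16 pipe and never touches a TC unit. wmma::mma_sync
// becomes HMMA instructions, which the warp scheduler issues to the TC unit.
// The load/store_matrix_sync calls are ordinary load/store work that merely
// stages data in registers for the HMMA instruction.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void fp16_pipe_kernel(const __half *a, const __half *b, __half *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hadd(a[i], b[i]);          // HADD: FP16 pipe, not a TC instruction
}

__global__ void tc_kernel(const half *A, const half *B, float *C)
{
    // One warp computes a 16x16x16 matrix multiply-accumulate on its TC unit.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;   // FP16 in, FP32 accumulate

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // ordinary loads into registers
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // HMMA: issued to the TC unit
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```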

I think I have answered this already. TC units are dedicated/specific to their SMSP, just like most other functional units in a GPU SM.

Don’t really understand. FP32/INT32 instructions and FP16 instructions, apart from TC instructions, do not get issued to TC units. There are no instructions that manipulate FP32 or INT32 data on a TC unit, except as a byproduct or derivative. (For example, FP16 data passed to a TC unit for matrix-matrix multiply can accumulate into an FP32 datum, and INT8 data for integer matrix-matrix multiply accumulates into an INT32 datum.)

You could issue an INT32 or FP32 instruction in one SMSP and, in the same clock cycle, issue a TC instruction in another SMSP.

“At the same time” requires unpacking. All functional units on a GPU SM are pipelined. That means you might issue an instruction in clock cycle 0, but the results are not ready until clock cycle 6. During each of these 6 clock cycles, the pipelined nature of instruction execution means that the instruction you issued in cycle 0 is being worked on, in stages. Regarding a single SMSP, in cycle 0, that is all you can do: issue a single instruction. Regarding other SMSPs, in cycle 0, other instructions could be issued. However, suppose that in SMSP0 we issue a TC instruction in cycle 0, and then in cycle 1 an FMUL instruction (32-bit floating-point multiply) gets issued. Do we say those two instructions are being handled “at the same time”? You decide.
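As a purely illustrative sketch (my own, not from the linked thread) of what independent TC and FP32 work can look like from CUDA C++: warp 0 below issues HMMA (TC) instructions while warp 1 issues FP32 FMA instructions. Which SMSP each warp resides on, and in which cycle each instruction issues, is decided by the hardware warp schedulers; the programmer only supplies independent work that can be issued concurrently.

```cpp
// Illustrative sketch only (assumes sm_70+, launched with exactly 64 threads
// per block, i.e. two warps). Warp 0 performs a tensor core matrix
// multiply-accumulate (HMMA); warp 1 performs plain FP32 FMAs on the FP32
// pipe ("CUDA cores"). The hardware decides where each warp lives and when
// each instruction issues; both kinds of work can be in flight in the same
// SM at the same time.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void mixed_work(const half *A, const half *B, float *C,
                           const float *x, float *y, int n)
{
    int warp_id = threadIdx.x / 32;
    int lane    = threadIdx.x % 32;

    if (warp_id == 0) {
        // Tensor core work: one 16x16x16 matrix multiply-accumulate.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
        wmma::fill_fragment(acc, 0.0f);
        wmma::load_matrix_sync(a, A, 16);
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);                    // HMMA on the TC unit
        wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    } else {
        // FP32 work: plain fused multiply-adds on the FP32 pipe.
        for (int i = lane; i < n; i += 32)
            y[i] = fmaf(2.0f, x[i], y[i]);                 // FFMA on "CUDA cores"
    }
}
```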

Don’t really understand. FP32/INT32 instructions and FP16 instructions, apart from TC instructions, do not get issued to TC units. There are no instructions that manipulate FP32 or INT32 data on a TC unit, except as a byproduct or derivative. (For example, FP16 data passed to a TC unit for matrix-matrix multiply can accumulate into an FP32 datum, and INT8 data for integer matrix-matrix multiply accumulates into an INT32 datum.)

Sorry, I should have clarified. I meant that if in SMSP0 a vector workload is issued to be handled, and in SMSP1 what’s issued is a vector load to be handled by the Tensor Cores, A) is that possible, and B) can they be issued at the same time? But also, are you saying that the Tensor Cores cannot handle vector workloads? If yes, that would address part of the above question. (Although it would be a bit contrary to some other comments and docs I’ve seen.)

With respect to “FP16 code but on the tensor cores”: an instruction is either going to target a TC unit, or it isn’t. A HADD (16-bit floating-point add) or HMUL (16-bit floating-point multiply) will never get sent to a TC unit, ever.

So, in the case of performing mixed precision, how would this work out for graphics? Within the SM, can the 4 SMSPs handle both FP16 on the tensor cores and FP32/INT32 on the CUDA Cores for graphical work, at the same time? Or am I misunderstanding it?

I have no idea what that is. A Tensor Core unit handles matrix-matrix multiply instructions only.

Tensor Cores don’t handle vector loads, or loads of any kind. Certain instructions exist that load data into a register patch for use by a tensor core instruction (i.e. a matrix-matrix multiply instruction), but these are not actually using the tensor core units to do the loading work.

Yes, they can do that for compute work or for graphics work, using previously supplied definitions of “at the same time” (1. in the same clock cycle, in different SMSPs, or 2. in different clock cycles in the same SMSP).

I have no idea what that is. A Tensor Core unit handles matrix-matrix multiply instructions only.

I may have worded it incorrectly. I was talking about this: “Vector processing is a computer method that can process numerous data components at once. It operates on every element of the entire vector in one operation, or in parallel, to avoid the overhead of the processing loop.”

Vector Processing | InfluxData.

Honestly, it seems like you’re making things up. Tensor Cores don’t handle vector loads, or loads of any kind. Certain instructions exist that load data into a register patch for use by a tensor core instruction (i.e. a matrix-matrix multiply instruction), but these are not actually using the tensor core units to do the loading work.

Not loading as in “it is loaded in”; I meant a workload or a task, in this context. I apologize if this seems directionless, but I’m more trying to understand how the Tensor Cores are involved within the SM and in rendering, and whether you can offload a task that doesn’t need precision onto those cores, where they offer huge compute to accelerate it but at a lower level of precision, besides something like DLSS. Considering the previous responses, I’m going to assume no, unless it is an instruction in a format suited for tensor core work to begin with; but in that case, why even have it on the CUDA cores to begin with? In that scenario it was going to go to the tensor cores anyway, since it’s an instruction for the Tensor Cores. But I digress.

The main source of my confusion is looking at things like the GA102 whitepaper and seeing lines like this:

“The GA10x SM continues to support double-speed FP16 (HFMA) operations which are supported in Turing. And similar to TU102, TU104, and TU106 Turing GPUs, standard FP16 operations are handled by the Tensor Cores in GA10x GPUs.”

Which indicates that “standard” FP16 workloads (i.e. not accumulative or sparsity operations) are directed onto the Tensor Cores, working at a 1:1 rate with the FP32 of the CUDA cores. Unless I’m misunderstanding some of the terminology in the GA102 whitepaper.
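For reference (my own illustrative snippet, not from the whitepaper), by “standard FP16 operations” I mean plain half/half2 arithmetic like the HFMA below, as opposed to HMMA matrix math:

```cpp
// What I mean by "standard FP16 operations": plain half2 arithmetic
// (HFMA2-type instructions), not HMMA matrix instructions. Illustrative only.
#include <cuda_fp16.h>

__global__ void axpy_fp16(__half2 a, const __half2 *x, __half2 *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);   // standard FP16 fused multiply-add
}
```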

Yes, they can do that for compute work or for graphics work, using previously supplied definitions of “at the same time” (1. in the same clock cycle, in different SMSPs, or 2. in different clock cycles in the same SMSP).

I see, thank you.