I need help understanding how the concurrency of CUDA cores and Tensor Cores works, and whether it differs between Turing and Ampere/Ada.

There isn’t much difference between Turing, Ampere and Ada in this area.

This question comes up in various forms from time to time; here is a recent thread. It's also necessary to have a basic understanding of how instructions are issued and how work is scheduled in CUDA GPUs; unit 3 of this online training series covers some of that.

I would also offer a caveat: when we get down into a detailed cycle-by-cycle investigation, such as in the linked thread above, no CUDA programmer has control over the cycle-by-cycle issuance of machine instructions. The machine handles that.

Finally, it's useful to keep in mind that the tensor core (TC) units are execution units in a GPU SM that, at a high level, behave similarly to other execution units (e.g. the LD/ST, FP32, FP64, and ALU pipes). A TC instruction gets issued warp-wide to a TC unit, just like an FMUL, FFMA, or FADD instruction gets issued warp-wide to an FP32 unit, otherwise known as a "CUDA core".
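To make the analogy concrete, here is a minimal sketch of mine (not from the thread) in which one warp issues both kinds of work. The fragment shapes and calls are from the documented nvcuda::wmma API (sm_70 and newer); the kernel name, matrix contents, and launch shape are arbitrary.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with a single warp (32 threads): wmma operations are warp-wide.
__global__ void tc_vs_fp32(const half *a, const half *b, float *c,
                           float x, float y, float *out)
{
    // All 32 threads cooperate: mma_sync compiles to HMMA instructions,
    // which the warp scheduler issues to the SMSP's TC unit.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, 0.0f);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);

    // By contrast, this fused multiply-add compiles to an FFMA instruction,
    // issued warp-wide to the FP32 pipe ("CUDA cores"): same issue model,
    // different execution unit.
    out[threadIdx.x] = fmaf(x, y, out[threadIdx.x]);
}
```

Compiled for sm_70 or newer, the mma_sync lowers to HMMA (TC unit) and the fmaf to FFMA (FP32 pipe); both are issued warp-wide by the SMSP's scheduler.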

Every SMSP (SM sub-partition) has access to a roughly similar set of resources. If SMSP0 has a TC unit that its warp scheduler can issue to, then almost certainly SMSP1 has a TC unit it can issue to as well.

There is no necessary relation between what type of instruction gets issued in a particular clock cycle on SMSP0 and what gets issued in that clock cycle on any other SMSP in the SM. One SMSP can issue a TC instruction while another issues an FMUL, an LD, an ST, another TC instruction, or what have you.

With respect to "FP16 code but on the tensor cores": an instruction either targets a TC unit or it doesn't. A HADD (16-bit floating-point add) or HMUL (16-bit floating-point multiply) will never be sent to a TC unit, ever. To issue work to a TC unit, the instruction must be a TC instruction, such as HMMA. But in principle, each SMSP (i.e., its warp scheduler) could issue an HMMA instruction in the same clock cycle, because the TC units are separate/dedicated per SMSP.
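As an illustration (a sketch of mine, not from the original post), plain __half arithmetic compiles to HADD/HMUL on the FP16 pipe, no matter how the data is arranged:

```cuda
#include <cuda_fp16.h>

__global__ void fp16_math(const __half *a, const __half *b, __half *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // __hmul/__hadd -> HMUL/HADD instructions: FP16 pipe, not tensor cores.
    c[i] = __hadd(__hmul(a[i], b[i]), b[i]);
    // Only a matrix-multiply intrinsic (e.g. wmma::mma_sync, as sketched
    // earlier) compiles to HMMA and therefore targets the TC unit.
}
```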

I think I have answered this already. TC units are dedicated/specific to their SMSP, just like most other functional units in a GPU SM.

I don't really understand this one. FP32/INT32 instructions, and FP16 instructions apart from TC instructions, do not get issued to TC units. There are no instructions that manipulate FP32 or INT32 data on a TC unit, except as a byproduct or derivative. (For example, FP16 data passed to a TC unit can accumulate into an FP32 datum, and INT8 data passed for an integer matrix-matrix multiply accumulates into an INT32 datum.)
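Here is a sketch of mine (requires Turing, sm_72, or newer) of that "byproduct" case: INT8 operands with an INT32 accumulator, using the documented integer wmma shapes.

```cuda
#include <mma.h>
using namespace nvcuda;

// Launch with a single warp (32 threads).
__global__ void imma_demo(const signed char *a, const signed char *b, int *c)
{
    // s8 x s8 inputs, s32 accumulate: INT32 data exists on the TC unit only
    // as the accumulator of the integer matrix-multiply, not because any
    // general INT32 instruction runs there.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> fc;
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, 0);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```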

You could issue an INT32 or FP32 instruction in one SMSP and, in the same clock cycle, issue a TC instruction in another SMSP.
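One way that situation can arise naturally, sketched below (my example, with arbitrary names and sizes): a kernel whose warps do different kinds of work. Launched with 128 threads per block, the four warps are distributed across the SM's sub-partitions, so one scheduler may issue HMMA in the same cycle that another issues FFMA. Note there is no way to force this from source code; the hardware schedulers decide.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with 128 threads per block (4 warps).
__global__ void mixed_warps(const half *a, const half *b, float *c, float *f)
{
    int warp = threadIdx.x / 32;
    if (warp == 0) {
        // Warp 0: TC instructions (HMMA) on its SMSP's TC unit.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::load_matrix_sync(fa, a, 16);
        wmma::load_matrix_sync(fb, b, 16);
        wmma::fill_fragment(fc, 0.0f);
        wmma::mma_sync(fc, fa, fb, fc);
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    } else {
        // Warps 1-3: FP32 FMAs (FFMA) on their SMSPs' FP32 pipes.
        float v = f[threadIdx.x];
        for (int i = 0; i < 64; ++i)
            v = fmaf(v, 1.0001f, 0.5f);
        f[threadIdx.x] = v;
    }
}
```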

"At the same time" requires unpacking. All functional units on a GPU SM are pipelined. That means you might issue an instruction in clock cycle 0, but the results are not ready until clock cycle 6. During each of those 6 clock cycles, the pipelined nature of instruction execution means the instruction issued in cycle 0 is being worked on, in stages. For a single SMSP, in cycle 0 that is all you can do: issue a single instruction. On other SMSPs, in cycle 0, other instructions could be issued. Now suppose SMSP0 issues a TC instruction in cycle 0, and then in cycle 1 an FMUL (32-bit floating-point multiply). Do we say those two instructions are being handled "at the same time"? You decide.