I need help understanding how the concurrency of CUDA Cores and Tensor Cores works, and how it differs between Turing and Ampere/Ada.

On Ampere/Ada generation GPUs, the SMs are subdivided into 4 SMSPs, each with its own allocated register file, Tensor Cores, warp scheduler, etc. When it comes to issuing an instruction within an SM, if I’m not mistaken you can have SMSP0 and SMSP1 issue and execute FP32/INT32 code, but can you have SMSP2 and SMSP3 issue and execute FP16 code on the Tensor Cores?

And can it all be issued at the same time by their warp schedulers, or does it happen in a cyclic fashion where, for example, SMSP0 goes first and completes by cycle 4, SMSP1 starts at the second cycle and completes at cycle 5, and so on in a sequential fashion?

If they can all be issued at the same time because each SMSP within the SM is individual, I have a follow-up question: can you issue FP32/INT32 for vector workloads at the same time as FP16, also for vector workloads, on the Tensor Cores? Or is this level of concurrency limited to scalar workloads?

Would the CUDA Cores work at the same time as the Tensor Cores within an SM, executing FP32/INT32 and FP16/INT8 respectively, for both vector and scalar workloads? And if there are differences in how Turing and Ampere/Ada handle this behavior, what are they?

There isn’t much difference between Turing, Ampere and Ada in this area.

This question comes up in various forms from time to time; here is a recent thread. It’s also necessary to have a basic understanding of how instructions are issued and how work is scheduled in CUDA GPUs; unit 3 of this online training series covers some of that.

I would also like to say, as a caveat, that when we get down into a detailed cycle-by-cycle investigation, such as in the linked thread above, no CUDA programmer has control over the detailed cycle-by-cycle issuance of machine instructions. The machine handles that.

Finally, it’s useful to keep in mind that the Tensor Core (TC) units are execution units in a GPU SM that at a high level behave similarly to other execution units (e.g. LD/ST, FP32 pipe, FP64 pipe, ALU pipe, etc.). A TC instruction gets issued warp-wide to a TC unit, just like an FMUL, FMA, or FADD instruction gets issued warp-wide to an FP32 unit, otherwise known as a “CUDA core”.

Every SMSP has access to roughly a similar set of resources. If SMSP0 has a TC unit that it can issue to (that its warp scheduler can issue to) then almost certainly SMSP1 has a TC unit it can issue to, also.

There is no necessary relation between what type of instruction gets issued in a particular clock cycle on SMSP0 vs. what gets issued in that clock cycle on any other SMSP in that SM. One SMSP can issue a TC instruction while another can issue a FMUL, or a LD, or a ST, or another TC instruction, or what have you.

With respect to “FP16 code but on the tensor cores” an instruction is either going to target a TC unit, or it isn’t. A HADD (16 bit floating point add) or HMUL (16 bit floating point multiply) will never get sent to a TC unit, ever. To issue work to a TC unit, it must be a TC instruction, such as HMMA. But theoretically, each SMSP (i.e. their warp schedulers) could issue a HMMA instruction in the same clock cycle, because they have separate/dedicated TC units per SMSP.
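For concreteness, here is a minimal sketch of what a TC instruction looks like from CUDA C++ (the kernel name and the 16x16x16 tile size are just illustrative; assumes sm_70 or newer). The mma_sync call in the WMMA API is what lowers to HMMA; the fragment loads and stores lower to ordinary memory instructions.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 FP16 matrix multiply-accumulate.
// The mma_sync call compiles to HMMA instructions (the TC instructions);
// the load/store_matrix_sync calls compile to regular memory instructions.
__global__ void wmma_16x16x16(const __half *A, const __half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // memory instructions, not TC work
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);     // issued to the TC unit (HMMA)
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```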

I think I have answered this already. TC units are dedicated/specific to their SMSP, just like most other functional units in a GPU SM.

Don’t really understand. FP32/INT32 instructions and FP16 instructions, apart from TC instructions, do not get issued to TC units. There are no instructions that manipulate FP32 or INT32 data on a TC unit, except as a byproduct or derivative. (For example INT8 data passed to a TC unit for integer matrix-matrix multiply can accumulate into an FP32 datum.)

You could issue an INT32 or FP32 instruction in one SMSP, and in the same clock cycle, in another SMSP, issue a TC instruction.

“At the same time” requires unpacking. All functional units on a GPU SM are pipelined. That means that you might issue an instruction in clock cycle 0, but the results are not ready until clock cycle 6. During each of these 6 clock cycles, the pipelined nature of instruction execution means that the instruction you issued in cycle 0 is being worked on, in stages. Regarding a single SMSP, in cycle 0, that is all you can do: issue a single instruction. Regarding other SMSPs, in cycle 0, other instructions could be issued. However, suppose we issue a TC instruction in SMSP0 in cycle 0, and then in cycle 1 an FMUL instruction (32-bit floating-point multiply) gets issued. Do we say those two instructions are being handled “at the same time”? You decide.

Don’t really understand. FP32/INT32 instructions and FP16 instructions, apart from TC instructions, do not get issued to TC units. There are no instructions that manipulate FP32 or INT32 data on a TC unit, except as a byproduct or derivative. (For example INT8 data passed to a TC unit for integer matrix-matrix multiply can accumulate into an FP32 datum.)

Sorry, I should have clarified. I meant that if in SMSP0 a vector workload is issued to be handled, and in SMSP1 what’s issued is a vector load to be handled by the Tensor Cores, A) is that possible, and B) can they be issued at the same time? But also, are you saying that the Tensor Cores cannot handle vector workloads? If yes, that would address part of the above question. (Although that would be a bit contrary to some other comments and docs I’ve seen.)

With respect to “FP16 code but on the tensor cores” an instruction is either going to target a TC unit, or it isn’t. A HADD (16 bit floating point add) or HMUL (16 bit floating point multiply) will never get sent to a TC unit, ever.

So, in the case of mixed precision, how would this work out for graphics? Within an SM, can the 4 SMSPs handle both FP16 on the Tensor Cores and FP32/INT32 on the CUDA Cores for graphical work, at the same time? Or am I misunderstanding it?

I have no idea what that is. A Tensor Core unit primarily handles matrix-matrix multiply instructions. On Turing at least (see below) there may be other types of FP16 ops that are somehow processed by the TC units.

Tensor Cores don’t handle vector loads, or loads of any kind. Certain instructions exist that load data into a register patch for use by a tensor core instruction (i.e. a matrix-matrix multiply instruction), but these are not actually using the tensor core units to do the loading work.

Yes, they can do that for compute work or for graphics work, using previously supplied definitions of “at the same time” (1. in the same clock cycle, in different SMSPs, or 2. in different clock cycles in the same SMSP).

I have no idea what that is. A Tensor Core unit handles matrix-matrix multiply instructions only.

I may have worded it incorrectly. I was talking about this: “Vector processing is a computer method that can process numerous data components at once. It operates on every element of the entire vector in one operation, or in parallel, to avoid the overhead of the processing loop.”

Vector Processing | InfluxData.

Honestly, it seems like you’re making things up. Tensor Cores don’t handle vector loads, or loads of any kind. Certain instructions exist that load data into a register patch for use by a tensor core instruction (i.e. a matrix-matrix multiply instruction), but these are not actually using the tensor core units to do the loading work.

Not loading as in data being loaded in; I meant a workload or a task in this context. I apologize if this seems directionless, but I’m trying to understand how the Tensor Cores are involved within the SM and in rendering, and whether you can offload a task that doesn’t need precision to those cores, since they offer a huge amount of compute to accelerate it, but at a lower level, besides something like DLSS. Considering the previous responses, I’m going to assume no, unless it is an instruction in a format suited for Tensor Core work to begin with; but in that case, why even have it on the CUDA cores to begin with? In that scenario it was going to go to the Tensor Cores anyway, since it’s an instruction for the Tensor Cores. But I digress.

The main source of my confusion is looking at things like the GA102 whitepaper and seeing lines like this:

The GA10x SM continues to support double-speed FP16 (HFMA) operations which are supported in Turing. And similar to TU102, TU104, and TU106 Turing GPUs, standard FP16 operations are handled by the Tensor Cores in GA10x GPUs.

Which indicates that “standard” FP16 workloads, i.e. not accumulative or sparsity operations, are directed onto the Tensor Cores, working at a 1:1 rate with the FP32 of the CUDA Cores. Unless I’m misunderstanding some of the terminology in the GA102 whitepaper.

Yes, they can do that for compute work or for graphics work, using previously supplied definitions of “at the same time” (1. in the same clock cycle, in different SMSPs, or 2. in different clock cycles in the same SMSP).

I see, thank you.

According to NVIDIA, they handle regular FP16 ops, so what’s up with that?

I’m not sure what you are asking. Vector types are defined in the CUDA header file vector_types.h.

On a GPU, a load instruction (e.g. LD, LDG, LDS, etc.) is something that moves data from memory to a register. Load instructions are not handled by Tensor Core (TC) hardware; they are handled by the LSU (load/store unit). This is true for vector loads (loading of a vector type into a set of registers) as well.

Once data is in register(s), the TC ops can operate on that data (i.e., do matrix-multiply operations).
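As a small illustration (a minimal sketch, with arbitrary names): a vectorized load of a float4 typically compiles to a single 128-bit LDG handled by the LSU, and only the arithmetic that follows is issued to a math pipe.

```cuda
// Each thread does one 128-bit vector load (LSU work) and a few FP32 adds
// (FP32 pipe work). The tensor cores are not involved at any point.
__global__ void vec_load_demo(const float4 *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 v = in[i];                  // typically a single LDG.E.128 (LSU)
    out[i] = v.x + v.y + v.z + v.w;    // FADD instructions (FP32 pipe)
}
```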

Perhaps there is more to it:

Something that escaped my attention with the original TU102 GPU and the RTX 2080 Ti was that for Turing, NVIDIA changed how standard FP16 operations were handled. Rather than processing it through their FP32 CUDA cores, as was the case for GP100 Pascal and GV100 Volta, NVIDIA instead started routing FP16 operations through their tensor cores.

The tensor cores are of course FP16 specialists, and while sending standard (non-tensor) FP16 operations through them is major overkill, it’s certainly a valid route to take with the architecture. In the case of the Turing architecture, this route offers a very specific perk: it means that NVIDIA can dual-issue FP16 operations with either FP32 operations or INT32 operations, essentially giving the warp scheduler a third option for keeping the SM partition busy. Note that this doesn’t really do anything extra for FP16 performance – it’s still 2x FP32 performance – but it gives NVIDIA some additional flexibility.

Of course, as we just discussed, the Turing Minor does away with the tensor cores in order to allow for a leaner GPU. So what happens to FP16 operations? As it turns out, NVIDIA has introduced dedicated FP16 cores!

These FP16 cores are brand new to Turing Minor, and have not appeared in any past NVIDIA GPU architecture. Their purpose is functionally the same as running FP16 operations through the tensor cores on Turing Major: to allow NVIDIA to dual-issue FP16 operations alongside FP32 or INT32 operations within each SM partition. And because they are just FP16 cores, they are quite small. NVIDIA isn’t giving specifics, but going by throughput alone they should be a fraction of the size of the tensor cores they replace.

To users and developers this shouldn’t make a difference – CUDA and other APIs abstract this and FP16 operations are simply executed wherever the GPU architecture intends for them to go – so this is all very transparent. But it’s a neat insight into how NVIDIA has optimized Turing Minor for die size while retaining the basic execution flow of the architecture.

Now the bigger question in my mind: why is it so important to NVIDIA to be able to dual-issue FP32 and FP16 operations, such that they’re willing to dedicate die space to fixed FP16 cores? Are they expecting these operations to be frequently used together within a thread? Or is it just a matter of execution ports and routing? But that is a question we’ll have to save for another day.

You said tensor cores only handle matrix ops, but NVIDIA says they also handle regular FP16 ops on Turing, Ampere and Ada. I hope we get some clarification on why there is a discrepancy here.

I’ve edited my previous comments. I think the discrepancy is now removed.

If you are trying to understand the microarchitecture of the SM, the first step is to never use the terms CUDA core or Tensor Core, as these are marketing terms that mainly convey a magnitude of performance.

At a high level each SM consists of the following hierarchy (this can change slightly between architectures):

  • SM Core
    • High level warp scheduler that is concerned with overall scheduling policy.
  • SM Sub-partitions (4)
    • Warp scheduler that is responsible for cycle by cycle scheduling.
    • General register file (allocated per warp across all 32 lanes)
    • Uniform register file (allocated per warp)
    • Fixed latency math pipelines (fma, fmaheavy, fmalite, fp64, alu, mma (tensor), uniform)
    • Variable/long latency math pipelines (xu/mufu/sfu)
    • Dispatch port for variable latency math and memory instructions.
    • Immediate constant cache
  • MIO/L1TEX
    • Instructions are enqueued from the SM sub-partitions into instruction queues, and the MIO dispatches them to shared variable-latency units such as:
      • fp64 pipeline on consumer cards
      • idc - indexed constant cache
      • lsu - load store unit (global, local, shared, dsmem)
      • texture units (tex, surface)
      • tensor memory accelerator - dma controller

On each cycle the SM sub-partition warp schedulers independently select an active, non-stalled warp to issue an instruction. The instruction can be dispatched to one of the fixed-latency math pipes or, after collecting operands, enqueued into an instruction queue in the MIO. Some pipelines such as fma* and alu are not 32 lanes wide, so a warp may be selected and dispatched over multiple cycles.

On each cycle the MIO unit selects instructions from various shared instruction queues including LSU, TEX, branch unit, and shared math units and dispatches the instruction if the pipeline is available.

A given SM warp scheduler can select an FFMA instruction from any warp and issue it to the fma* pipe. On the next cycle the scheduler can select an HMMA instruction and issue it to the MMA pipe. Dispatch can take multiple cycles. Each of these pipelines is many cycles deep, so these instructions are “in flight” or “executing” on the same cycle.
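To make that concrete, here is an illustrative sketch (kernel name and sizes are arbitrary; assumes sm_70 or newer, launched with 64 threads per block): warp 0 produces a stream of FFMA instructions while warp 1 produces HMMA instructions, and the sub-partition schedulers are free to pick from either warp on any given cycle.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Launch with 64 threads (2 warps) per block.
// Warp 0 runs a chain of FP32 FMAs (FFMA in SASS); warp 1 runs a 16x16x16
// tensor-core multiply (HMMA in SASS).
__global__ void mixed_pipes(const __half *A, const __half *B, float *C,
                            float *out, float x)
{
    int warp = threadIdx.x / 32;

    if (warp == 0) {
        float acc = x;
        #pragma unroll
        for (int i = 0; i < 64; ++i)
            acc = fmaf(acc, 1.0009765625f, 0.5f);   // FFMA chain (FP32 pipe)
        out[threadIdx.x] = acc;
    } else {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
        wmma::fill_fragment(acc, 0.0f);
        wmma::load_matrix_sync(a, A, 16);
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);             // HMMA (MMA pipe)
        wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    }
}
```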

The maximum sustained instruction issue rate on Volta through Hopper is 1 instruction per cycle per SM sub-partition. The number of instructions in the pipelines can be extremely high due to the depth of some of the pipelines and the number of outstanding memory operations via LSU/TEX to the L1TEX unit.

Can you interleave FP16x2 and MMA?
On Volta through GA100 (not GA10x) the FP16x2 math pipeline shares the same dispatch port as the MMA pipelines, so there can be contention. Whether it uses the exact same gates or just the wires to read and write the registers is irrelevant to the developer, and it changes often.
On GA10x and later, with 2x FP32, the fma pipe was split into the fmaheavy pipe (fp32 + imad + a few other instructions) and the fmalite pipe (fp32). FP16x2 vectors are split and simultaneously sent down both fma pipes.
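For reference, this is what FP16x2 work looks like at the source level (a minimal sketch): the __hfma2 intrinsic operates on packed half2 values and compiles to HFMA2, not to an HMMA tensor-core instruction; which pipe executes it differs by architecture as described above.

```cuda
#include <cuda_fp16.h>

// Each __hfma2 performs two half-precision FMAs on a packed __half2
// (HFMA2 in SASS): out = a * b + c, elementwise.
__global__ void hfma2_demo(const __half2 *a, const __half2 *b,
                           const __half2 *c, __half2 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = __hfma2(a[i], b[i], c[i]);
}
```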

Does this help you as a programmer?
Not really. If you know the instruction issue rate, the only reasons to care about the pipelines are either (a) specific instructions can only be issued to one pipe (e.g. IMAD goes down fmaheavy), or (b) you are using the profiler and, when pipe utilization is high, need to understand which instructions are consuming that pipeline.

Can you improve performance by interleaving MMA and FP16x2?
Potentially, but in my opinion even if you have full access to the assembler and know every detail, it is really hard to optimize this. The better method is to use multiple warps and allow the warp scheduler to interleave instructions to available pipes. CUTLASS warp specialization is a technique that can achieve speed of light for a pipeline by minimizing the number of warps using a pipe, so that the pipe can be driven at maximum speed without runtime conflicts such as register file arbitration conflicts (stall_dispatch). On Volta through Ada the MMA unit's input and output is the register file. A warp has to (a) load data into the register file, (b) issue MMAs, (c) process the result, and (d) write data out. Speed of light is often best achieved by multi-buffering this pipeline across warps such that no two warps are in the same pipeline at the same time. Alternatively, using shared memory barriers and the asynchronous pipeline, warps can be assigned specific roles: load data to shared, issue MMA instructions, process the results, etc.

If you are trying to get the highest performance out of an NVIDIA GPU, especially for tensor operations, I recommend reviewing CUTLASS. In most cases, if an accelerated library exists for your domain, use the library unless you find a performance issue or an edge case that is not well supported.
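As one sketch of the library route (assuming cuBLAS 11 or newer; the size, file name in the build comment, and the lack of data initialization and error checking are only for brevity), a mixed-precision GEMM through cublasGemmEx lets the library choose a tensor-core kernel without any hand-written MMA code:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

// C = alpha * A * B + beta * C with FP16 inputs and FP32 accumulation.
// cuBLAS picks a tensor-core implementation internally where available.
// Build with: nvcc gemm_ex.cu -lcublas
int main()
{
    const int n = 1024;
    __half *A, *B; float *C;
    cudaMalloc((void **)&A, n * n * sizeof(__half));
    cudaMalloc((void **)&B, n * n * sizeof(__half));
    cudaMalloc((void **)&C, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                     n, n, n, &alpha,
                                     A, CUDA_R_16F, n,
                                     B, CUDA_R_16F, n, &beta,
                                     C, CUDA_R_32F, n,
                                     CUBLAS_COMPUTE_32F,     // FP32 accumulation
                                     CUBLAS_GEMM_DEFAULT);
    printf("cublasGemmEx status: %d\n", (int)st);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```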

Do NVIDIA GPUs support vector operations?
Yes. The list of hardware vector instructions changes between architectures and between chips within an architecture. The CUDA compiler and PTX support vector abstractions that may map to a single SASS instruction or may be emulated as a sequence of instructions. SASS vector instructions start with V (e.g. VIADD) or end with the vector size (e.g. HFMA2 = 2xFP16 FMA).
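As a small example of that abstraction (a minimal sketch): the SIMD-in-a-word intrinsics such as __vadd4 operate on four packed 8-bit lanes of a 32-bit register, and depending on the target architecture they compile to a single SASS vector instruction or to a short emulated sequence.

```cuda
// __vadd4 adds four packed unsigned 8-bit lanes per 32-bit word. Whether this
// becomes one hardware vector instruction or an emulated sequence depends on
// the target SM architecture.
__global__ void vadd4_demo(const unsigned int *a, const unsigned int *b,
                           unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __vadd4(a[i], b[i]);
}
```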

Learning the Microarchitecture
If you want to learn the details of the microarchitecture there are a few really good methods:

  1. Write CUDA code and review the PTX and SASS. This can be done in godbolt (see the sketch after this list). Once you have written the code you can manipulate the compiler settings to compile for different compute and SM architectures to see how architecture changes (e.g. 2xFP32 in GA10x) impact the instruction schedule.
  2. Read the PTX manual.
  3. Write small 5-10 line microbenchmark kernels and look at the SASS and profiler output.
  4. Read professional and academic papers that breakdown the GPU.
  5. Single-step optimized SASS code (no -G) in the debugger. This is a great way to learn SASS and learn concepts such as lane predication, divergent branching, barriers, etc.
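For item 1, a possible workflow looks like the sketch below (the file name toy.cu and the sm_86 target are placeholders; any small kernel works):

```cuda
// toy.cu -- a small kernel for inspecting the generated PTX/SASS.
// Possible commands (file name and architecture are placeholders):
//   nvcc -arch=sm_86 -cubin toy.cu -o toy.cubin
//   cuobjdump -sass toy.cubin      # disassemble the SASS
//   nvcc -arch=sm_86 -ptx toy.cu   # emit PTX for comparison
#include <cuda_fp16.h>

__global__ void toy(const __half2 *a, const __half2 *b, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 p = __hmul2(a[i], b[i]);             // HMUL2 (fp16x2 work)
    out[i] = __low2float(p) + __high2float(p);   // conversions + FADD
}
```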