A Question about how Ampere/Lovelace (RTX 3000/4000, GA10X/AD10X) cards handle Warp Dispatching

So, I was looking around and wanted more insight on how the Consumer/Gaming RTX 3000/4000 cards handle warp dispatching.

This is because Turing and A100 cards can only send out a single 16-thread half-warp per cycle for FP32 (since they only have 64 FP32 cores per SM, i.e. 16 per partition), but GA102 has two blocks of cores per partition that can both execute FP32, and the GA102 whitepaper says it can execute 2x FP32 per clock.

So can GA10X Ampere dispatch and execute an FMUL, FADD, or FFMA instruction across all 32 threads of a warp within a single partition in one cycle, or can it only do 16-thread half-warps at a time like Turing or A100?

From the GA10X whitepaper:

2x FP32 Throughput
In the Turing generation, each of the four SM processing blocks (also called partitions) had two
primary datapaths, but only one of the two could process FP32 operations. The other datapath
was limited to integer operations. GA10X includes FP32 processing on both datapaths, doubling
the peak processing rate for FP32 operations. One datapath in each partition consists of 16
FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath
consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either
16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each
GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32
and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32
operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.

This is consistent with the stated throughput per SM per clock in the table in the programming guide, which indicates a throughput of 128 for cc 8.6/8.9 for 32-bit floating-point add, multiply, and multiply-add.
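As a back-of-the-envelope check (my own arithmetic, using the publicly listed RTX 3080 specs of 68 SMs and a 1.71 GHz boost clock): 68 SMs × 128 FP32 ops/clock × 2 FLOP per FMA × 1.71 GHz ≈ 29.8 TFLOPS, which matches the advertised peak FP32 figure for that card.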

As far as I understand, the numbers are throughput numbers, and to get 32 FP32 operations per clock out of an SM partition you need 2 independent FP32 instructions: one is dispatched to the FP32-only unit, the other to the combined INT32/FP32 unit. It is not possible to split a warp in such a way that half of the lanes are processed by one unit and the other half by the other unit.
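To make that concrete, here is a minimal sketch (my own example and my own names, not from the whitepaper) of what "two independent FP32 instructions" looks like from CUDA C++: two accumulators with no dependency on each other, so the compiler can emit back-to-back independent FFMAs, which the scheduler is then free to send to the two FP32-capable datapaths on consecutive issue slots.

```
// Sketch only: two independent FP32 dependency chains per thread. The compiler
// can emit the two FFMAs back to back; which datapath each one lands on
// (FP32-only vs. FP32/INT32) is the scheduler's business, not the programmer's.
__global__ void two_chains(float *out, float a, float b, int iters)
{
    float acc0 = threadIdx.x;           // chain 0
    float acc1 = threadIdx.x + 1.0f;    // chain 1, independent of chain 0
    for (int i = 0; i < iters; ++i) {
        acc0 = fmaf(acc0, a, b);        // depends only on acc0
        acc1 = fmaf(acc1, a, b);        // depends only on acc1
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc0 + acc1;
}
```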

On gaming Ampere and Lovelace, while the total level of performance remains the same, I'm curious whether they achieve their FP32 throughput by dispatching half a warp to each of the two FP32-capable datapaths (the FP32-only and the FP32/INT32 path) in the same cycle, or whether they dispatch a full warp to each FP32 datapath separately on alternating cycles. I'm mostly curious about how the dispatch unit handles this.

In scenario 2, the way I understand it is that there are two FP32-capable datapaths in an SMSP, each 16 lanes wide, so each one can execute 16 operations per clock. Now, if an FP32 instruction is issued for a warp, and that instruction covers 32 threads, a single datapath would need two cycles to execute it.

However, what I want to know is whether this can be handled by both datapaths together, taking one cycle instead of two, or whether I'm mistaken in this interpretation. Turing has one FP32-capable path per SMSP, but Ampere has two, which changes things a bit here and makes me curious.

Again to note, specifically gaming Ampere and Lovelace.

Personally, I would concur with Curefab. I suspect that the GPU cannot take a single warp/single instruction FMUL/FADD/FFMA and divide that issue between the two separate datapaths, for the same instruction/warp, in the same cycle. However, I don't know of anywhere that this is documented. All of the numbers I excerpted from the whitepaper and/or the programming guide appear to be throughput numbers. I'm also fairly convinced that, since we started having GPUs with multiple SMSPs, none of the warp schedulers in those designs are dual-issue capable, but at the moment I cannot assemble the doc links to support that claim.

In short, I don't know of any documentation to refer to that describes the detailed scheduling behavior at this level.

I see. When searching earlier I found a response you provided that sounds a bit conflicting.

In that case, regarding the ALUs performing the FP32 work (aka the "CUDA cores"): gaming Ampere and later do "technically" have 32 FP32-capable units per SMSP. However, it's not fully clear to me since this is Ampere and not Turing: Turing has only 16 of those units per SMSP (64 across the entire SM), whereas gaming Ampere and later have 128 per SM (or 32 per SMSP).

You could just try it out with micro-benchmarks comparing odd and even numbers of independent operations. The odd-numbered cases would reach full throughput only if a warp can be divided onto the two units.
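Something along those lines could look like the sketch below (all names and launch parameters are mine, not from any NVIDIA sample): a kernel templated on the number of independent FMA chains per thread, timed with CUDA events, so the times for 1, 2, 3, 4 chains can be compared. With 1 chain the kernel is latency-bound; the interesting question for this thread is whether an odd chain count lands closer to the rate of the even count below it or above it.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical micro-benchmark: CHAINS independent FMA chains per thread.
template <int CHAINS>
__global__ void fma_chains(float *out, float a, float b, int iters)
{
    float acc[CHAINS];
    #pragma unroll
    for (int c = 0; c < CHAINS; ++c)
        acc[c] = threadIdx.x + c;            // independent starting values
    for (int i = 0; i < iters; ++i) {
        #pragma unroll
        for (int c = 0; c < CHAINS; ++c)
            acc[c] = fmaf(acc[c], a, b);     // one independent FFMA per chain
    }
    float sum = 0.0f;
    #pragma unroll
    for (int c = 0; c < CHAINS; ++c)
        sum += acc[c];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;   // keep the result live
}

template <int CHAINS>
float time_kernel(float *d_out, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_chains<CHAINS><<<1024, 256>>>(d_out, 1.0001f, 0.5f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int iters = 1 << 16;
    float *d_out = nullptr;
    cudaMalloc(&d_out, 1024 * 256 * sizeof(float));
    printf("1 chain : %.3f ms\n", time_kernel<1>(d_out, iters));
    printf("2 chains: %.3f ms\n", time_kernel<2>(d_out, iters));
    printf("3 chains: %.3f ms\n", time_kernel<3>(d_out, iters));
    printf("4 chains: %.3f ms\n", time_kernel<4>(d_out, iters));
    cudaFree(d_out);
    return 0;
}
```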

I'm not sure why it sounds conflicting. The Ampere whitepaper indicates the presence of 2 datapaths for each SMSP, where each datapath has 16 FP32 units.

If we consider things at the datapath level (and I have already indicated that I suspect this is the proper view, but cannot confirm with documentation) then it should be clear that in a single clock cycle, an instruction will be issued to either the FP32 data path, or else to the combined FP32/INT32 datapath, but not both, in the same cycle, for the same warp/instruction.

Given that, the behavior is roughly consistent with the other post you linked, where I stated that if the SMSP does not have 32 units but instead has 16, then it will require 2 clocks to fully issue the instruction. So I think this datapath view has to be considered.

It is possible I am wrong. One of the reasons I don't think I am wrong is that the two datapaths have somewhat different capabilities: one is FP32 only while the other can handle either FP32 or INT32. Clearly, INT32 is a different instruction, and since I am also fairly convinced that the warp schedulers are not dual-issue capable, there is no way to get both datapaths issued in the same clock cycle if you are issuing INT32 mixed with FP32. Therefore I conclude the numbers presented are (mostly) throughput numbers. To get full throughput in the mixed case, you would have to issue alternating INT32 and FP32 instructions, I suspect. In the FP32-only case, you would still get full throughput by issuing back-to-back FP32 instructions, alternating cycle-by-cycle between the two datapaths. The only quibble would be about whether a sequence requires one extra clock due to alternation of issue vs. split issue. I find such considerations to be outside the realm of anything I care about or can describe, and I doubt they matter from an actual performance perspective.
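One rough way to probe the alternating INT32/FP32 picture (again my own sketch, not anything NVIDIA documents) would be a kernel whose inner loop carries one independent FP32 FMA chain and one independent INT32 multiply-add chain; under the alternating-issue model it should be able to keep both datapaths busy, and its rate can be compared against the pure-FP32 variants above.

```
// Sketch only: interleave an FP32 FMA chain with an INT32 multiply-add chain.
// The float chain should compile to FFMA and the int chain to integer ALU/IMAD
// work; under the alternating-issue model the scheduler can feed the FP32-only
// datapath and the FP32/INT32 datapath on alternating cycles.
__global__ void mixed_chains(float *fout, int *iout, float a, float b, int iters)
{
    float facc = threadIdx.x;        // FP32 chain
    int   iacc = threadIdx.x;        // INT32 chain, independent of the FP32 one
    for (int i = 0; i < iters; ++i) {
        facc = fmaf(facc, a, b);     // FP32 multiply-add
        iacc = iacc * 3 + 1;         // integer multiply-add
    }
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    fout[gid] = facc;                // keep both chains live
    iout[gid] = iacc;
}
```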

Anyhow, I don’t know of detailed published descriptions of scheduler behavior at this level. You’re welcome to ask questions about it, of course, but from a programmer’s perspective, in my opinion such investigation is mostly irrelevant. You don’t control detailed SASS instruction scheduling, and the machine is largely a throughput machine anyway. It is designed to give good performance even without the programmer attempting/being able to do instruction-by-instruction scheduling. NVIDIA generally does not provide tools to give the programmer this level of control.

This is making me curious again.

FMA Heavy and FMA Lite can conflict.

FMA Lite cannot be issued integer dot-product instructions.

I don't think Lite can do fmadd either. Maybe one or two more?

That leaves many instructions that can be performed on the full range of FMA ALUs.

For instructions with no conflict, I don't know if it can address one instruction across all 32 FMA units.

Instructions issued are always greater than or equal to instructions executed, but they do not start out at 2x greater.

An instruction is issued exactly once, even if there are only 16 compute units and 32 threads.
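If you want to see that on real code, Nsight Compute should let you compare the two counters directly, e.g. something like `smsp__inst_executed.sum` vs. `smsp__inst_issued.sum` for an FP32-heavy kernel; under the "issued exactly once" view the issued count stays at (or only slightly above) the executed count rather than doubling. I'm quoting those metric names from memory, so check them against `ncu --query-metrics` on your install.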


Oh, that's right. No more instruction replay from the schedulers; it's all in the memory system now, correct?

Yes, those numbers are on purpose.

Look at the Kepler generation, with 192 CUDA cores per SM and 4 schedulers, each of which could schedule up to 2 instructions per cycle. That needed very complicated logic to coordinate and handle.

Volta and all architectures afterwards, which are still basically Volta, cleaned it all up and simplified it.

Just 1 scheduler per SM Partition, just 1 new instruction per cycle.

Nevertheless, one wants to have different kinds of specialized execution units and use them to the fullest with typical instruction mixes. So the throughput of each execution unit type has to be less than 1 instruction per cycle (or less than 32 threads per cycle) to be able to fill more than one type of computation unit at a time: with 16-wide units, a warp instruction occupies its datapath for 2 cycles, leaving every other issue slot free for a different unit type.

It is on purpose that there are only 16 units of a kind per SM partition. NVIDIA calculated that it is better to build double the number of SMs than to put 32 units of a kind into each SM partition, since 16 units take half as much die area on the GPU.

In the end it is as if an SM partition can execute 2 instructions at the same time, but each at half the rate. As Robert says, the GPUs are optimized for throughput, not for how fast a single instruction stream runs through (~latency).

Even the high-end GPUs have this setup.

I feel like that's the same speed.

I don't think this question is about masking latency with concurrency on the FMAs. We already know it can do that.

I think maybe we are looking at the wrong area. The scheduler doesn't dispatch warp instructions directly to the FMAs. It's in the memory system: it sends them to the LSU/MIO, and I feel like that is where this answer would be.

Definitely not. The FMA instructions for FP32 do not go through the MIO pipe.