Instruction scheduling in Ampere

Hi
In the Volta tuning guide it says the schedulers can issue independent instructions every cycle. I didn't find a similar section on instruction scheduling in the Ampere tuning guide. So maybe it is still 1 instruction issued per cycle, but I want to be sure. Is there any confirmation of that? I ask because Ampere has two separate pipelines, which is different from previous generations. So, is it 1 instruction issued per cycle or 2?


Based on the SM diagrams in the Ampere whitepapers, I would expect Ampere to be able to issue up to 4 instructions per clock per SM.

Looks the same as Volta: Programming Guide :: CUDA Toolkit Documentation

@rs277
In that link it is stated that at every instruction issue time, each scheduler issues one instruction for one of its assigned warps.
Since there are 4 schedulers, it will issue 4 instructions as Robert said.
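Putting rough numbers on that (a back-of-the-envelope sketch using only the scheduler count quoted from the Programming Guide above; not an official formula):

```python
# Back-of-the-envelope peak issue rate for a Volta/Ampere-style SM.
# Assumes 4 schedulers per SM, each issuing at most 1 instruction per
# cycle for one of its ready warps, per the Programming Guide quote.
schedulers_per_sm = 4
instructions_per_scheduler_per_cycle = 1

issue_rate_per_sm = schedulers_per_sm * instructions_per_scheduler_per_cycle
print(issue_rate_per_sm)  # 4 instructions per clock per SM (peak)
```

This is a peak figure; in practice a scheduler issues nothing on a cycle when none of its assigned warps has a ready instruction.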

@mahmood.nt
Yes, the same as for Volta, mentioned in the preceding Programming Guide entry: Programming Guide :: CUDA Toolkit Documentation

@Robert_Crovella
With respect to the two datapaths in each partition in Ampere, does that mean each partition can issue up to 2 instructions (one INT and one FP, or two FP) per cycle? It says:

GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock.

I’m not aware that the details of this are fully exposed. The GA10X architecture (cc8.6) has 128 FP32 cores per SM, whereas the GA100 architecture (cc8.0) has 64 FP32 cores per SM. This dual datapath architecture was introduced in the Volta/Turing generation. I think this statement from the reference rs277 gave you is trustworthy:

"4 warp schedulers.

An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any."

Yes, I realize that doesn’t provide a complete description of how the SM works, exactly. Please see my statement here which governs how I respond to some questions.

Looking at this excerpt from the Ampere Whitepaper (note GA10X, i.e. compute capability 8.6):

“2x FP32 Throughput
In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.”

my take is that both Volta/Turing and Ampere 8.6 can execute 128 ops/clk in total per SM. However, if your workload is primarily FP32, Ampere 8.6 will potentially execute more FP32 ops/clk, since each partition has FP32 on both datapaths, unlike Volta/Turing, which has FP32 on only one.
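The whitepaper numbers can be sanity-checked with a little arithmetic (all figures taken from the quoted passage; the variable names are just my shorthand):

```python
# Peak per-SM rates implied by the quoted GA10x whitepaper text.
partitions_per_sm = 4
fp32_only_path = 16   # datapath with 16 FP32 CUDA cores: 16 FP32 ops/clk
shared_path = 16      # datapath with 16 FP32 + 16 INT32 cores:
                      # 16 FP32 OR 16 INT32 ops/clk

# GA10x (cc 8.6): both datapaths can execute FP32
ga10x_fp32_peak = partitions_per_sm * (fp32_only_path + shared_path)

# Turing: only one of the two datapaths can execute FP32
turing_fp32_peak = partitions_per_sm * fp32_only_path

# Mixed workload on GA10x: 16 FP32 + 16 INT32 per partition
ga10x_mixed = (partitions_per_sm * fp32_only_path,   # FP32 ops/clk
               partitions_per_sm * shared_path)      # INT32 ops/clk

print(ga10x_fp32_peak)   # 128 FP32 ops/clk, matching the whitepaper
print(turing_fp32_peak)  # 64 FP32 ops/clk
print(ga10x_mixed)       # (64, 64): 64 FP32 + 64 INT32 ops/clk
```

The 128 FP32 ops/clk figure is exactly double Turing's 64, which is where the "2x FP32 Throughput" headline comes from.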
