Instruction scheduling in Ampere

Hi
In the Volta tuning guide it says the schedulers can issue independent instructions every cycle. I didn't find a similar section on instruction scheduling in the Ampere tuning guide. So maybe it is still 1 instruction issued per cycle, but I want to be sure. Is there any confirmation of that? I ask because Ampere has two separate pipelines, which is different from previous generations. So, is it 1 instruction issued per cycle or 2?


Based on the SM diagrams in the Ampere whitepapers, I would expect Ampere to be able to issue up to 4 instructions per clock per SM.

Looks the same as Volta: Programming Guide :: CUDA Toolkit Documentation

@rs277
In that link it is stated that at every instruction issue time, each scheduler issues one instruction for one of its assigned warps.
Since there are 4 schedulers, it will issue 4 instructions as Robert said.
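Putting rough numbers on that (a back-of-the-envelope sketch using only the scheduler count quoted from the Programming Guide above; not an official formula):

```python
# Back-of-the-envelope peak issue rate for a Volta/Ampere-style SM.
# Assumes 4 schedulers per SM, each issuing at most 1 instruction per
# cycle for one of its ready warps, per the Programming Guide quote.
schedulers_per_sm = 4
instructions_per_scheduler_per_cycle = 1

issue_rate_per_sm = schedulers_per_sm * instructions_per_scheduler_per_cycle
print(issue_rate_per_sm)  # 4 instructions per clock per SM (peak)
```

This is a peak figure; in practice a scheduler issues nothing on a cycle when none of its assigned warps has a ready instruction.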

@mahmood.nt
Yes, the same as for Volta, mentioned in the preceding Programming Guide entry: Programming Guide :: CUDA Toolkit Documentation

@Robert_Crovella
With respect to the two datapaths in each partition in Ampere, does that mean each partition can issue up to 2 instructions (one INT and one FP, or two FP) per cycle? It says:

GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock.

I’m not aware that the details of this are fully exposed. The GA10X architecture (cc8.6) has 128 FP32 cores per SM, whereas the GA100 architecture (cc8.0) has 64 FP32 cores per SM. This dual datapath architecture was introduced in the Volta/Turing generation. I think this statement from the reference rs277 gave you is trustworthy:

"4 warp schedulers.

An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any."

Yes, I realize that doesn’t provide a complete description of how the SM works, exactly. Please see my statement here which governs how I respond to some questions.

Looking at this excerpt from the Ampere Whitepaper (note GA10X, i.e. compute capability 8.6):

“2x FP32 Throughput
In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.”

my take is that both Volta/Turing and Ampere 8.6 can execute 128 ops/clk in total per SM. However, if your workload is primarily FP32, Ampere 8.6 will potentially execute more FP32 ops/clk, since each partition has FP32 on both datapaths, unlike Volta/Turing, which has FP32 on only one.
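The whitepaper numbers can be sanity-checked with a little arithmetic (all figures taken from the quoted passage; the variable names are just my shorthand):

```python
# Peak per-SM rates implied by the quoted GA10x whitepaper text.
partitions_per_sm = 4
fp32_only_path = 16   # datapath with 16 FP32 CUDA cores: 16 FP32 ops/clk
shared_path = 16      # datapath with 16 FP32 + 16 INT32 cores:
                      # 16 FP32 OR 16 INT32 ops/clk

# GA10x (cc 8.6): both datapaths can execute FP32
ga10x_fp32_peak = partitions_per_sm * (fp32_only_path + shared_path)

# Turing: only one of the two datapaths can execute FP32
turing_fp32_peak = partitions_per_sm * fp32_only_path

# Mixed workload on GA10x: 16 FP32 + 16 INT32 per partition
ga10x_mixed = (partitions_per_sm * fp32_only_path,   # FP32 ops/clk
               partitions_per_sm * shared_path)      # INT32 ops/clk

print(ga10x_fp32_peak)   # 128 FP32 ops/clk, matching the whitepaper
print(turing_fp32_peak)  # 64 FP32 ops/clk
print(ga10x_mixed)       # (64, 64): 64 FP32 + 64 INT32 ops/clk
```

The 128 FP32 ops/clk figure is exactly double Turing's 64, which is where the "2x FP32 Throughput" headline comes from.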
