Instruction scheduling in Ampere

As far as I know, the details of this are not fully documented publicly. The GA10x architecture (cc8.6) has 128 FP32 cores per SM, whereas the GA100 architecture (cc8.0) has 64 FP32 cores per SM. The dual datapath design (separate FP32 and INT32 datapaths) was introduced in the Volta/Turing generation; as I understand it, in GA10x the second datapath can also execute FP32 operations, which accounts for the doubling. I think this statement from the reference rs277 gave you is trustworthy:

" * 4 warp schedulers.

An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any."

Yes, I realize that doesn’t provide a complete description of how the SM works, exactly. Please see my statement here which governs how I respond to some questions.
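
To make the quoted statement concrete, here is a toy model in Python. It is only a sketch, not a description of the hardware: the function names `assign_warps` and `issue_cycle` are made up for illustration, the modulo (round-robin) assignment of warps to schedulers is an assumption, and so is the "first ready warp wins" selection policy. The only parts taken from the quote are the 4 schedulers, the static assignment, and the limit of one instruction per scheduler per issue slot.

```python
NUM_SCHEDULERS = 4  # per SM, taken from the quoted statement


def assign_warps(num_warps, num_schedulers=NUM_SCHEDULERS):
    """Statically distribute warp IDs among the schedulers.

    ASSUMPTION: warp_id % num_schedulers is used here for illustration;
    the actual hardware assignment policy is not documented.
    """
    schedulers = [[] for _ in range(num_schedulers)]
    for warp_id in range(num_warps):
        schedulers[warp_id % num_schedulers].append(warp_id)
    return schedulers


def issue_cycle(schedulers, ready):
    """Model one issue slot.

    Each scheduler issues for at most one of its own warps that is ready.
    ASSUMPTION: the first ready warp in the list is picked; the real
    selection policy is not documented.
    Returns one entry per scheduler: a warp ID, or None if nothing issued.
    """
    issued = []
    for warps in schedulers:
        pick = next((w for w in warps if w in ready), None)
        issued.append(pick)
    return issued


if __name__ == "__main__":
    scheds = assign_warps(8)
    print(scheds)
    # Warps 1, 4 and 5 are ready; a scheduler never issues for a warp
    # that belongs to another scheduler, so two schedulers sit idle.
    print(issue_cycle(scheds, ready={1, 4, 5}))
```

The point of the model is that a scheduler with no ready warps of its own cannot borrow a ready warp from another scheduler, so how warps are distributed affects how many issue slots get used.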
