Dual-issue and other timing behavior of the Kepler warp scheduler?

michaelrrolle45 · July 22, 2019, 12:37am

I now have a kernel that is dominated by ALU instructions, but its minimum overall latency, in cycles, is several times the number of instructions. I’m on a Kepler device. On later architectures, instruction latency will still be longer than the interval between instruction dispatches.

On the Kepler SMX, as I understand it, each warp is assigned to one warp scheduler, and each scheduler can hold a certain number of warps from which to choose the next instruction(s) to issue.

Of each pair of warp schedulers, only one can issue two instructions in a given cycle. I assume the schedulers have some method of deciding which one is the winner each cycle. Each scheduler also has to know whether dual issue is allowed.

There is a GitHub project at [url]https://github.com/PAA-NCIC/PPoPP2017_artifact[/url] containing a program that converts between Kepler binaries and a modified SASS text format, along with a paper, “sgemm.pdf”, describing the program. The paper describes 8 control bits for each binary ISA instruction, which apparently tell the dispatcher how soon it may issue the instruction after the prior instruction. I gather that a stall time of 0 allows the dispatcher to issue the instruction at the same time as the previous instruction. If anyone can supply the exact encoding of these 8 bits, it would be useful.

Specific questions about this:
(1) If both warps allow dual issue, will one warp always issue two instructions and the other one instruction?
(2) Can the last instruction of a group of 7 be dual issued with the first instruction of the next group?
(3) If either of two instructions is a non-ALU (i.e. load/store, etc.), and the other warp has two ALU instructions, will all four of these instructions be issued?

I will have other questions regarding memory latency and throughput, but these will wait until I can get my kernel running at 100% ALU usage without involving any memory instructions.

Robert_Crovella · July 22, 2019, 2:50am

AFAIK Kepler has a theoretical max issue rate of 8 instructions per clock per SM. This requires dual issue of course. It’s also dependent on the exact instructions, pipe availability, and other factors. Yes, it can only issue a max of 6 SP instructions in a given clock cycle (it can only issue SP for 6 warps total in a given clock cycle).
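For reference, the arithmetic behind those two limits can be sketched as follows. This assumes the published Kepler SMX figures (4 warp schedulers with 2 dispatch units each, 192 single-precision cores, 32 threads per warp); the variable names are only for illustration.

```python
# Assumed Kepler SMX figures (from NVIDIA's published architecture material,
# not from this thread).
WARP_SCHEDULERS = 4
DISPATCH_UNITS_PER_SCHEDULER = 2
SP_CORES = 192
WARP_SIZE = 32

# Peak issue rate: every scheduler dual-issues every clock.
max_issue_per_clock = WARP_SCHEDULERS * DISPATCH_UNITS_PER_SCHEDULER

# Each warp-wide SP instruction occupies 32 cores for one clock.
max_sp_warp_instructions_per_clock = SP_CORES // WARP_SIZE

# SP instructions each scheduler must issue per clock to fill all SP cores.
sp_per_scheduler = max_sp_warp_instructions_per_clock / WARP_SCHEDULERS

print(max_issue_per_clock, max_sp_warp_instructions_per_clock, sp_per_scheduler)
```

Since each scheduler would need to issue more than one SP instruction per clock on average, single issue alone cannot keep all the SP cores busy.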

tera · July 22, 2019, 8:35pm

The most important obstacle on Kepler preventing dual-issue is register bandwidth starvation.

Registers on Kepler are banked using a partitioning scheme that is less-than-obvious and this has to be taken into account to achieve maximum or even decent throughput.

michaelrrolle45 · July 25, 2019, 2:59am

An example:

I’ve got kernel code which contains many repetitions of the same 6 instructions, each depending on the one before. Here’s an excerpt from the disassembled SASS file (via KeplerAs, which adds the stuff on the left). I added blank lines to separate groups of 7 instructions occupying 64 bytes of instruction memory along with the control flags.

-:-:D:-:04      SHR.U32 R8, R14, 0xa;
-:-:-:-:08      LOP.XOR R15, R15, R0;
-:-:-:-:08      LOP.XOR R15, R15, R8;
-:-:D:-:04      SHF.R.W R0, R15, 0x11, R15; (The next set of 6 starts here)
-:-:-:-:00      ST.E [R2+0x7c], R15;
-:-:-:-:03      SHF.R.W R17, R15, 0x13, R15;
-:-:-:-:04      SHR.U32 R8, R15, 0xa;

-:-:-:-:08      LOP.XOR R17, R17, R0;
-:-:-:-:08      LOP.XOR R17, R17, R8;
-:-:D:-:04      SHF.R.W R0, R17, 0x11, R17;
-:-:-:-:00      ST.E [R2+0x84], R17;
-:-:-:-:03      SHF.R.W R19, R17, 0x13, R17;
-:-:-:-:04      SHR.U32 R8, R17, 0xa;
-:-:-:-:08      LOP.XOR R19, R19, R0;

-:-:-:-:08      LOP.XOR R19, R19, R8;
-:-:D:-:04      SHF.R.W R0, R19, 0x11, R19;
-:-:-:-:00      ST.E [R2+0x8c], R19;
-:-:-:-:08      SHF.R.W R20, R19, 0x13, R19;
-:-:D:-:04      SHR.U32 R8, R19, 0xa;
-:-:-:-:08      LOP.XOR R20, R20, R0;
-:-:-:-:08      LOP.XOR R20, R20, R8;

-:-:D:-:04      SHF.R.W R0, R20, 0x11, R20;
-:-:-:-:00      ST.E [R2+0x94], R20;
-:-:-:-:08      SHF.R.W R21, R20, 0x13, R20;
-:-:D:-:04      SHR.U32 R8, R20, 0xa;
-:-:-:-:08      LOP.XOR R21, R21, R0;
-:-:-:-:08      LOP.XOR R21, R21, R8;
-:-:-:-:00      SHF.R.W R0, R21, 0x11, R21;

-:-:D:-:04      SHF.R.W R22, R21, 0x13, R21;
-:-:-:-:08      ST.E [R2+0x9c], R21;
-:-:D:-:04      LOP.XOR R22, R22, R0;
-:-:-:-:08      SHR.U32 R0, R21, 0xa;
-:-:-:-:08      LOP.XOR R22, R22, R0;
-:-:D:-:04      SHF.R.W R0, R22, 0x11, R22;
-:-:-:-:00      ST.E [R2+0xa4], R22;

-:-:-:-:08      SHF.R.W R6, R22, 0x13, R22;
-:-:-:-:00      LOP.XOR R6, R6, R0;
-:-:-:-:08      SHR.U32 R0, R22, 0xa;
-:-:-:-:08      LOP.XOR R6, R6, R0;
-:-:D:-:04      SHF.R.W R0, R6, 0x11, R6;
-:-:-:-:00      ST.E [R2+0xac], R6;
-:-:-:-:08      SHF.R.W R7, R6, 0x13, R6;

The D’s appear where the 0x20 bit of the control information is 0 (!). I don’t see any reason for them being where they are, and they differ between sets of 6 instructions. However, the pattern repeats after the 42 instructions shown above.
The two-digit numbers are the stall field. Their arrangement is also irregular but repeats in groups of 42, except for lines 6 and 13, which have a value of 03 that does not appear anywhere else in the entire kernel. The stores have 00 (with one exception in the excerpt: the store to [R2+0x9c] shows 08). Other instructions have 00 when they are independent of the following instruction. That suggests the stall number is the number of cycles to delay after dispatching the instruction.
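These observations can be checked mechanically. Below is a minimal Python sketch that tallies the flag and stall fields for the first group of seven instructions in the listing. It assumes the KeplerAs prefix has five colon-separated fields, with the D flag in the third and the stall count in the fifth; `parse_line` and the other names are made up for this example and are not part of KeplerAs.

```python
import re
from collections import Counter

# Assumed prefix layout, e.g. "-:-:D:-:04": four flag fields and a two-digit
# stall field. Only the third field (D) and the stall field are interpreted.
PREFIX = re.compile(
    r"^\s*([^:\s]+):([^:\s]+):([^:\s]+):([^:\s]+):([0-9A-Fa-f]{2})\s+(\S+)"
)

def parse_line(line):
    """Return (has_d_flag, stall, opcode) for one annotated SASS line."""
    m = PREFIX.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    has_d = m.group(3) == "D"
    # Assumption: the stall field is hexadecimal. The values in the excerpt
    # (00, 03, 04, 08) read the same in decimal.
    stall = int(m.group(5), 16)
    return has_d, stall, m.group(6)

# First group of seven instructions from the listing above.
GROUP = """\
-:-:D:-:04      SHR.U32 R8, R14, 0xa;
-:-:-:-:08      LOP.XOR R15, R15, R0;
-:-:-:-:08      LOP.XOR R15, R15, R8;
-:-:D:-:04      SHF.R.W R0, R15, 0x11, R15;
-:-:-:-:00      ST.E [R2+0x7c], R15;
-:-:-:-:03      SHF.R.W R17, R15, 0x13, R15;
-:-:-:-:04      SHR.U32 R8, R15, 0xa;
"""

parsed = [parse_line(line) for line in GROUP.splitlines()]
tally = Counter((has_d, stall) for has_d, stall, _ in parsed)
stalls_with_d = {stall for has_d, stall, _ in parsed if has_d}
store_stalls = [stall for _, stall, op in parsed if op.startswith("ST")]

for (has_d, stall), count in sorted(tally.items()):
    print(f"D={has_d!s:5} stall={stall:02d} count={count}")
```

Feeding the whole 42-instruction listing through `parse_line` instead of `GROUP` gives the same tally for the full repeating pattern.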

Anyone want to figure this out or add an explanation??

tera · July 25, 2019, 9:46pm

From (my) memory, the stall count is indeed the number of cycles to wait until the next instruction can execute (this avoids the need to load and decode instructions that are not yet ready to run, just to determine how long they need to wait). My memory might not always be the best reference, though.

In order to determine the stall counts you need very detailed information about GPU internals, none of which are documented. But from a quick glance you might notice that the longest stalls (8 cycles) tend to occur when the output is immediately used in the next instruction.
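The "longest stall when the result is used immediately" pattern can be checked with a short script. This is only a sketch in Python over the first seven instructions quoted above. It assumes the first register operand of every non-store instruction is the destination (true for the SHR/LOP/SHF forms in the excerpt, not for SASS in general), and the helper names are invented for this example.

```python
import re

# Stall field, opcode, and operand text of one KeplerAs-annotated line.
LINE = re.compile(r"^\s*\S+:([0-9A-Fa-f]{2})\s+(\S+)\s+(.*?);")
REG = re.compile(r"\bR\d+\b")

def decode(line):
    """Return (opcode, stall, dest, sources) for one annotated SASS line."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    stall = int(m.group(1), 16)  # assumption: hex field (same as decimal here)
    opcode = m.group(2)
    regs = REG.findall(m.group(3))
    if opcode.startswith("ST"):
        return opcode, stall, None, set(regs)   # stores write no register
    # Assumption: the first register operand is the destination.
    return opcode, stall, regs[0], set(regs[1:])

GROUP = """\
-:-:D:-:04      SHR.U32 R8, R14, 0xa;
-:-:-:-:08      LOP.XOR R15, R15, R0;
-:-:-:-:08      LOP.XOR R15, R15, R8;
-:-:D:-:04      SHF.R.W R0, R15, 0x11, R15;
-:-:-:-:00      ST.E [R2+0x7c], R15;
-:-:-:-:03      SHF.R.W R17, R15, 0x13, R15;
-:-:-:-:04      SHR.U32 R8, R15, 0xa;
"""

insts = [decode(line) for line in GROUP.splitlines()]

# For each instruction, does the very next one read its destination register?
rows = []
for (op, stall, dest, _), (_, _, _, next_srcs) in zip(insts, insts[1:]):
    rows.append((op, stall, dest is not None and dest in next_srcs))

for op, stall, used_next in rows:
    print(f"{op:8} stall={stall:02d} result used by next instruction: {used_next}")
```

In this group, the only two instructions whose result is read by the very next instruction are the two LOP.XORs, and both carry the 08 stall; every other instruction has a shorter stall.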

michaelrrolle45 · July 27, 2019, 11:28pm

One question answered: the -:-:D:-:04 flags don’t have anything to do with a 4-cycle stall. That combination is merely the Kepler ISA encoding for “dual issue”. This is stated in the sgemm.pdf file cited in the original post above.
