Dual-issue and other timing behavior of the Kepler warp scheduler?

I now have a kernel that is dominated by ALU instructions, but its minimum overall latency is several times longer than the instruction count alone would suggest. I’m on a Kepler device. On later architectures the latencies will still exceed the instruction dispatch rate.

On the Kepler SMX, as I understand it, each warp is assigned to one warp scheduler, and each scheduler can hold a certain number of warps from which to choose the next instruction(s) to issue.

Of a pair of warp schedulers, only one can issue two instructions in a given cycle. I assume the schedulers have some method of deciding which one is the winner each cycle. Each scheduler also has to know whether dual issue is allowed.

There is a GitHub project at https://github.com/PAA-NCIC/PPoPP2017_artifact, containing a Kepler binary <-> modified SASS file program, and a paper “sgemm.pdf” describing the program. It describes 8 control bits for each binary ISA instruction, which apparently tell the dispatcher how soon it may issue the instruction after the prior instruction. I gather that a stall time of 0 allows the dispatcher to issue the instruction at the same time as the previous instruction. If anyone can supply the exact encoding of these 8 bits, it would be useful.

Specific questions about this:
(1) If both warps allow dual issue, will one warp always issue two instructions and the other one instruction?
(2) Can the last instruction of a group of 7 be dual issued with the first instruction of the next group?
(3) If either of two instructions is a non-ALU instruction (e.g. load/store), and the other warp has two ALU instructions, will all four of these instructions be issued?

I will have other questions regarding memory latency and throughput, but these will wait until I can get my kernel running at 100% ALU usage without involving any memory instructions.

AFAIK Kepler has a theoretical max issue rate of 8 instructions per clock per SM. This requires dual issue of course. It’s also dependent on the exact instructions, pipe availability, and other factors. Yes, it can only issue a max of 6 SP instructions in a given clock cycle (it can only issue SP for 6 warps total in a given clock cycle).
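To make the arithmetic behind those two numbers explicit, here is a minimal sketch. The per-SMX figures (4 warp schedulers, 2 dispatch units each, 192 single-precision cores) are taken from NVIDIA's published Kepler specifications, not from anything measured in this thread, and the variable names are just for illustration.

```python
# Back-of-the-envelope issue-rate arithmetic for one Kepler SMX.
# Assumed figures (published Kepler SMX specs, not measurements):
WARP_SIZE = 32
SCHEDULERS_PER_SMX = 4       # warp schedulers
DISPATCH_PER_SCHEDULER = 2   # dual issue: two dispatch units per scheduler
SP_CORES_PER_SMX = 192       # single-precision cores

# Theoretical peak: every scheduler dual-issues every clock.
peak_issue = SCHEDULERS_PER_SMX * DISPATCH_PER_SCHEDULER

# Warp-wide SP instructions the cores can accept per clock.
sp_warp_instructions = SP_CORES_PER_SMX // WARP_SIZE

# With single issue only, at most one instruction per scheduler per clock,
# so this many SP slots can only be filled through dual issue.
sp_slots_needing_dual_issue = sp_warp_instructions - SCHEDULERS_PER_SMX

print(peak_issue, sp_warp_instructions, sp_slots_needing_dual_issue)
```

So even a pure SP workload needs some dual issue on every clock to keep all the SP cores busy, which is why the control flags matter so much on this architecture.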

The most important obstacle on Kepler preventing dual-issue is register bandwidth starvation.

Registers on Kepler are banked using a partitioning scheme that is less-than-obvious and this has to be taken into account to achieve maximum or even decent throughput.

An example:

I’ve got kernel code which contains many repetitions of the same 6 instructions, each depending on the one before. Here’s an excerpt from the disassembled SASS file (via KeplerAs, which adds the stuff on the left). I added blank lines to separate groups of 7 instructions occupying 64 bytes of instruction memory along with the control flags.

-:-:D:-:04      SHR.U32 R8, R14, 0xa;
-:-:-:-:08      LOP.XOR R15, R15, R0;
-:-:-:-:08      LOP.XOR R15, R15, R8;
-:-:D:-:04      SHF.R.W R0, R15, 0x11, R15; (The next set of 6 starts here)
-:-:-:-:00      ST.E [R2+0x7c], R15;
-:-:-:-:03      SHF.R.W R17, R15, 0x13, R15;
-:-:-:-:04      SHR.U32 R8, R15, 0xa;

-:-:-:-:08      LOP.XOR R17, R17, R0;
-:-:-:-:08      LOP.XOR R17, R17, R8;
-:-:D:-:04      SHF.R.W R0, R17, 0x11, R17;
-:-:-:-:00      ST.E [R2+0x84], R17;
-:-:-:-:03      SHF.R.W R19, R17, 0x13, R17;
-:-:-:-:04      SHR.U32 R8, R17, 0xa;
-:-:-:-:08      LOP.XOR R19, R19, R0;

-:-:-:-:08      LOP.XOR R19, R19, R8;
-:-:D:-:04      SHF.R.W R0, R19, 0x11, R19;
-:-:-:-:00      ST.E [R2+0x8c], R19;
-:-:-:-:08      SHF.R.W R20, R19, 0x13, R19;
-:-:D:-:04      SHR.U32 R8, R19, 0xa;
-:-:-:-:08      LOP.XOR R20, R20, R0;
-:-:-:-:08      LOP.XOR R20, R20, R8;

-:-:D:-:04      SHF.R.W R0, R20, 0x11, R20;
-:-:-:-:00      ST.E [R2+0x94], R20;
-:-:-:-:08      SHF.R.W R21, R20, 0x13, R20;
-:-:D:-:04      SHR.U32 R8, R20, 0xa;
-:-:-:-:08      LOP.XOR R21, R21, R0;
-:-:-:-:08      LOP.XOR R21, R21, R8;
-:-:-:-:00      SHF.R.W R0, R21, 0x11, R21;

-:-:D:-:04      SHF.R.W R22, R21, 0x13, R21;
-:-:-:-:08      ST.E [R2+0x9c], R21;
-:-:D:-:04      LOP.XOR R22, R22, R0;
-:-:-:-:08      SHR.U32 R0, R21, 0xa;
-:-:-:-:08      LOP.XOR R22, R22, R0;
-:-:D:-:04      SHF.R.W R0, R22, 0x11, R22;
-:-:-:-:00      ST.E [R2+0xa4], R22;

-:-:-:-:08      SHF.R.W R6, R22, 0x13, R22;
-:-:-:-:00      LOP.XOR R6, R6, R0;
-:-:-:-:08      SHR.U32 R0, R22, 0xa;
-:-:-:-:08      LOP.XOR R6, R6, R0;
-:-:D:-:04      SHF.R.W R0, R6, 0x11, R6;
-:-:-:-:00      ST.E [R2+0xac], R6;
-:-:-:-:08      SHF.R.W R7, R6, 0x13, R6;

The D’s appear where the 0x20 bit of the control information is 0 (!). I don’t see any reason for them being where they are, and they are different for different sets of 6 instructions. However, the pattern repeats after the 42 instructions shown above.
The two-digit numbers are the stall field. Their arrangement is also irregular but repeats in groups of 42, except for lines 6 and 13 of the listing, which have a value of 03 that does not appear anywhere else in the entire kernel. The stores always have 00. Other instructions have 00 when they are independent of the following instruction. That suggests that the stall number is the number of cycles to delay after dispatching the instruction.
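For anyone who wants to tabulate these fields over a whole kernel, here is a rough sketch that splits the prefix printed in the listing into a dual-issue flag and a stall value. The helper name parse_line is hypothetical (not part of KeplerAs), the field meanings follow the reading given in this post, and the stall value is assumed to be printed in decimal.

```python
from collections import Counter, defaultdict

# First group of 7 instructions from the listing above.
SAMPLE = """\
-:-:D:-:04      SHR.U32 R8, R14, 0xa;
-:-:-:-:08      LOP.XOR R15, R15, R0;
-:-:-:-:08      LOP.XOR R15, R15, R8;
-:-:D:-:04      SHF.R.W R0, R15, 0x11, R15;
-:-:-:-:00      ST.E [R2+0x7c], R15;
-:-:-:-:03      SHF.R.W R17, R15, 0x13, R15;
-:-:-:-:04      SHR.U32 R8, R15, 0xa;
"""

def parse_line(line):
    """Split one listing line into control fields and the instruction text.

    Assumption: the prefix has five colon-separated fields, the third is
    'D' when the dual-issue flag is shown, and the fifth is the stall value.
    """
    prefix, rest = line.split(None, 1)
    fields = prefix.split(":")
    opcode, _, operands = rest.partition(" ")
    return {
        "dual": fields[2] == "D",
        "stall": int(fields[4]),  # assumed decimal
        "opcode": opcode,
        "operands": operands.split(";")[0].strip(),
    }

parsed = [parse_line(l) for l in SAMPLE.splitlines() if l.strip()]

# Histogram of stall values, keyed by (opcode, dual flag).
histogram = defaultdict(Counter)
for inst in parsed:
    histogram[(inst["opcode"], inst["dual"])][inst["stall"]] += 1

for (opcode, dual), counts in sorted(histogram.items()):
    print(f"{opcode:8s} dual={dual!s:5s} {dict(counts)}")
```

Keeping the dual flag in the key separates the D:04 lines from ordinary 04 stalls, so the two cases are not mixed when looking for a pattern.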

Anyone want to figure this out or add an explanation??

From (my) memory, the stall count is indeed the number of cycles to wait until the next instruction can execute. (This avoids the need to load and decode instructions that are not yet ready to run just to determine how long they need to wait.) My memory might not always be the best reference.

In order to determine the stall counts you need very detailed information about GPU internals, none of which are documented. But from a quick glance you might notice that the longest stalls (8 cycles) tend to occur when the output is immediately used in the next instruction.
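That observation can be checked mechanically. The sketch below is only a heuristic under stated assumptions: the first register operand is treated as the destination, stores have no destination, and "dependent" means the next instruction reads that destination. The function name decode is made up for this example, and only the first group of 7 instructions from the listing is used.

```python
import re

# First group of 7 instructions from the listing in the question.
LISTING = """\
-:-:D:-:04      SHR.U32 R8, R14, 0xa;
-:-:-:-:08      LOP.XOR R15, R15, R0;
-:-:-:-:08      LOP.XOR R15, R15, R8;
-:-:D:-:04      SHF.R.W R0, R15, 0x11, R15;
-:-:-:-:00      ST.E [R2+0x7c], R15;
-:-:-:-:03      SHF.R.W R17, R15, 0x13, R15;
-:-:-:-:04      SHR.U32 R8, R15, 0xa;
"""

REG = re.compile(r"\bR\d+\b")

def decode(line):
    """Return (opcode, stall, destination register or None, source registers)."""
    prefix, rest = line.split(None, 1)
    stall = int(prefix.split(":")[4])  # assumed decimal
    opcode, _, operands = rest.partition(" ")
    regs = REG.findall(operands.split(";")[0])
    if opcode.startswith("ST"):
        # Assumption: a store only reads registers.
        return opcode, stall, None, set(regs)
    return opcode, stall, regs[0], set(regs[1:])

insts = [decode(l) for l in LISTING.splitlines() if l.strip()]

dependent, independent = [], []
for (op, stall, dest, _), (_, _, _, next_srcs) in zip(insts, insts[1:]):
    # Does the following instruction read what this one writes?
    if dest is not None and dest in next_srcs:
        dependent.append(stall)
    else:
        independent.append(stall)

print("stalls before a dependent instruction:  ", dependent)
print("stalls before an independent instruction:", independent)
```

On this group the two dependent pairs both carry 08 and the rest are shorter, but later groups in the listing also contain 08 on instructions whose result is not read by the very next one, so treat this as a tendency rather than a rule.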

One question answered: the -:-:D:-:04 flags don’t have anything to do with a 4-cycle stall. It’s merely the Kepler ISA encoding for “dual issue”. This is stated in the sgemm.pdf file cited in the OP above.