I now have a kernel that is dominated by ALU instructions, but its minimum overall latency, measured in cycles, is several times the number of instructions it executes. I’m on a Kepler device, but I expect instruction latencies on later architectures to still be longer than the interval between instruction dispatches.
On the Kepler SMX, as I understand it, each warp is assigned to one warp scheduler, and each scheduler can hold a certain number of warps from which to choose the next instruction(s) to issue.
Of a pair of warp schedulers, only one can issue two instructions in a given cycle. I assume the schedulers have some way of deciding which of them gets to dual-issue each cycle. Each scheduler also has to know whether dual issue is allowed for its next instructions.
There is a GitHub project at [url]https://github.com/PAA-NCIC/PPoPP2017_artifact[/url] that contains a program for converting between Kepler binaries and a modified SASS format, along with a paper, “sgemm.pdf”, describing it. The paper describes 8 control bits for each binary ISA instruction, which apparently tell the dispatcher how soon it may issue that instruction after the prior one. I gather that a stall time of 0 allows the dispatcher to issue the instruction at the same time as the previous instruction. If anyone can supply the exact encoding of these 8 bits, it would be useful.
Specific questions about this:
(1) If both warps allow dual issue, will one warp always issue two instructions and the other one instruction?
(2) Can the last instruction of a group of 7 be dual-issued with the first instruction of the next group?
(3) If either of a warp's two instructions is a non-ALU instruction (e.g. a load or store), and the other warp has two ALU instructions, will all four of these instructions be issued?
I will have other questions regarding memory latency and throughput, but these will wait until I can get my kernel running at 100% ALU usage without involving any memory instructions.