In paper [1], the authors measure pipeline latency on several graphics cards and report the results in Table 1.
The table shows that a register-to-register MAD (multiply-and-add) instruction has a latency of 24 cycles,
and the authors argue that the "24 cycle latency may be hidden by running simultaneously 6 warps (or 192 threads) per SM".
This matches the description in Section 5.1.2.6 of the programming guide:
"Generally, accessing a register is zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts.
The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them"
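If I understand correctly, the 192-thread figure comes from the issue rate: an SM of this generation has 8 SPs, so one 32-thread warp instruction is issued over 4 clock cycles, and hiding a 24-cycle latency therefore needs 24 / 4 = 6 warps in flight, i.e. 6 * 32 = 192 threads per SM.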
My question is: how does the scheduler dispatch warps within an SM? I can think of two methods.
Method 1: a warp occupies the SPs until a memory-access instruction is issued.
Method 2: each warp executes one instruction in turn (round-robin).
In Section 4.1 of the programming guide, it says "At every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp".
It seems that the hardware supports Method 2.
Let me use an example to illustrate Method 1 and Method 2.
Example: execute three instructions S1, S2 and S3 in order:
S1 : a <-- a * b + c;    // MAD, writes register a
S2 : a <-- a * b + c;    // MAD, register read-after-write dependence on S1
S3 : odata[index] <-- a; // store register a to global memory
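For concreteness, S1-S3 could come from a kernel like the following minimal sketch (the names mad_example, idata, odata and the scalars b, c are my own placeholders, not taken from [1]):

    __global__ void mad_example(float *odata, const float *idata, float b, float c)
    {
        unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
        float a = idata[index];  // initial write of register a
        a = a * b + c;           // S1: MAD, writes register a
        a = a * b + c;           // S2: MAD, reads a -> read-after-write dependence on S1
        odata[index] = a;        // S3: store register a to global memory
    }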
We show the Gantt chart of Method 1 in Figure 1 and the Gantt chart of Method 2 in Figure 2.
Figure 1: Gantt chart of Method 1.
Figure 2: Gantt chart of Method 2.
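The figures are not reproduced here, but as a rough sketch of how Method 2 could hide the read-after-write latency (assuming 6 resident warps, each executing S1-S3, and one warp instruction issued every 4 clock cycles):

    cycles  0-3  : warp 0 issues S1
    cycles  4-7  : warp 1 issues S1
    cycles  8-11 : warp 2 issues S1
    cycles 12-15 : warp 3 issues S1
    cycles 16-19 : warp 4 issues S1
    cycles 20-23 : warp 5 issues S1
    cycles 24-27 : warp 0 issues S2 (its S1 issued 24 cycles earlier, so the result is ready)

Under Method 1, by contrast, the same warp would presumably try to issue S2 right after S1 and stall on the read-after-write dependence, leaving the SPs idle until the 24-cycle latency has elapsed.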
Reference: [1] Vasily Volkov and James W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra," SC '08, 2008.