Two interesting questions… perhaps someone knows the answer, otherwise they can be answered with a little experimentation.
In G200, there’s an (undocumented but well-known) 24-clock register read-after-write pipeline latency. The effect of this is reduced instruction throughput when an SM has fewer than 6 warps (192 threads) active — with 8 SPs a warp issues over 4 clocks, so 6 warps cover the 24-clock latency — and it doesn’t matter whether those warps come from the same block or different blocks. This is a useful rule of thumb to make sure you’re not wasting instruction throughput.
What’s Fermi’s register pipeline latency? Is the same 6-warp heuristic applicable? This may be tricky to answer with the new dual scheduler.
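If nobody knows offhand, here’s roughly the experiment I’d try — a minimal microbenchmark sketch (kernel name, iteration count and launch configuration are my own choices, not anything official). A chain of dependent FMAs exposes the read-after-write latency directly: run it with 1 warp per SM and again with 6+ warps and see where the per-instruction cost stops dropping.

```cuda
#include <cstdio>

// Sketch: each thread times a chain of dependent FMAs, so every instruction
// stalls on the previous result and the register read-after-write latency
// shows up directly in cycles per iteration.
__global__ void raw_latency(float seed, float *out, unsigned int *cycles)
{
    float x = seed;
    unsigned int start = (unsigned int)clock();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        x = x * x + seed;                 // each FMA depends on the previous one
    unsigned int stop = (unsigned int)clock();

    out[threadIdx.x]    = x;              // keep the chain live
    cycles[threadIdx.x] = stop - start;
}

int main()
{
    float *d_out;
    unsigned int *d_cycles, h_cycles[32];
    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMalloc(&d_cycles, 32 * sizeof(unsigned int));

    // 1 warp on one SM: the latency is fully exposed.
    // Re-launch with 192+ threads to see it get hidden.
    raw_latency<<<1, 32>>>(1.0f, d_out, d_cycles);
    cudaMemcpy(h_cycles, d_cycles, 32 * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    printf("cycles per dependent FMA (1 warp): %.2f\n", h_cycles[0] / 256.0f);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```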
In the same line of research: with G200, registers are allocated in chunks of 256 per block, with a quantization of 4 registers per thread and (for allocation only) the thread count rounded up to a multiple of 64. (Again, these details are undocumented, though they’re supported by this super-interesting microbenchmarking paper.) So, for example, a kernel using 9 registers per thread would actually allocate (the rounded-up) 12 registers per thread. And a 32-thread (1-warp) block would use just as many registers per block as a 64-thread block… and a 96-thread block would use the same number of registers as a 128-thread block. (Perhaps these details are in the formulas in the occupancy calculator?)
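For concreteness, here’s a small sketch of that allocation rule as I read it (function name and exact rounding order are my guesses; the occupancy calculator spreadsheet would be the authoritative source):

```cuda
// Host-side sketch of the G200 per-block register allocation rule described above.
static unsigned int round_up(unsigned int x, unsigned int multiple)
{
    return ((x + multiple - 1) / multiple) * multiple;
}

unsigned int regs_per_block_g200(unsigned int regs_per_thread,
                                 unsigned int threads_per_block)
{
    unsigned int regs    = round_up(regs_per_thread, 4);    // 9 regs  -> 12
    unsigned int threads = round_up(threads_per_block, 64); // 96 thr  -> 128
    return round_up(regs * threads, 256);                   // 256-reg block granularity
}

// e.g. regs_per_block_g200(9, 96) == regs_per_block_g200(12, 128) == 1536,
// matching the "96-thread block costs the same as a 128-thread block" example.
```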
A warp now takes 2 clocks to execute, so you need more warps for sure to hide latency. Btw, the number of multiprocessors was reduced, so each one should run more blocks to get more performance.
Does it? Before, there were 8 SPs, so 4 clocks to issue a warp. Now there are 32 SPs, so 1 clock to issue a warp (okay, two warps in parallel over 2 clocks, but net it is 1 clock). So if the number of warps needed only doubles, it means the pipeline is half as deep, as far as I understand the whole pipeline-depth issue ;)
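Just to put numbers on that argument (back-of-the-envelope assumptions, not measurements):

```cuda
// Warps needed to hide a read-after-write latency = latency / issue clocks per warp.
unsigned int warps_to_hide_latency(unsigned int raw_latency_clocks,
                                   unsigned int issue_clocks_per_warp)
{
    return (raw_latency_clocks + issue_clocks_per_warp - 1) / issue_clocks_per_warp;
}

// G200:  warps_to_hide_latency(24, 4) == 6   (the 192-thread rule of thumb)
// Fermi: warps_to_hide_latency(24, 1) would be 24 if the latency were unchanged;
//        a shallower pipeline (smaller latency) would cut that number down.
```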