Fermi register allocation and read-after-write latency: the undocumented details for efficient occupancy

Two interesting questions… perhaps someone knows the answer, otherwise they can be answered with a little experimentation.

In G200, there’s an (undocumented but well known) 24-clock register read-after-write pipeline latency. The effect of this is reduced instruction throughput when an SM has fewer than 6 warps (192 threads) active (it’s fine if they’re split across different blocks). This is a useful rule of thumb to make sure you’re not wasting instruction throughput.
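The rule of thumb follows from simple arithmetic (a sketch, using the undocumented figures quoted above, not official numbers):

```python
# Warps needed to hide G200's ~24-clock register read-after-write
# latency. On G200 a 32-thread warp issues over 8 SPs, i.e. 4 clocks.
raw_latency_clocks = 24       # undocumented RAW pipeline latency
issue_clocks_per_warp = 4     # 32 threads / 8 SPs
warps_to_hide = raw_latency_clocks // issue_clocks_per_warp
print(warps_to_hide)          # 6 warps = 192 threads per SM
```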

What’s Fermi’s register pipeline latency? Is the same 6 warp heuristic applicable? This may be tricky to answer with the new dual scheduler.

In the same line of research, with G200, registers are allocated in chunks of 256 per block, with a quantization of 4 registers per thread and (for allocation only) the thread count rounded up to a multiple of 64. Again, these details are undocumented, though they’re supported by this super-interesting microbenchmarking paper. So for example a kernel using 9 registers per thread would actually allocate (the rounded-up) 12 registers per thread. And a 32-thread (1-warp) block would use just as many registers per block as a 64-thread block… 96-thread blocks would use the same number of registers as 128-thread blocks. (Perhaps these details are in the formulas in the occupancy calculator?)
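Those rounding rules are easy to put into a little formula (an approximation based on the undocumented details above, not an official one):

```python
import math

def g200_regs_per_block(regs_per_thread, threads_per_block):
    """Estimate G200 per-block register allocation:
      - registers per thread rounded up to a multiple of 4
      - thread count rounded up to a multiple of 64
      - total rounded up to a 256-register chunk
    """
    regs = math.ceil(regs_per_thread / 4) * 4
    threads = math.ceil(threads_per_block / 64) * 64
    return math.ceil(regs * threads / 256) * 256

# 9 registers per thread rounds up to 12, so a 64-thread block
# allocates 12 * 64 = 768 registers:
print(g200_regs_per_block(9, 64))                                   # 768
# A 32-thread block allocates as much as a 64-thread block:
print(g200_regs_per_block(12, 32) == g200_regs_per_block(12, 64))   # True
# And 96 threads allocate as much as 128 threads:
print(g200_regs_per_block(12, 96) == g200_regs_per_block(12, 128))  # True
```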

Does register allocation differ for Fermi?

In general, you want more warps per SM on Fermi. I forget the exact number and specifically why this is the case; I think it’s 10-12.

And yes, register allocation does differ. The occupancy calculator should be correct, though.

12 warps would make a lot of sense, since it’d be double G200, and GF100 has dual-scheduled warps.

The occupancy calculator doesn’t have (explicit) device 2.0 support yet.

Isn’t that also included with the SDK? I’m pretty sure there’s an updated version floating around…

A warp now takes 2 clocks to issue, so you definitely need more warps to hide latency. Btw, the number of multiprocessors was reduced, so each one should run more blocks to keep performance up.

Wow, Listing 7 clarifies how multiple divergent warps can work with __syncthreads(). This is useful to know.

Does it? Before, there were 8 SPs, so 4 clocks to issue a warp. Now there are 32 SPs, so 1 clock to issue a warp (okay, two warps in parallel over 2 clocks, but net it is 1 clock per warp). So if the number of warps needed only doubles, it means the pipeline is half as deep, as far as I understand the whole pipeline-depth issue ;)
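That pipeline-depth inference can be sketched as arithmetic (all figures here are the estimates from this thread, not documented values; the 12-warp count in particular is a guess):

```python
# G200: 8 SPs, so a 32-thread warp issues over 4 clocks.
g200_issue_clocks = 4
g200_warps_to_hide = 6
g200_latency = g200_issue_clocks * g200_warps_to_hide    # 24 clocks

# Fermi (GF100): 32 SPs, two warps dual-issued over 2 clocks,
# i.e. a net 1 clock per warp.
fermi_issue_clocks = 1
fermi_warps_to_hide = 12                                  # guessed figure
fermi_latency = fermi_issue_clocks * fermi_warps_to_hide  # 12 clocks

# If only 12 warps are needed, the implied pipeline is half as deep:
print(g200_latency, fermi_latency)
```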