Fermi register allocation and read-after-write latency: the undocumented details for efficient occupancy

Two interesting questions… perhaps someone knows the answer, otherwise they can be answered with a little experimentation.

In G200, there’s an (undocumented but well known) 24-clock register read-after-write pipeline latency. The effect of this is reduced instruction throughput when an SM has fewer than 6 warps (192 threads) active (it’s fine if they’re split across different blocks). This is a useful rule of thumb to make sure you’re not wasting instruction throughput.
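The rule of thumb follows from simple arithmetic (a sketch, using the undocumented figures quoted above, not official numbers):

```python
# Warps needed to hide G200's ~24-clock register read-after-write
# latency. On G200 a 32-thread warp issues over 8 SPs, i.e. 4 clocks.
raw_latency_clocks = 24       # undocumented RAW pipeline latency
issue_clocks_per_warp = 4     # 32 threads / 8 SPs
warps_to_hide = raw_latency_clocks // issue_clocks_per_warp
print(warps_to_hide)          # 6 warps = 192 threads per SM
```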

What’s Fermi’s register pipeline latency? Is the same 6 warp heuristic applicable? This may be tricky to answer with the new dual scheduler.

In the same line of research, with G200, registers are allocated in chunks of 256 per block, with a quantization of 4 registers per thread and (for allocation only) the thread count rounded up to a multiple of 64. Again, these details are undocumented, though they’re supported by this super-interesting microbenchmarking paper. So for example a kernel using 9 registers per thread would actually allocate (the rounded-up) 12 registers per thread. And a 32-thread (1-warp) block would use just as many registers per block as a 64-thread block… 96-thread blocks would use the same number of registers as 128-thread blocks. (Perhaps these details are in the formulas in the occupancy calculator?)
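Those rounding rules are easy to put into a little formula (an approximation based on the undocumented details above, not an official one):

```python
import math

def g200_regs_per_block(regs_per_thread, threads_per_block):
    """Estimate G200 per-block register allocation:
      - registers per thread rounded up to a multiple of 4
      - thread count rounded up to a multiple of 64
      - total rounded up to a 256-register chunk
    """
    regs = math.ceil(regs_per_thread / 4) * 4
    threads = math.ceil(threads_per_block / 64) * 64
    return math.ceil(regs * threads / 256) * 256

# 9 registers per thread rounds up to 12, so a 64-thread block
# allocates 12 * 64 = 768 registers:
print(g200_regs_per_block(9, 64))                                   # 768
# A 32-thread block allocates as much as a 64-thread block:
print(g200_regs_per_block(12, 32) == g200_regs_per_block(12, 64))   # True
# And 96 threads allocate as much as 128 threads:
print(g200_regs_per_block(12, 96) == g200_regs_per_block(12, 128))  # True
```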

Does register allocation differ for Fermi?

In general, you want more warps per SM on Fermi. I forget the exact number and specifically why this is the case; I think it’s 10-12.

And yes, register allocation does differ. The occupancy calculator should be correct, though.

12 warps would make a lot of sense, since it’d be double G200, and GF100 has dual-scheduled warps.

The occupancy calculator doesn’t have (explicit) device 2.0 support yet.

Isn’t that also included with the SDK? I’m pretty sure there’s an updated version floating around…

A warp now takes 2 clocks to issue, so you definitely need more warps to hide latency. Btw, the number of multiprocessors was reduced, so each one should run more blocks to keep performance up.

Wow, Listing 7 clarifies how multiple divergent warps can work with __syncthreads(). This is useful to know.

Does it? Before, there were 8 SPs, so 4 clocks to issue a warp. Now there are 32 SPs, so 1 clock to issue a warp (okay, two warps in parallel over 2 clocks, but net it is 1 clock per warp). So if the number of warps needed only doubles, it means the pipeline is half as deep, as far as I understand the whole pipeline-depth issue ;)
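That pipeline-depth inference can be sketched as arithmetic (all figures here are the estimates from this thread, not documented values; the 12-warp count in particular is a guess):

```python
# G200: 8 SPs, so a 32-thread warp issues over 4 clocks.
g200_issue_clocks = 4
g200_warps_to_hide = 6
g200_latency = g200_issue_clocks * g200_warps_to_hide    # 24 clocks

# Fermi (GF100): 32 SPs, two warps dual-issued over 2 clocks,
# i.e. a net 1 clock per warp.
fermi_issue_clocks = 1
fermi_warps_to_hide = 12                                  # guessed figure
fermi_latency = fermi_issue_clocks * fermi_warps_to_hide  # 12 clocks

# If only 12 warps are needed, the implied pipeline is half as deep:
print(g200_latency, fermi_latency)
```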