Occupancy doesn't tally with calculator

First off - I’m using CUDA 2.0 on a Tesla C870 (compute capability 1.0).

I have a kernel using 47 registers per thread. It’s currently suffering from low occupancy and I’m fairly sure increasing the occupancy will make it zoom a bit. Shared memory is not a problem - only 568 bytes per block.

First job was to stick in -maxrregcount=42. This cuts the register count without dipping into local memory… probably at the cost of some computational time. I then went to the occupancy calculator and noticed that if I set my block size to 96, I should be able to get two blocks per MP - expected occupancy 25%. Excellent.

I do this, and then run the profiler. Occupancy 12.5%. Curses. Slowly lowering the maximum number of registers I’m letting it use starts dipping into lmem. When I get it down to 32 I suddenly get the higher occupancy I was expecting earlier (25%). Speed seems about comparable to 16.7% occupancy with 128 threads per block and 47 registers. Increasing occupancy has clearly given me a hefty chunk of speed to counteract that huge amount of local memory I’m now using.

I’m baffled by this outcome. At 32 registers per thread I should be using only 6144 of my 8192 registers. At 42 I should be using nearly all of my 8192 registers (both sums from occupancy calculator… 2 * 42 * 96 = 8064 is a smidgen lower than 8192… but there seems to be a slightly more complicated way to work it out than that).

Why am I not getting 25% occupancy with 42 registers?

When I fill in 96 threads and 42 registers in the occupancy calculator I get 13% occupancy…

This is the formula the calculator uses for the number of registers required by 1 block:
=CEILING(MyWarpsPerBlock*2; 4)*16*MyRegCount

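So registers are allocated in units of 2 warps: an odd number of warps per block needs an extra warp's worth of registers (and thus is better avoided), and the minimum is 2 warps' worth of registers. With 96 threads (3 warps) you are charged for 4 warps, so at 42 registers two blocks no longer fit in 8192.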

If you use 64 or 192 threads per block, you will get the 25% occupancy (3 or 1 blocks per MP, respectively).
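
For anyone who wants to check the numbers without opening the spreadsheet, here is a rough Python sketch of the same sum. The function names are made up for illustration, and the limits are the compute capability 1.0 ones (8192 registers, 24 warps and 8 blocks per MP). It ignores shared memory entirely, which is assumed not to be the limiter here (568 bytes per block), so treat it as an approximation of the calculator rather than a replacement for it.

```python
import math

# Compute capability 1.0 limits per multiprocessor (MP).
REGS_PER_MP = 8192
MAX_WARPS_PER_MP = 24
MAX_BLOCKS_PER_MP = 8
WARP_SIZE = 32


def regs_per_block(threads_per_block, regs_per_thread):
    """Registers charged to one block, following the spreadsheet formula
    CEILING(MyWarpsPerBlock*2; 4) * 16 * MyRegCount."""
    warps = math.ceil(threads_per_block / WARP_SIZE)
    # Half-warps, rounded up to a multiple of 4 (i.e. to a multiple of 2 warps).
    half_warps = math.ceil(warps * 2 / 4) * 4
    return half_warps * 16 * regs_per_thread


def occupancy(threads_per_block, regs_per_thread):
    """Return (blocks per MP, occupancy as a fraction).
    Assumption: shared memory is not the limiting resource."""
    warps = math.ceil(threads_per_block / WARP_SIZE)
    blocks = min(
        REGS_PER_MP // regs_per_block(threads_per_block, regs_per_thread),
        MAX_WARPS_PER_MP // warps,
        MAX_BLOCKS_PER_MP,
    )
    return blocks, blocks * warps / MAX_WARPS_PER_MP


if __name__ == "__main__":
    # The configurations mentioned in this thread.
    for threads, regs in [(96, 42), (96, 32), (128, 47), (64, 42), (192, 42)]:
        blocks, occ = occupancy(threads, regs)
        print(f"{threads:3d} threads, {regs} regs: {blocks} block(s)/MP, {occ:.1%}")
```

With 96 threads and 42 registers a block is charged 5376 registers, so only one block fits and occupancy is 12.5%. At 32 registers the charge drops to 4096 and two blocks fit exactly.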

Righto - I’ll try 64 and 192 and see which is better. Ta.

What version of the calculator are you using? The latest version I can find is 1.4… but given yours seems to be working correctly I imagine there is a newer one.

This calculation has been there since v1.0, I believe. I am using the one delivered with the 2.1 beta, dated 21-06-2008.