First off: I’m using CUDA 2.0 on a Tesla C870 (compute capability 1.0).
I have a kernel that uses 47 registers per thread. It’s currently suffering from low occupancy, and I’m fairly sure increasing the occupancy will make it zoom a bit. Shared memory is not a problem: only 568 bytes per block.
First job was to stick in -maxrregcount=42. This cuts the register count without dipping into local memory… probably at the cost of some computation time. I then went to the occupancy calculator and noticed that if I set my block size to 96, I should be able to get two blocks per MP, for an expected occupancy of 25%. Excellent.
I did this, and then ran the profiler. Occupancy 12.5%. Curses. Slowly lowering the maximum number of registers I let it use starts dipping into lmem. When I get it down to 32, I suddenly get the higher occupancy I was expecting earlier (25%). Speed is about comparable to the original configuration (16.7% occupancy, 128 threads per block, 47 registers), so the higher occupancy has clearly bought back a hefty chunk of speed to offset the large amount of local memory I’m now using.
I’m baffled by this outcome. At 32 registers per thread I should only be using 6144 of my 8192 registers (2 blocks * 96 threads * 32 registers). At 42 I should be using all of my 8192 registers (both sums from the occupancy calculator… 2 * 96 * 42 = 8064 is a smidgen lower than 8192, but there seems to be a slightly more complicated way to work it out than that).
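For reference, here is the naive sum I’m doing by hand, as a small Python sketch. It assumes the CC 1.0 limits of 8192 registers, 24 warps (768 threads) and 8 blocks per MP, and it deliberately ignores shared memory (568 bytes per block is nowhere near a limit) and any rounding the hardware or the calculator may apply when handing out registers. The function name and constants are my own, purely for illustration.

```python
# Naive occupancy estimate for one multiprocessor (MP) on a CC 1.0 device.
# Assumption: registers are handed out exactly as regs_per_thread * threads,
# with no allocation granularity or rounding of any kind.
REGS_PER_MP = 8192
MAX_WARPS_PER_MP = 24    # 768 threads / 32 threads per warp
MAX_BLOCKS_PER_MP = 8
WARP_SIZE = 32


def naive_occupancy(regs_per_thread, threads_per_block):
    """Return (blocks per MP, occupancy as a fraction) under the naive model."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    blocks_by_regs = REGS_PER_MP // (regs_per_thread * threads_per_block)
    blocks_by_warps = MAX_WARPS_PER_MP // warps_per_block
    blocks = min(blocks_by_regs, blocks_by_warps, MAX_BLOCKS_PER_MP)
    return blocks, blocks * warps_per_block / MAX_WARPS_PER_MP


if __name__ == "__main__":
    for regs, threads in [(47, 128), (42, 96), (32, 96)]:
        blocks, occ = naive_occupancy(regs, threads)
        print(f"{regs} regs, {threads} threads/block: "
              f"{blocks} block(s) per MP, {occ:.1%} occupancy")
```

Under this model both 42 and 32 registers with 96 threads per block give two blocks per MP and 25% occupancy, and 47 registers with 128 threads gives one block and 16.7%. The profiler agrees with it for the 47 and 32 register cases but not for 42, so whatever the real allocation rule is, this simple product evidently isn’t the whole story.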
Why am I not getting 25% occupancy with 42 registers?