I ran the occupancy calculator for two versions of my code. One version uses 15 registers and the other uses 11; in both cases I use a block size of 128 threads. The 15-register version basically keeps more locals in registers (e.g. common expressions like 2*x, which is used in array indexing and elsewhere, are precalculated and stored in a register). Both versions use 48 words(?) of shared memory according to the cubin output.
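For concreteness, the kind of hand-hoisting I mean looks roughly like this (a minimal sketch, not my actual code; the kernel and variable names are made up):

```cuda
// 11-register style: recompute the common expression 2*x at each use
__global__ void kernel_recompute(float *out, const float *in, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n) {
        out[2 * x]     = in[2 * x] * 0.5f;
        out[2 * x + 1] = in[2 * x + 1] * 0.5f;
    }
}

// 15-register style: precompute 2*x once and hold it in a register
__global__ void kernel_hoisted(float *out, const float *in, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n) {
        int x2 = 2 * x;  // common subexpression kept in a register
        out[x2]     = in[x2] * 0.5f;
        out[x2 + 1] = in[x2 + 1] * 0.5f;
    }
}
```

The hoisted version trades a few extra registers per thread for fewer integer operations per memory access.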
According to the spreadsheet, using 4 more registers should have dropped my GPU occupancy from around 80% to 67%.
But the CUDA profiler reports that the 11-register version actually runs a fair bit slower (2100 microseconds) than the 15-register version (1700 microseconds).
I am happy to see my effort at optimization has paid off (400 microseconds faster), but it doesn't seem to match what the occupancy calculator predicts. Has anyone come across a reasonable explanation?
BTW, shouldn't the nvcc compiler be doing common subexpression elimination at -O3, rather than me doing it by hand?