We are getting some strange results from a CUDA kernel we are playing with here. The kernel is not really dependent on the number of threads per block, so we are using that as a tuning parameter to gain performance. We have noticed that what we measure and what the CUDA occupancy calculator tells us do not match up:
When we set the number of threads per block equal to the number of actual processors per SIMD group on the card, performance improved drastically (20 ms -> 10 ms) without changing any other code, even though the occupancy calculator said we were dropping from over 50% occupancy to 33% occupancy. This change did not affect the amount of data being processed by the kernel (a sketch of how we time and sweep the block size is included below).
Performance also improved when we increased the number of registers used in the kernel, even though the calculator said we were taking a large hit in occupancy.
In a simple test program, setting the number of threads per block to any number larger than the number of processors in a single SIMD group leads to a large hit in performance, even though the kernel does not use any shared memory.
I cannot post the kernel; however, I can say that the kernels are completely independent of each other (i.e., they do not depend on anything generated in any other kernel).
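For reference, here is roughly how we drive the experiment. This is only a minimal sketch: dummyKernel is a placeholder standing in for our real kernel (which I can't post), but the timing and block-size sweep are done the same way.

```
// Sketch of how we sweep threads-per-block and time the kernel.
// dummyKernel is a placeholder for the real kernel: each thread does
// independent work on its own element, nothing more.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * 2.0f + 1.0f;   // stand-in for the real per-thread work
}

int main()
{
    const int n = 1 << 22;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so first-launch overhead does not skew the first timing.
    dummyKernel<<<256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep the block size; total work stays constant because the grid
    // shrinks as the block grows.
    for (int threadsPerBlock = 32; threadsPerBlock <= 512; threadsPerBlock *= 2) {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        cudaEventRecord(start);
        dummyKernel<<<blocks, threadsPerBlock>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("threads/block = %3d : %.3f ms\n", threadsPerBlock, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```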
This leads me to some questions:
What does the occupancy calculator calculate, exactly?
How could it be that a program with near 100% occupancy runs slower than a program with 33% occupancy, assuming both do the same work and are written efficiently?
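In case it helps pin down what I mean by occupancy: my understanding is that, with a recent enough toolkit, the same theoretical occupancy the spreadsheet calculator reports can be queried at runtime via cudaOccupancyMaxActiveBlocksPerMultiprocessor. The sketch below (again using the placeholder kernel, not ours, and assuming that API is available) shows the numbers I am comparing our measured timings against.

```
// Sketch of querying theoretical occupancy at runtime instead of using the
// spreadsheet calculator. Assumes a CUDA version that provides
// cudaOccupancyMaxActiveBlocksPerMultiprocessor; dummyKernel is the same
// placeholder as above, not our real kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * 2.0f + 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    for (int threadsPerBlock = 32; threadsPerBlock <= 512; threadsPerBlock *= 2) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                      threadsPerBlock,
                                                      0 /* dynamic shared memory */);

        // Theoretical occupancy = resident threads per SM / max threads per SM.
        float occupancy = (float)(blocksPerSM * threadsPerBlock) /
                          prop.maxThreadsPerMultiProcessor;
        printf("threads/block = %3d : %d resident blocks/SM, occupancy %.0f%%\n",
               threadsPerBlock, blocksPerSM, occupancy * 100.0f);
    }
    return 0;
}
```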