I’m working on a fairly compute-intensive kernel which I execute in N blocks of 32 threads each. These blocks process a large workload which can be distributed evenly across a wide range of N.
I’m testing this kernel on a GTX 480, which has 15 SMs with 32 cores each, and looking at the performance for different values of N = 1…120, i.e. up to 8 blocks per SM, the maximum number of concurrently resident blocks on 15 Fermi SMs.
I was expecting the following behaviour:
For N=1…15, parallel efficiency should be rather good, since each block can run on its own SM.
Beyond N = 15, there are more blocks than SMs, so per-block performance should start to decay, since some blocks will have to share the resources of a common SM.
Beyond N = 120, the code shouldn’t scale at all, since at most 8 blocks can be resident concurrently per SM (15 × 8 = 120).
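The expectations above can be turned into a toy model. This is only a sketch, not how the hardware actually schedules: it assumes blocks are dealt out evenly to SMs and that blocks co-resident on one SM time-share its cores with no overlap.

```python
import math

NUM_SMS = 15  # GTX 480 (Fermi GF100)

def model_efficiency(n_blocks, n_sms=NUM_SMS):
    """Parallel efficiency under the naive model.

    The total workload W is split evenly, so each block does W / n_blocks
    work. If blocks are spread evenly over the SMs and co-resident blocks
    time-share an SM, the runtime is proportional to
    (W / n_blocks) * ceil(n_blocks / n_sms), which gives an efficiency
    (relative to n_blocks ideal workers) of 1 / ceil(n_blocks / n_sms).
    """
    return 1.0 / math.ceil(n_blocks / n_sms)

for n in (12, 15, 16, 120):
    # prints 1.0 for N <= 15, 0.5 for N = 16, 0.125 for N = 120
    print(n, model_efficiency(n))
```

Note that this naive model predicts a hard step from 100% down to 50% at N = 16, whereas the decay I actually measure is smooth; presumably a single 32-thread block cannot keep an SM fully busy on its own, so co-resident blocks partly overlap rather than strictly time-sharing.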
This is more or less the behaviour I observe, with one caveat: the parallel efficiency starts decaying already at N = 12.
At N = 12 I measure a parallel efficiency above 99.5%; at N = 13 it is 99.26%, and at N = 14 it is 99.07%. The decay continues steadily until, at N = 120, efficiency is down to 77.13%. Beyond N = 120, performance indeed does not scale at all.
So here’s the question: why does this drop-off start at N = 12 and not, as I would expect, at N = 15? Is there something about how blocks are scheduled onto SMs that keeps the GPU from filling all 15 SMs at N = 15?
I can rule out any other computation using the GPU: the attached monitor runs off a separate adapter, I use the machine remotely, and if any other tasks were running I would not be able to schedule 120 concurrent blocks, i.e. scaling would stop before that point.
Any help in understanding this is much appreciated!