Do we need to be conscious of the number of MPs in our GPU?

I have a pairwise comparison problem with 160x160 threads. If I use 32x32 threads per block, I get 5x5 = 25 blocks. I have a GTX 470 with 1.25 GB of VRAM, and due to the memory limit I can only run three blocks per kernel launch. After 9 launches, I clocked a 21 min run time.
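For reference, the launch scheme looks roughly like this (a minimal sketch only; the kernel name pairwiseCompare, the linear block-offset trick, and the placeholder comparison are illustrative, not my actual code):

```cpp
#include <cuda_runtime.h>

// Each thread compares one (i, j) pair of an n x n problem. The grid is
// launched as a 1D strip of blocks; firstBlock shifts the block index so
// the full set of blocks can be spread across several kernel launches.
__global__ void pairwiseCompare(const float *a, const float *b, float *out,
                                int n, int blocksPerRow, int firstBlock)
{
    int block = firstBlock + blockIdx.x;   // global block index
    int bx = block % blocksPerRow;         // block column
    int by = block / blocksPerRow;         // block row
    int i = bx * blockDim.x + threadIdx.x;
    int j = by * blockDim.y + threadIdx.y;
    if (i < n && j < n)
        out[j * n + i] = a[i] - b[j];      // placeholder comparison
}

int main()
{
    const int n = 160;
    dim3 threads(32, 32);                                  // 1024 threads per block
    int blocksPerRow    = (n + threads.x - 1) / threads.x; // 5
    int totalBlocks     = blocksPerRow * blocksPerRow;     // 25
    int blocksPerLaunch = 3;                               // limited by memory

    float *a, *b, *out;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&out, n * n * sizeof(float));

    // 25 blocks, 3 per launch -> 9 launches.
    for (int first = 0; first < totalBlocks; first += blocksPerLaunch) {
        int count = totalBlocks - first < blocksPerLaunch
                        ? totalBlocks - first : blocksPerLaunch;
        pairwiseCompare<<<count, threads>>>(a, b, out, n, blocksPerRow, first);
    }
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```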

I suspected the slow speed might be because only three of the 14 MPs are used each time, so I reduced the threads per block to 16x16. I now have 10x10 = 100 blocks and can run at most 15 blocks per kernel launch. After 7 launches, I clocked a 14.5 min run time.

Since each MP can only work on one block at a time, I suspect I might have wasted one cycle by running 15 blocks per kernel launch, so I finally tried running 14 blocks per launch. After 8 launches, I clocked a 13 min run time.

Based on what I learned from this experience, does that mean that when the number of blocks per kernel launch is small, I should always try to make it a multiple of the number of MPs (14 for the 470)?
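If the answer is yes, I assume I should query the MP count rather than hardcode 14. Something along these lines is what I have in mind (just a sketch; I believe multiProcessorCount is the relevant field of cudaDeviceProp):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    // Number of MPs/SMs (14 on a GTX 470). Pick blocks per launch as a
    // multiple of this, as far as the memory limit allows.
    int sms = prop.multiProcessorCount;
    int blocksPerLaunch = sms;           // or 2 * sms, 3 * sms, ...
    printf("SMs: %d, blocks per launch: %d\n", sms, blocksPerLaunch);
    return 0;
}
```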

I’ve noticed the same thing when I have to run very small numbers of blocks.

Perhaps it is just a matter of occupancy. With 32x32 blocks, no SM (or MP) will hold more than 1 block at a time, but with 16x16 blocks you might be able to push more onto each SM if it has enough resources for them (the SM will be able to take at most 6 blocks of 16x16 size).
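You can check that arithmetic against your device's limits yourself. Here is a rough sketch, assuming the thread-per-SM limit is the binding one (registers and shared memory could lower the resident block count further):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Upper bound on resident blocks per SM from the thread limit alone
    // (1536 threads per SM on compute capability 2.0 parts like the GTX 470).
    int threadsPerSM = prop.maxThreadsPerMultiProcessor;
    printf("32x32 blocks: at most %d resident per SM\n", threadsPerSM / (32 * 32)); // 1
    printf("16x16 blocks: at most %d resident per SM\n", threadsPerSM / (16 * 16)); // 6
    return 0;
}
```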

What could explain the seemingly magical property of the exact number of SMs is that with an exactly fitting number of blocks, every SM gets one block and none are left waiting to be scheduled, which would remove some kind of context-switch overhead (with 15 blocks on 14 SMs, one SM presumably ends up with twice the work of the others, and the launch cannot finish until it does).

Either or both of these things could be what you are experiencing. These are just my thoughts, though, so I could be wrong.