I have a pairwise comparison problem that needs 160x160 threads. With 32x32 threads per block, that gives 5x5=25 blocks. I have a GTX470 with 1.25GB VRAM. Because of the memory limit, I can only run three blocks per kernel launch. After 9 launches, I clocked a total run time of 21min.
I suspected the slow speed was because only three of the 14 MPs were used in each launch. So I reduced the threads per block to 16x16, which gives 10x10=100 blocks. I can now run at most 15 blocks per kernel launch. After 7 launches, I clocked a run time of 14.5min.
Since each MP can only work on one block at a time, I suspected I was wasting one cycle by running 15 blocks per kernel launch: 14 blocks would run together, and the 15th would then run alone on a single MP. So I finally tried running 14 blocks per launch. After 8 launches, I clocked a run time of 13min.
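To make my reasoning concrete, here is a rough sketch of the arithmetic for the three configurations. It is only a back-of-the-envelope model, not a measurement: the function names are made up for illustration, and `blocks_per_mp=1` encodes my assumption that an MP works on one block at a time, which may not match what the hardware actually does.

```python
import math

NUM_MPS = 14  # multiprocessor count of the GTX470, as stated above


def blocks_in_grid(threads_per_side, block_side):
    """Blocks needed to cover a square grid of threads_per_side x threads_per_side threads."""
    per_side = math.ceil(threads_per_side / block_side)
    return per_side * per_side


def launches_needed(total_blocks, blocks_per_launch):
    """Kernel launches needed when each launch can hold at most blocks_per_launch blocks."""
    return math.ceil(total_blocks / blocks_per_launch)


def waves_per_launch(blocks_per_launch, num_mps=NUM_MPS, blocks_per_mp=1):
    """Rounds of execution in one full launch.

    ASSUMPTION: each MP processes blocks_per_mp blocks at a time (1 here).
    """
    return math.ceil(blocks_per_launch / (num_mps * blocks_per_mp))


# (block side, blocks per launch) for the three runs described above
for block_side, per_launch in [(32, 3), (16, 15), (16, 14)]:
    total = blocks_in_grid(160, block_side)
    print(
        f"{block_side}x{block_side} threads/block, {per_launch} blocks/launch: "
        f"{total} blocks, {launches_needed(total, per_launch)} launches, "
        f"{waves_per_launch(per_launch)} wave(s) per full launch"
    )
```

Under that assumption, a launch of 15 blocks needs two waves (the second one keeping a single MP busy), while a launch of 14 blocks needs only one.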
Based on this experience, does that mean that when the number of blocks per kernel launch is small, I should always try to make it a multiple of the number of MPs (14 for the GTX470)?