How best to split threads between blocks and the grid?

If there is only a fairly small number of threads, what is the best way to split them between blocks and the grid, and why?
For example, with 240 available threads and a 9600GSO GPU (12 MPs total), which is better:

  1. to form a grid of 12 blocks (the number of MPs available), each with 20 threads
  2. to form a grid of 4 blocks, with 64 threads in 3 of the blocks and 48 threads in the 4th

The warp size is 32, so you definitely want to keep the block size a multiple of 32 as much as possible.

Did I understand you correctly that the better way would be to use a block size of 32 and launch 8 blocks of 32 threads each?

That way 7 blocks will each have a full warp (the 8th holds the remaining 16 threads), but not all multiprocessors will be used.

So the question is which is better: to leave some multiprocessors completely unused but have as many full warps as possible, or to maximize usage of the available multiprocessors while leaving each block underfilled?

This is a difficult enough question to answer that you should probably benchmark a few different configurations to see what works best with your code. Given that instruction scheduling is done at the warp level, I suspect that 12 blocks * 20 threads will run at the same speed or slower than 8 blocks * 32 threads.

That means you should keep the block size a multiple of 32, not necessarily exactly 32.

To answer your original question, 240 threads is simply not enough to fill the GPU. Theoretically 12x20 should be faster than 4x64, but neither of them will fill the GPU.