Option 2, because option 1 is physically impossible. When you launch a grid, every block must have the same execution parameters. You cannot have different block configurations within the same kernel run.
Neither is really better. Whether you run 2 or 3 blocks, you are still using only a few percent of the hardware’s capability, and the launch overhead will likely dominate your kernel’s execution time.
Edit: To make that a little clearer, you can probably run 30 or 60 blocks in roughly the same time it would take to run 1 block of the same size, because the hardware executes blocks in parallel.
I think they have missed the part where you said “per multiprocessor”.
So if that is indeed the case, you can run a lot more than 768 threads in total on the graphics card.
There is no secret recipe for block sizes. You try them on your specific problem and find the sweet spot. And that sweet spot won’t (necessarily) be the same for another problem.
So no, 5 threads per block would be terrible, since you have 8 SPs in an MP running in parallel and a warp is 32 threads, so a 5-thread block leaves 27 of the 32 lanes in its only warp idle. What MrAnderson was trying to say is that if your WHOLE GPU has, say, 14 multiprocessors, then you need to run at the very least 14 blocks to keep the card occupied.
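To put rough numbers on that, here is a small host-side sketch in plain C++ (no CUDA calls). The function names are made up for illustration, and it assumes only the 32-thread warp size mentioned above and that a multiprocessor can do work only if at least one block lands on it.

```cpp
#include <cassert>

// Assumption: 32 threads per warp, as stated in the post above.
constexpr int kWarpSize = 32;

// Number of warps needed for one block; a partial warp still counts as a whole one.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Fraction of the block's warp lanes that map to real threads.
double lane_utilization(int threads_per_block) {
    return static_cast<double>(threads_per_block) /
           (warps_per_block(threads_per_block) * kWarpSize);
}

// Minimum number of multiprocessors that receive no block at all
// when the grid has fewer blocks than the GPU has multiprocessors.
int idle_multiprocessors(int num_blocks, int num_multiprocessors) {
    assert(num_blocks >= 0 && num_multiprocessors > 0);
    return num_blocks >= num_multiprocessors ? 0
                                             : num_multiprocessors - num_blocks;
}
```

With these helpers, `lane_utilization(5)` comes out at 5/32 (about 16%), and `idle_multiprocessors(2, 14)` is 12, which is why both a tiny block and a tiny grid waste most of the card.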
In your case, if that “768 per MP” figure is correct, you have to find what the sweet spot is.
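As a starting point for that search, something like the following host-side C++ sketch can narrow the candidates before you start timing. It takes the 768-threads-per-MP figure from this thread at face value; the limit of 8 resident blocks per MP is my assumption for this hardware generation, so check it against your card. `MpLimits`, `resident_blocks` and `resident_warps` are illustrative names, not part of any CUDA API, and the estimate ignores register and shared memory usage, which can lower the real numbers.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. Only the 768 figure comes from the thread;
// the block limit is an assumption to verify for your GPU.
struct MpLimits {
    int warp_size = 32;
    int max_threads_per_mp = 768;
    int max_blocks_per_mp = 8;  // assumed
};

// Warps per block, rounding a partial warp up to a full one.
int block_warps(int threads_per_block, const MpLimits& lim) {
    assert(threads_per_block > 0);
    return (threads_per_block + lim.warp_size - 1) / lim.warp_size;
}

// How many blocks of this size fit on one MP under the limits above.
int resident_blocks(int threads_per_block, const MpLimits& lim = MpLimits{}) {
    int max_warps = lim.max_threads_per_mp / lim.warp_size;  // 24 for 768 threads
    int by_warps = max_warps / block_warps(threads_per_block, lim);
    return std::min(by_warps, lim.max_blocks_per_mp);
}

// Warp slots actually filled on one MP; compare against max_threads_per_mp / warp_size.
int resident_warps(int threads_per_block, const MpLimits& lim = MpLimits{}) {
    return resident_blocks(threads_per_block, lim) *
           block_warps(threads_per_block, lim);
}
```

Under these assumptions, block sizes of 128, 192 and 256 each fill all 24 warp slots, while 64 and 512 fill only 16. That only tells you which sizes are worth trying; timing your actual kernel is still what decides the sweet spot.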