I think they have missed the part where you said “per multiprocessor”.
So if that is indeed the case, you have a lot more than 768 threads in total to run on the graphics card.
There is no secret recipe for block sizes. You try them on your specific problem and find where the sweet spot is. And that sweet spot won't (necessarily) be the same for another problem.
So no, 5 threads per block would be terrible, since you have 8 SPs in an MP running in parallel, and the size of a warp is 32 threads: a 5-thread block still takes up a whole warp, so 27 of its 32 slots do nothing. What MrAnderson was trying to say is that if your WHOLE GPU has, say, 14 multiprocessors, then you need to run at the very least 14 blocks to keep the card occupied.
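To put a number on why 5 threads per block is so wasteful, here is a rough sketch of the warp arithmetic. It only assumes the 32-thread warp size mentioned above; the function names are made up for illustration and are not part of any CUDA API.

```cpp
// Warp size quoted above (32 threads).
constexpr int kWarpSize = 32;

// Hypothetical helper: how many warps one block occupies, rounded up,
// because a partially filled warp still takes a whole warp slot.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Hypothetical helper: fraction of the occupied warp slots that
// actually hold a thread of the block.
constexpr double lane_utilization(int threads_per_block) {
    return static_cast<double>(threads_per_block) /
           (warps_per_block(threads_per_block) * kWarpSize);
}
```

With these, lane_utilization(5) comes out at 0.15625 (about 16%), while lane_utilization(64) is 1.0. That is the reason block sizes are normally picked as multiples of 32.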
In your case, if that “768 per MP” figure is correct, you have to find the sweet spot by trying different block sizes on your own kernel.
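Before timing anything, you can at least narrow down the candidates with some back-of-the-envelope arithmetic. The sketch below takes the 768-threads-per-MP figure from this thread at face value and ignores every other limit (registers, shared memory, any cap on blocks per MP), so treat the results as an upper bound, not a prediction. The names are hypothetical, and the 14 in the usage note is just the example MP count from above.

```cpp
// Per-multiprocessor thread limit as quoted in this thread (unverified here).
constexpr int kMaxThreadsPerMP = 768;

// Hypothetical helper: whole blocks that fit on one MP when the
// thread limit is the only constraint considered.
constexpr int resident_blocks(int threads_per_block) {
    return kMaxThreadsPerMP / threads_per_block;
}

// Hypothetical helper: threads resident on one MP for that block size.
constexpr int resident_threads(int threads_per_block) {
    return resident_blocks(threads_per_block) * threads_per_block;
}

// Hypothetical helper: upper bound on resident threads across the card.
constexpr int total_resident_threads(int threads_per_block, int mp_count) {
    return resident_threads(threads_per_block) * mp_count;
}
```

For example, resident_threads(256) is 768 (three full blocks), but resident_threads(512) is only 512, because a second 512-thread block no longer fits. With 14 multiprocessors, total_resident_threads(256, 14) gives 10752. Sizes that divide 768 evenly are a sensible place to start, but only measuring your own kernel tells you which one wins.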