If the CUDA Occupancy Calculator states that your kernel is limited to 8 blocks per multiprocessor, then, in order to load-balance effectively, the number of blocks you launch should be a multiple of the number of multiprocessors times the maximum blocks per multiprocessor:
Example:
My kernel is limited to 8 blocks per multiprocessor
My graphics card is a GTX280, which has 30 multiprocessors
30 * 8 = 240 blocks per device (or a multiple thereof) to ensure the grid is balanced across the hardware (see the sketch below)
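A minimal sketch of computing that block count programmatically, assuming a recent CUDA runtime: `cudaOccupancyMaxActiveBlocksPerMultiprocessor` gives the same blocks-per-SM figure as the spreadsheet calculator, and `multiProcessorCount` replaces the hard-coded 30. The kernel name `myKernel`, the 256-thread block size, and the `numWaves` value are placeholders for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only so the occupancy query has something to inspect.
__global__ void myKernel(float *data) { /* ... */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime how many resident blocks of myKernel fit on one SM
    // for the chosen block size (256 threads assumed here).
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, 256 /*threads per block*/, 0 /*dynamic smem*/);

    // One "wave" = every SM filled with its maximum number of resident blocks.
    // On a GTX280 with an 8-blocks-per-SM kernel this is 30 * 8 = 240.
    int blocksPerWave = prop.multiProcessorCount * blocksPerSM;
    printf("%d SMs x %d blocks/SM = %d blocks per wave\n",
           prop.multiProcessorCount, blocksPerSM, blocksPerWave);

    // Launch a whole number of waves so the grid balances across the hardware.
    int numWaves = 4;  // assumption: choose this from your problem size
    myKernel<<<numWaves * blocksPerWave, 256>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```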
Yes, that is optimal if each block takes the same amount of time. If you had 241 blocks, for example, the 241st block would run in a second wave by itself while the rest of the device sat idle, so your total running time could be roughly twice that of 240 blocks.
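A rough sketch of that tail effect, assuming every block takes the same time: runtime scales with the number of waves the grid needs, not the raw block count, so 241 blocks cost two waves where 240 cost one.

```cpp
#include <cstdio>

int main() {
    // GTX280 example from above: 30 SMs * 8 blocks/SM = 240 blocks per wave.
    const int blocksPerWave = 30 * 8;
    for (int n : {240, 241}) {
        int waves = (n + blocksPerWave - 1) / blocksPerWave;  // ceiling divide
        printf("%d blocks -> %d wave(s)\n", n, waves);        // 240 -> 1, 241 -> 2
    }
    return 0;
}
```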