Block Count

Simple question:

Is it true that if the CUDA Occupancy Calculator states that your kernel is limited to 8 blocks per multiprocessor, then, in order to load balance effectively, the number of blocks you launch should be a multiple of the number of multiprocessors multiplied by the maximum blocks per multiprocessor?

Example:

    My kernel is limited to 8 blocks per multiprocessor
    My graphics card is a GTX 280, which has 30 multiprocessors

30 * 8 = 240 blocks per device (or a multiple of 240) to ensure the grid is balanced across the hardware.

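Yes, that is optimal if each block takes the same amount of time. With 241 blocks, for example, your total running time might be 2x the running time for 240 blocks: only 240 blocks can be resident at once, so the extra block has to wait for a slot to free up and then runs while most of the device sits idle.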
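The arithmetic above can be sketched in a few lines of host-side C++. This is only a rough model, assuming every block takes the same time and that blocks run in discrete "waves" of 240; real hardware starts a new block as soon as any resident block finishes, so treat the wave count as an estimate. The function names (`blocks_per_wave`, `wave_count`, `round_up_to_full_waves`) are made up for illustration and are not part of the CUDA API.

```cpp
#include <cassert>

// Number of blocks that can be resident on the whole device at once.
// For the example in this thread: 30 multiprocessors * 8 blocks each = 240.
inline int blocks_per_wave(int sm_count, int blocks_per_sm) {
    assert(sm_count > 0 && blocks_per_sm > 0);
    return sm_count * blocks_per_sm;
}

// Estimated number of waves needed to run grid_blocks blocks
// (ceiling division by the wave size). Assumes equal-duration blocks.
inline int wave_count(int grid_blocks, int sm_count, int blocks_per_sm) {
    assert(grid_blocks >= 0);
    const int wave = blocks_per_wave(sm_count, blocks_per_sm);
    return (grid_blocks + wave - 1) / wave;
}

// Smallest multiple of the wave size that is >= grid_blocks,
// i.e. a grid size that leaves no partially filled last wave.
inline int round_up_to_full_waves(int grid_blocks, int sm_count, int blocks_per_sm) {
    return wave_count(grid_blocks, sm_count, blocks_per_sm) *
           blocks_per_wave(sm_count, blocks_per_sm);
}
```

With the numbers from the question, `wave_count(240, 30, 8)` gives 1 and `wave_count(241, 30, 8)` gives 2, which is where the "might be 2x" figure comes from. In a real program the two inputs would come from the device properties and the occupancy limit for your kernel rather than being hard-coded.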