How do you determine the number of blocks that a GPU device can run simultaneously?
If my device has 14 multiprocessors, should the number of blocks launched be a multiple of 14
so that the device is fully occupied at all times?
I am trying to run a specified number (N) of Monte Carlo simulations on the GPU. It is more efficient
to launch the minimum number of blocks that fully engages the device and loop as many times as required
than to launch many blocks with a lower loop count. For example, for N = 524288 simulations, it is better to launch 32 blocks
each doing 16384 simulations than 512 blocks each doing 1024 simulations.
Can I determine the minimum number of blocks by inspecting device properties alone?
It is hard to guess the optimal work distribution for any problem without a little benchmarking.
So my basic answer is: try out the different setups and time them.
Also use the occupancy calculator Excel sheet to work out the occupancy.
In general you want to run the maximum number of threads per multiprocessor (MP).
So find out how many threads you can run per MP and how many blocks you need to reach that.
This can be limited by the registers and shared memory your kernel needs, or by the maximum number of blocks runnable per MP (8, if I remember correctly).
The maximum number of blocks per MP is listed in the programming guide.
Then multiply that per-MP block count by the number of multiprocessors to get the minimum degree of parallelism you want to have.
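To make that arithmetic concrete, here is a rough host-side sketch in plain C++. The struct and function names are made up for illustration, and the limits you pass in are placeholders that you would take from the programming guide or the device properties (for example `multiProcessorCount`) for your own card. It also ignores the allocation granularity that real hardware applies to registers and shared memory, so treat the result as an estimate and not as a replacement for the occupancy calculator.

```cpp
#include <algorithm>

// Hypothetical per-multiprocessor limits. Real values depend on the GPU.
struct MpLimits {
    int max_blocks;        // hard cap on resident blocks per MP
    int max_threads;       // resident threads per MP
    int registers;         // registers available per MP
    int shared_mem_bytes;  // shared memory available per MP
};

// Hypothetical resource usage of one kernel launch configuration.
struct KernelUsage {
    int threads_per_block;
    int regs_per_thread;
    int shared_mem_per_block;  // bytes, 0 if the kernel uses none
};

// Resident blocks per MP: the tightest of the four limits wins.
int blocks_per_mp(const MpLimits& mp, const KernelUsage& k) {
    int by_threads = mp.max_threads / k.threads_per_block;
    int by_regs = mp.registers / (k.regs_per_thread * k.threads_per_block);
    int blocks = std::min({mp.max_blocks, by_threads, by_regs});
    if (k.shared_mem_per_block > 0) {
        blocks = std::min(blocks, mp.shared_mem_bytes / k.shared_mem_per_block);
    }
    return blocks;
}

// Minimum number of blocks that keeps every MP fully loaded.
int min_resident_blocks(const MpLimits& mp, const KernelUsage& k, int num_mps) {
    return blocks_per_mp(mp, k) * num_mps;
}
```

With assumed limits of 8 blocks, 1024 threads, 16384 registers and 16 KB of shared memory per MP, a kernel using 256 threads per block, 16 registers per thread and 4 KB of shared memory per block fits 4 blocks per MP, so a 14-MP device would want at least 56 blocks.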
If your kernel runs very fast, load balance also becomes a problem, and you really have to stay at multiples of the number of MPs.
As long as you have enough work per block, there should be no overhead in scheduling the blocks. In my experience, looping over elements has more impact on performance.
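If you want to compare block counts in a benchmark, the bookkeeping can be sketched as follows in plain C++. Both helper names are hypothetical: one rounds a candidate block count up to a multiple of the MP count, the other gives the number of simulations each block has to loop over for a total of N.

```cpp
#include <cstdint>

// Round a block count up to the next multiple of the multiprocessor count.
int round_up_to_multiple(int blocks, int num_mps) {
    return ((blocks + num_mps - 1) / num_mps) * num_mps;
}

// Simulations each block must loop over so that all n are covered
// (ceiling division, so the last block may have fewer real items).
std::int64_t sims_per_block(std::int64_t n, int blocks) {
    return (n + blocks - 1) / blocks;
}
```

For N = 524288 this gives 16384 simulations per block with 32 blocks and 1024 with 512 blocks, matching the numbers in the question. On a 14-MP device, 32 blocks rounds up to 42. When N does not divide evenly, the kernel needs a bounds check so the last block does not run past N.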