Hi there!
I am currently studying a paper on histograms, where naturally the authors want to maximize the amount of shared memory they can use (more shared memory means more bins).
This led me to a question I could not find an answer to.
Let's say my card has N KB of shared memory per SM, and it can map at most M blocks per SM at a time.
My question is: when writing my kernel, should I assume I have N KB of shared memory available per block, with the driver simply mapping fewer blocks per SM if there is not enough shared memory for multiple blocks? Or is the shared memory I can effectively use only N/M KB per block?
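To make N and M concrete, I mean the numbers reported by the runtime, roughly like this (just a sketch of the property query; the maxBlocksPerMultiProcessor field only exists in newer CUDA runtimes):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // N: shared memory physically present on one SM
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    // Hard upper bound a single block may request
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    // M: hardware limit on resident blocks per SM (newer CUDA runtimes only)
    printf("Max blocks per SM:       %d\n", prop.maxBlocksPerMultiProcessor);
    return 0;
}
```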
Any info on the matter would be really appreciated.
M.
This is generally a tradeoff: the choice affects occupancy, which in turn may affect performance.
Briefly, if you used the maximum of 48 KB per block, you would have a maximum occupancy of one block per SM. That is not necessarily the best-performing configuration, so using less per block (e.g. 32 KB or 16 KB) might yield substantial increases in performance.
The CUDA occupancy calculator may be interesting to experiment with.
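If you would rather query this programmatically than in the spreadsheet, the occupancy API gives the same kind of answer. A rough sketch, where histogram_kernel, the block size, and the shared-memory sizes are placeholders of mine rather than anything from your paper:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for a histogram kernel that keeps its bins
// in dynamically allocated shared memory.
__global__ void histogram_kernel(const unsigned int *in, unsigned int *bins, int n)
{
    extern __shared__ unsigned int s_bins[];
    // Token body so the placeholder compiles cleanly; a real kernel would
    // accumulate `in` into s_bins and then flush s_bins into `bins`.
    if (threadIdx.x == 0 && n > 0)
        bins[0] = s_bins[0] = in[0];
}

int main()
{
    const int blockSize = 256;                 // assumed block size
    const size_t sizesKB[] = {16, 32, 48};     // shared-memory requests to compare

    for (size_t smemKB : sizesKB) {
        int numBlocks = 0;
        // How many blocks of this kernel fit on one SM with this much
        // dynamic shared memory per block?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, histogram_kernel, blockSize, smemKB * 1024);
        printf("%2zu KB per block -> %d resident block(s) per SM\n",
               smemKB, numBlocks);
    }
    return 0;
}
```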
Yes, I had gathered from the paper that occupancy was going to be an issue. My doubt was whether the kernel would outright fail, complaining that I am requesting too many resources, or whether the driver would handle it by mapping fewer blocks per SM (i.e. with lower occupancy). Your answer confirms my understanding: the driver takes care of it, so as long as each block does not request more than the per-block maximum of shared memory, I won't get a launch failure.
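For reference, a quick way to see both behaviours side by side would be something like the sketch below (dummy_kernel and try_launch are just illustrative names, not real code from my histogram work):

```
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel that touches its dynamic shared memory.
__global__ void dummy_kernel()
{
    extern __shared__ unsigned char s_buf[];
    s_buf[threadIdx.x] = (unsigned char)threadIdx.x;
}

// Launch with a given amount of dynamic shared memory and report the result.
static void try_launch(size_t smemBytes)
{
    dummy_kernel<<<1, 32, smemBytes>>>();
    cudaError_t err = cudaGetLastError();
    printf("launch with %zu bytes of shared memory: %s\n",
           smemBytes, err == cudaSuccess ? "OK" : cudaGetErrorString(err));
    cudaDeviceSynchronize();
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    try_launch(prop.sharedMemPerBlock);      // at the per-block limit: launches (fewer blocks per SM)
    try_launch(prop.sharedMemPerBlock + 1);  // over the limit: the launch itself is rejected
    return 0;
}
```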
Thanks a lot for your answer.
M.