Question: I have read that SMs are loaded with multiple blocks at a time according to the amount of shared memory needed per block. For example: if each block needs 4 KB, then 4 blocks can be loaded at a time.
So my question is: do these 4 loaded blocks execute simultaneously, i.e. are they all in line for scheduling? If so, how is it ensured that their data in shared memory does not overlap, and does our code need to use any particular addressing style for shared memory to guarantee there is no overlap?
Or is this factor invisible to the programmer?
I am new to CUDA programming and still trying to properly grasp the coding tactics, so kindly pardon me if this is an irrelevant question.
It is invisible. You just address shared memory in the kernel, and the hardware figures out what offset to apply so that multiple running blocks don't write into each other's memory.
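To make that concrete, here is a minimal sketch (the kernel name, block size, and the 1 KB shared allocation are illustrative, not from this thread): every block declares and indexes the same `__shared__` array starting at 0, and the hardware gives each resident block its own slice of the SM's shared memory, so no manual inter-block offsetting is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative reduction-style kernel: every block declares the same
// __shared__ array and indexes it from 0. Each resident block gets its
// own physical slice of the SM's shared memory, so blocks never need
// to offset their addresses relative to one another.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // 1 KB per block in this sketch

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? in[gid] : 0.0f;  // block-local addressing
    __syncthreads();

    // Simple tree reduction inside this block's own shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];           // one partial sum per block
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // Several blocks can be resident on one SM at once (limited, among
    // other things, by the shared memory each block requests); the
    // hardware keeps their shared-memory allocations separate.
    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f (expect %d)\n", out[0], threads);
    cudaFree(in); cudaFree(out);
    return 0;
}
```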
I assume CUDA does not protect multiple blocks running on one SM from stomping on each other's shared memory space, as can happen with out-of-bounds array writes.
The GPU does have memory protection for global memory (one context cannot read or write another context's global memory). Whether that extends to shared memory is unknown, but I would guess there is no such protection.
I seem to recall wumpus doing experiments to find what else was stored in shared memory. He was accessing a shared memory array with a negative index to go outside the region allocated to the block, or something like that.
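For illustration only, a hypothetical sketch of what such a probe might look like; the kernel name and sizes are invented here, reading outside the declared array is undefined behaviour, and whatever values come back (if any) are hardware-specific.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe in the spirit of the experiment described above:
// declare a small shared array and read at negative (out-of-bounds)
// indices to see what, if anything, sits below this block's allocation.
// This is undefined behaviour and shown only to illustrate the idea.
__global__ void probeSharedMem(int *out, int probes)
{
    __shared__ int buf[32];
    buf[threadIdx.x % 32] = threadIdx.x;   // touch the legitimate region
    __syncthreads();

    if (threadIdx.x == 0) {
        for (int i = 1; i <= probes; ++i)
            out[i - 1] = buf[-i];          // read outside the allocation
    }
}

int main()
{
    const int probes = 8;
    int *out;
    cudaMallocManaged(&out, probes * sizeof(int));

    probeSharedMem<<<1, 32>>>(out, probes);
    cudaDeviceSynchronize();

    for (int i = 0; i < probes; ++i)
        printf("buf[-%d] = %d\n", i + 1, out[i]);

    cudaFree(out);
    return 0;
}
```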