How to efficiently use shared memory?

Edit: I figured this one out; I was fussing over something that is taken care of at the driver level. I removed the original question because I don’t want to confuse others.

I think the first assumption here is that shared memory would be the optimal solution.
To what extent this is true, I do not know.

You also seem to want to share shared memory between kernel blocks.

What is the relationship between the complete 1024x1024 matrix and the 32x32 blocks?
Is there some kind of data reuse across the 32x32 blocks, and is that why you wish to (re)use shared memory across kernel blocks as well?

You could have the blocks loop on a global-memory atomic counter: each block fetches its next index from the atomic and updates its indices internally, instead of relying on blockIdx.
This way, a block essentially ‘becomes’ or ‘behaves as’ many blocks, and you can ‘share’ shared memory across blocks (as you never really change the block, only the block addressing).
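
A minimal sketch of that looping idea, assuming a 1024x1024 matrix of unsigned ints split into 32x32 tiles. Since the original question was removed, the per-tile work here (column sums kept in shared memory) is only a placeholder, and every name (persistentTileSum, nextTile, colAcc, runPersistentTileDemo) is made up for illustration:

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int N = 1024;                 // matrix is N x N (assumed)
constexpr unsigned int TILE = 32;                // one logical block = one TILE x TILE tile
constexpr unsigned int TILES_PER_ROW = N / TILE;
constexpr unsigned int NUM_TILES = TILES_PER_ROW * TILES_PER_ROW;

// Each physical block keeps pulling tile indices from a global atomic counter,
// so its shared memory (colAcc) survives across many logical tiles.
__global__ void persistentTileSum(const unsigned int *in,
                                  unsigned int *nextTile,
                                  unsigned int *total)
{
    __shared__ unsigned int tile[TILE][TILE];    // scratch, reloaded for every tile
    __shared__ unsigned int colAcc[TILE];        // state that persists across tiles
    __shared__ unsigned int tileIdx;             // tile index broadcast to the block

    if (threadIdx.y == 0) colAcc[threadIdx.x] = 0;

    while (true) {
        // One thread claims the next tile for the whole block.
        if (threadIdx.x == 0 && threadIdx.y == 0)
            tileIdx = atomicAdd(nextTile, 1u);
        __syncthreads();
        unsigned int t = tileIdx;
        if (t >= NUM_TILES) break;               // same value for all threads, so no divergence

        // Derive the tile position from t rather than from blockIdx.
        unsigned int row = (t / TILES_PER_ROW) * TILE + threadIdx.y;
        unsigned int col = (t % TILES_PER_ROW) * TILE + threadIdx.x;
        tile[threadIdx.y][threadIdx.x] = in[row * N + col];
        __syncthreads();

        // Placeholder work: the first warp folds the tile into the running column sums.
        if (threadIdx.y == 0) {
            unsigned int s = 0;
            for (unsigned int r = 0; r < TILE; ++r) s += tile[r][threadIdx.x];
            colAcc[threadIdx.x] += s;
        }
        __syncthreads();                         // tile and tileIdx get overwritten next iteration
    }

    // Each block publishes its accumulated result once at the end.
    if (threadIdx.y == 0) atomicAdd(total, colAcc[threadIdx.x]);
}

// Host-side check: returns true if the GPU total matches a CPU reference sum.
bool runPersistentTileDemo(int numBlocks = 64)
{
    std::vector<unsigned int> h(N * N);
    unsigned int expected = 0;
    for (unsigned int i = 0; i < N * N; ++i) { h[i] = i % 7u; expected += h[i]; }

    unsigned int *dIn = nullptr, *dNext = nullptr, *dTotal = nullptr;
    cudaMalloc(&dIn, h.size() * sizeof(unsigned int));
    cudaMalloc(&dNext, sizeof(unsigned int));
    cudaMalloc(&dTotal, sizeof(unsigned int));
    cudaMemcpy(dIn, h.data(), h.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(dNext, 0, sizeof(unsigned int));  // the counter must be reset before every launch
    cudaMemset(dTotal, 0, sizeof(unsigned int));

    // Far fewer physical blocks than tiles: each block loops over many tiles.
    persistentTileSum<<<numBlocks, dim3(TILE, TILE)>>>(dIn, dNext, dTotal);

    unsigned int got = 0;
    cudaError_t err = cudaMemcpy(&got, dTotal, sizeof(got), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dNext); cudaFree(dTotal);
    return err == cudaSuccess && got == expected;
}
```

Calling runPersistentTileDemo() from main and checking that it returns true is enough to try it out. The number of physical blocks (64 here) is an arbitrary choice; in practice you would probably size it to what the device can keep resident, and whether this beats a plain one-block-per-tile launch depends on how much state is really reused across tiles.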

You should check out Mark Harris’ blog post http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/

I think that will answer all of your questions.