Redundant loads of data into shared memory?

I’m designing a kernel in which every thread processes the same immutable dataset, and each block loads that dataset into shared memory. This means the dataset is redundantly reloaded even when a copy has already been loaded by a previous block executing on the same SM. The redundancy gets worse when multiple blocks run concurrently on the same SM: multiple copies of the dataset then exist in that SM’s shared memory at the same time.

Am I thinking about this the wrong way? Is there a way around these redundant loads?

The only way to load shared memory is on a per-block basis: shared memory is logically associated with a particular block.

There is no way to reuse the contents of shared memory that was loaded by another block.

If you want to minimize this effect, then make sure that you are only creating enough blocks in your kernel launch to saturate the GPU (upper bound: 2048 threads per SM on many architectures, though the exact limit is architecture-dependent), but no more. Then, if necessary, put a loop of some sort in your kernel to process additional work items without reloading shared memory.
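The pattern described above is often called a grid-stride loop: launch only enough blocks to fill the GPU, have each block load the dataset into shared memory once, then loop over the remaining work. Here is a minimal sketch; the kernel name, `DATASET_SIZE`, and the per-element computation are all hypothetical placeholders, not part of the original question.

```cuda
#include <cuda_runtime.h>

#define DATASET_SIZE 1024  // hypothetical: dataset elements, must fit in shared memory

// Hypothetical kernel: load the immutable dataset into shared memory once
// per block, then use a grid-stride loop so a fixed-size grid covers all
// n work items without ever reloading shared memory.
__global__ void processAll(const float* __restrict__ dataset,
                           const float* __restrict__ in,
                           float* __restrict__ out,
                           int n)
{
    __shared__ float sdata[DATASET_SIZE];

    // Cooperative load: exactly one copy per block, regardless of how
    // many work items this block processes afterwards.
    for (int i = threadIdx.x; i < DATASET_SIZE; i += blockDim.x)
        sdata[i] = dataset[i];
    __syncthreads();

    // Grid-stride loop: each thread handles multiple items, amortizing
    // the shared-memory load across all of them.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += gridDim.x * blockDim.x)
    {
        // Placeholder computation using the shared dataset.
        out[idx] = in[idx] * sdata[idx % DATASET_SIZE];
    }
}
```

To size the launch so it just saturates the GPU, you can query the occupancy API on the host rather than hard-coding a thread count:

```cuda
int blockSize = 256, numSMs = 0, blocksPerSM = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, processAll,
                                              blockSize, 0);
int gridSize = numSMs * blocksPerSM;  // enough blocks to fill the GPU, no more
processAll<<<gridSize, blockSize>>>(d_dataset, d_in, d_out, n);
```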