I’m designing a kernel in which every thread processes the same immutable dataset, and each block loads that dataset into shared memory. This means the dataset gets redundantly reloaded even when the previous block to run on that SM has already loaded a copy. The redundancy gets worse when multiple blocks are resident on the same SM concurrently: multiple copies of the same dataset then occupy that SM's shared memory at once.
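For concreteness, here is a minimal sketch of the pattern I mean (`DATASET_SIZE`, `process`, and the buffer names are placeholders for my real kernel; only the per-block copy into shared memory matters):

```cuda
#include <cuda_runtime.h>

#define DATASET_SIZE 1024  // placeholder: small enough to fit in shared memory

// Stand-in for the actual per-thread work done against the shared dataset.
__device__ float process(const float* data, int n, int idx);

__global__ void kernel(const float* __restrict__ d_dataset,
                       float* d_out, int n_items)
{
    // Every block makes its own private copy of the same immutable dataset.
    __shared__ float s_dataset[DATASET_SIZE];
    for (int i = threadIdx.x; i < DATASET_SIZE; i += blockDim.x)
        s_dataset[i] = d_dataset[i];  // redundant if a prior or co-resident
                                      // block on this SM already loaded it
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_items)
        d_out[idx] = process(s_dataset, DATASET_SIZE, idx);
}
```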
Am I thinking about this wrongly? Is there a way to avoid these redundant loads?