Global memory broadcast? Reading the same global memory location from multiple blocks

Ok, so I want to read an array of 1024 ints in global memory from 2048 blocks, each with 1024 threads. The idea is that each block loads the same array of ints from global memory into shared memory, and then I can do scattered read accesses without heavy penalties. The question is: does global memory have a broadcast mechanism like the one in shared memory? Or will reads of the same global location by multiple threads be serialized? I'm not writing to the array at all, pure reads.
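Roughly, each block would do something like this (just a sketch; `perm` is a stand-in name for whatever index pattern I actually use):

```cuda
// Every block stages the same 1024-int table into shared memory with
// a coalesced load, then does its scattered reads out of shared memory.
__global__ void scatteredRead(const int *table,  // 1024 ints, shared by all blocks
                              const int *perm,   // per-thread scattered indices
                              int *out)
{
    __shared__ int s_table[1024];

    // Coalesced: thread i of the block loads table[i].
    s_table[threadIdx.x] = table[threadIdx.x];
    __syncthreads();

    // Scattered access now hits shared memory, not global memory.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = s_table[perm[tid] & 1023];
}

// Launch: scatteredRead<<<2048, 1024>>>(d_table, d_perm, d_out);
// (1024 threads per block needs compute capability 2.0+, so Fermi is fine.)
```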

Also, I'm using Fermi hardware, so I'm wondering: does the L2 cache help with this, and does the L2 cache have a broadcast mechanism?

Thanks,

Wen

Make sure to access your memory in a coalesced pattern and you should be fine. This is the ideal case for the L1/L2 caches: you should get a broadcast from the L2 to the L1, then from the L1 into shared memory. As long as something else doesn't evict your working set, you should get near-perfect reuse at both cache levels. Just make sure your accesses are coalesced; otherwise they get split into multiple transactions and effectively serialized through the L1.

Also, you may want to experiment with the cache modifiers. On Fermi, global loads are cached in both L1 and L2 by default (nvcc's `-Xptxas -dlcm=ca`); `-Xptxas -dlcm=cg` restricts caching to the L2. Since your reuse is between blocks, which only share the L2, it's worth measuring both settings for this application.
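If you'd rather control this per load instead of per compilation unit, the PTX cache operators can be reached through inline assembly. An untested sketch (helper names are mine):

```cuda
// Per-instruction cache operators via inline PTX, as an alternative to
// the whole-program -Xptxas -dlcm=ca / -dlcm=cg compiler flags.

__device__ __forceinline__ int load_ca(const int *p)  // cache in L1 and L2
{
    int v;
    asm volatile("ld.global.ca.s32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}

__device__ __forceinline__ int load_cg(const int *p)  // cache in L2 only
{
    int v;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}
```

The `"l"` constraint assumes 64-bit device pointers; profile both variants, since which one wins depends on how much of your working set competes for the 16/48 KB L1.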

Great, thanks a lot for the info. Yeah, I do a coalesced read into shared memory, then scattered accesses on shared memory, and the data is fully reused between blocks. I'll look at the cache modifiers now.