OK, so I want 2048 blocks, each with 1024 threads, to read the same array of 1024 ints from global memory. The idea is that each block loads the array from global memory into shared memory, after which I can do scattered reads without heavy penalties. My question is: does global memory have a broadcast mechanism like the one in shared memory, or will reads of the same global location by multiple threads be serialized? I never write to the array; it is read-only.
Also, I’m on Fermi hardware, so I’m wondering whether the L2 cache helps here, and whether L2 has a broadcast mechanism of its own.
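For concreteness, here is a minimal sketch of the access pattern I'm describing. The kernel name `scatteredLookup`, the `indices` and `out` arrays, and the index masking are placeholders I made up for illustration; only the 1024-int table and the 2048 x 1024 launch shape come from my actual setup.

```cuda
#include <cuda_runtime.h>

#define TABLE_SIZE 1024  // number of ints in the shared lookup array

// Hypothetical kernel: every block stages the same global array into
// shared memory, then each thread does one scattered read from it.
__global__ void scatteredLookup(const int *table,    // 1024 ints, read-only
                                const int *indices,  // placeholder: one index per thread
                                int *out)            // placeholder: one result per thread
{
    __shared__ int sTable[TABLE_SIZE];

    // Cooperative copy: consecutive threads read consecutive addresses.
    // With 1024 threads per block the loop body runs once per thread.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        sTable[i] = table[i];

    // Make the whole table visible to every thread in the block.
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Scattered read from shared memory. The mask keeps the index in
    // [0, TABLE_SIZE) and relies on TABLE_SIZE being a power of two.
    out[gid] = sTable[indices[gid] & (TABLE_SIZE - 1)];
}
```

I would launch it as `scatteredLookup<<<2048, 1024>>>(dTable, dIndices, dOut);`, where `dTable`, `dIndices`, and `dOut` are device pointers (again, hypothetical names). The part I'm asking about is the `sTable[i] = table[i]` copy, since all 2048 blocks read the same 4 KB of global memory there.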