Let’s say we have 128 threads per block and 6,144 blocks. Now suppose there is a large data structure (megabytes’ worth) in device memory that every thread reads linearly from start to finish, so all threads perform identical reads and the kernel has no branches. In theory, all threads would issue the identical fetch at exactly the same time and should be able to share a single fetch from device memory. In practice (because we launch more threads/blocks than the GPU can execute concurrently), only the threads/blocks currently resident could hope to share the same fetch. Sharing the fetch would reduce bandwidth consumption considerably and thus could greatly improve performance for a bandwidth-bound kernel.
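For concreteness, here is a minimal sketch of the access pattern I have in mind (the kernel and array names are made up for illustration):

```
__global__ void scanTable(const float *table, size_t tableLen, float *out)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    // No branches: every thread issues the same sequence of loads,
    // so in principle each load could be serviced by a single fetch
    // shared among all currently executing threads.
    for (size_t i = 0; i < tableLen; ++i)
        acc += table[i];

    out[tid] = acc;
}

// Launched as in the scenario above:
//   scanTable<<<6144, 128>>>(dTable, tableLen, dOut);
```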
I’d like to know the implications for memory bandwidth as the software and hardware are currently designed. Is there any sharing of memory fetches between threads in the same block, or even across different blocks, when the fetches target the same address at the same time? Is there a cache that would allow a single fetch from device memory to satisfy all requesting threads/blocks by returning the fetched value from that cache?
I am concerned that increasing the number of threads or blocks may needlessly increase my application’s device-memory bandwidth requirements if there is no hardware support for combining simultaneous accesses to the same address into a single fetch for multiple threads/blocks.
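If it turns out there is no such hardware combining, I assume the fallback would be to stage the data through shared memory so that each chunk is fetched from device memory only once per block. A rough sketch of what I mean (kernel name hypothetical, assuming the 128-thread blocks from above):

```
__global__ void scanTableStaged(const float *table, size_t tableLen, float *out)
{
    __shared__ float chunk[128];   // one element per thread in the block
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (size_t base = 0; base < tableLen; base += blockDim.x) {
        // Cooperative, coalesced load: one device-memory fetch per
        // chunk per block instead of one per thread.
        size_t i = base + threadIdx.x;
        chunk[threadIdx.x] = (i < tableLen) ? table[i] : 0.0f;
        __syncthreads();

        size_t n = tableLen - base;
        if (n > blockDim.x) n = blockDim.x;
        for (size_t j = 0; j < n; ++j)
            acc += chunk[j];       // all threads re-read from shared memory
        __syncthreads();           // before the next chunk overwrites it
    }

    out[tid] = acc;
}
```

That only helps within a block, though, which is part of why I’m asking whether the hardware already does something equivalent across blocks.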
Can someone please comment on my concern?