Optimizing App Memory Bandwidth Requirements Optimizing App Memory Bandwidth Requirem

Let’s say we have 128 threads per block and 6,144 blocks. Now let’s say there is a large data structure (megabytes worth) in device memory that is read in a linear fashion from start to finish by all threads such that every thread performs identical reads, and there are no branches in the kernel. Theoretically, all threads would do the identical fetch at exactly the same time, and should be able to share the single fetch from device memory. In practice (because we use more threads/blocks than the GPU actually has), only those threads/blocks currently executing could hope to share the same fetch. The sharing of the fetch would reduce bandwidth consumption considerably and thus could greatly increase performance for a bandwidth-bound kernel.

I’d like to know the implications for memory bandwidth as the software and hardware is currently designed. Is there any sharing of memory fetches between threads in the same block, or even among different blocks, when the memory fetch is done simultaneously from the same address? Is there some cache that exists that would allow for a single fetch from device memory to the cache to satisfy all requesting threads/blocks by returning the fetched value from that cache?

I am concerned that increasing the number of threads or blocks may actually increase device memory bandwidth requirements of my application needlessly if there is no HW to support combining memory accesses into a single access for multiple threads/blocks.

Can someone please comment on my concern?


as I understand , there is no cache. However, if many threads in the block read the same global memory address, you better try something like this in the kernel (pseudocode)

check_if_I_am_thread0 {
// if so
do global memory read
put value to shared memory
__synchThreads(); // make sure all threads wait until the zero one will fetch

// now threads access this value from the shared memory, without global read, so shared memory IS THE CACHE

Thanks for the tip. That can work within a single block. But my understanding is you need to use multiple blocks to gain access to more than one multiprocessor. Is that correct? In this case, each block is doing a fetch of the same value from memory, right? Tell me, do all multiprocessors share from the same “global” pool of memory bandwidth to device memory, or does each multiprocessor have a dedicated path to device memory? (In the latter case total device bandwidth would be estimated by summing the bandwidth that each multiprocessor has to device memory.)



cuda does not reveal that technical details, however, normally only a few blocks can be run really parallel (concurently) to each other. So even if block-reads will be serialized this is not as bad as if each THREAD-read would have been serialized, because you normally have hunderds of threads per block.

it was theory. now practice sais you should really try to see what happens ;-)

The maximum amount of blocks that do run concurrently is 8 blocks per MP * number of MP (16)
So that is 128 blocks at the same time.

Memory reads are not cached. If you have every thread read every value from global memory: every thread does just that and your app will be very slow. Especially since such reads will also not be coalesced.

The most efficient way to loop through all values in global memory with each thread is to use the shared memory as a sliding window cache. The advantage of this method over just having thread 0 perform the read is that all threads participate in a coalesced read, fully utilizing the memory bandwidth of the device.

Example (no bounds checking for data sizes not a multiple of the block size, this is left as an exercise to the reader):

__shared__ int sdata[BLOCK_SIZE];

for (int cur_offset = 0; cur_offset < array_size; cur_offset += blockDim.x)



    sdata[threadIdx.x] = data_array[cur_offset + threadIdx.x];


   for (int cur_val = 0; cur_val < blockDim.x; cur_val++)



        // doSomething, whatever



Yes, I know this only shares reads among each block so that each value is read NBLOCKS times. There is no way around this, however. CUDA’s data-parallel architecture offers no inter-block cache. The data-parallel model can’t in general if you think about it, because not all blocks are executing simultaneously and the interleaved execution model means that different blocks will always be at different portions of the calculation.

Edit: To answer your later question: all blocks share the same pool of device memory.

You could try using textures since they ARE cached…and would require slightly less programming effort and a result in a slightly cleaner kernel.

I don’t know which method would be faster though - you probably would have to try both.

The shared memory method is faster: I have tried.

The texture “cache” is effective at making uncoalsced memory reads, but threads within a warp must still access values with good data locality. While having all threads in a warp read the same value from the texture counts as good data locality, it still does not boost performance over the raw device memory throughput obtainable via coalesced reads. Try it yourself: you will get an effective memory bandwidth of ~70 GiB/s (8800 GTX), no faster.

The shared memory version will (…should) also get 70 GiB/s, but the actual amount of memory transfered is reduced by a factor of 1/BLOCK_SIZE due to the shared reads. Depending on the amount of calculations per element, the full memory bandwidth of the device may not be realized due to the computation being the bottleneck.