How many times does a value need to be reused before it's worth putting into shared memory?

Hello!

I am certain this question has been asked somewhere before - I just cannot find where (so pointers to resources would be great). But I'm wondering: how many times does a variable need to be reused to be worth loading into shared memory?

In my current problem, each thread in the thread block accesses the same seventeen float values (once each) - so would I be correct in thinking that these values would be worth loading into shared memory?

I'm wondering if there is a good rule of thumb for how many times a piece of data needs to be reused before shared memory becomes beneficial? Or is it very program-specific?

Thanks

More recent GPUs have gotten better at caching (larger caches, basically). In my experience, if the L1 cache is effective, then the additional benefit of shared memory is pretty small, and you would want a reuse factor of more than 2 or 3 times for each element. I think if you look hard enough you will find people (recently, like in the last 5 years) who have written forum posts asking why their shared-memory-refactor-optimization didn’t provide any significant benefit for a reuse factor of ~3x.

Whether the L1 cache is getting thrashed is going to be program-dependent, as is whether shared-memory vs. global latency is actually a limiting factor (and therefore noticeable as a performance optimization).

You can write a directed test to answer this yourself.
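For the seventeen-value case in your question, such a test could look like this rough sketch (all kernel and variable names here - viaL1, viaShared, kNumCoeffs - are made up for illustration, and the broadcast-read pattern is an assumption about your access pattern): one kernel reads the 17 floats straight from global memory and relies on the cache, the other stages them in shared memory once per block.

```
// Directed test sketch: 17 broadcast floats via L1 vs. via shared memory.
// Buffers are left uninitialized since only timing matters here.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kNumCoeffs = 17;

// Variant A: every thread reads the same 17 floats from global memory;
// after the first warp touches them, the reads should hit in cache.
__global__ void viaL1(const float *__restrict__ coeffs,
                      const float *__restrict__ in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int c = 0; c < kNumCoeffs; ++c)
        acc += coeffs[c] * in[i];
    out[i] = acc;
}

// Variant B: stage the 17 floats in shared memory once per block.
__global__ void viaShared(const float *__restrict__ coeffs,
                          const float *__restrict__ in, float *out, int n)
{
    __shared__ float s[kNumCoeffs];
    if (threadIdx.x < kNumCoeffs) s[threadIdx.x] = coeffs[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int c = 0; c < kNumCoeffs; ++c)
        acc += s[c] * in[i];
    out[i] = acc;
}

int main()
{
    const int n = 1 << 24;
    float *coeffs, *in, *out;
    cudaMalloc(&coeffs, kNumCoeffs * sizeof(float));
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    viaL1<<<grid, block>>>(coeffs, in, out, n);      // warm-up
    cudaEventRecord(t0);
    viaL1<<<grid, block>>>(coeffs, in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("via L1:     %.3f ms\n", ms);

    viaShared<<<grid, block>>>(coeffs, in, out, n);  // warm-up
    cudaEventRecord(t0);
    viaShared<<<grid, block>>>(coeffs, in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("via shared: %.3f ms\n", ms);
    return 0;
}
```

Don't be surprised if the two variants come out nearly identical: both kernels are dominated by the streaming of in and out, which is exactly the point above about whether latency is actually the limiting factor.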


Besides the reuse, a large advantage of shared memory is that the accesses do not have to be coalesced - as long as they are distributed across the banks, they can be random.

The coalescing requirements for global memory accesses have also been relaxed a lot on recent architectures, so shared memory is needed less for this than on early architectures. The coalescing requirements for global memory are now:
A) As long as the memory block accessed by the warp as a whole is aligned and contiguous, the individual threads can load arbitrary parts of that block.
B) Performance is good enough if the accesses fall within four 32-byte blocks instead of one 128-byte block.

But this ability to do random per-thread accesses is still useful in many cases.

A typical scenario is redistributing data with a different alignment to the 32 threads of a warp.

Often you see the row-first vs. column-first access of a 32x32 matrix as the example, but here is another one: say you have 7 buckets (0…6) of data in separate arrays and want to distribute their elements to the threads one after the other, so thread 0 reads the first element of bucket 0, thread 6 reads the first element of bucket 6, thread 7 reads the second element of bucket 0, and so on.

Within a loop over the buckets, you would read 32 elements from each bucket and store them in shared memory. Then you would loop again, read the correct elements from shared memory into the threads, and process them. This resorting of data can only be done efficiently with shared memory, not with the L1 cache and direct reads by the threads.

Here shared memory would be more efficient with a reuse factor of 1x or even smaller than 1x.
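Here is a rough sketch of that resorting for one warp, under a few assumptions (a single 32-thread block; the names redistribute, stage and elemsPerBucket are made up). Phase 1 loads 32 contiguous elements per bucket with coalesced accesses; phase 2 reads them back in the interleaved order. As a bonus, staging element e of bucket b at index e * 7 + b keeps both phases free of bank conflicts, because 7 is coprime to the 32 banks.

```
#include <cuda_runtime.h>

constexpr int NUM_BUCKETS = 7;
constexpr int WARP        = 32;

// buckets: array of NUM_BUCKETS device pointers, each holding
// elemsPerBucket floats. out receives the interleaved stream:
// out[i] = buckets[i % 7][i / 7]. Launch with one 32-thread block.
__global__ void redistribute(const float *const *buckets,
                             float *out, int elemsPerBucket)
{
    // Element e of bucket b is staged at index e * 7 + b. Since 7 is
    // coprime to 32, the strided writes in phase 1 and the linear
    // reads in phase 2 both hit 32 distinct banks.
    __shared__ float stage[WARP * NUM_BUCKETS];
    int lane = threadIdx.x;

    for (int base = 0; base < elemsPerBucket; base += WARP) {
        // Phase 1: coalesced load of 32 contiguous elements per bucket.
        for (int b = 0; b < NUM_BUCKETS; ++b)
            if (base + lane < elemsPerBucket)
                stage[lane * NUM_BUCKETS + b] = buckets[b][base + lane];
        __syncwarp();

        // Phase 2: emit the 7 * 32 interleaved elements of this pass.
        // Output index i maps to bucket i % 7, element i / 7, which
        // sits at staging index i - 7 * base.
        for (int j = 0; j < NUM_BUCKETS; ++j) {
            int i = NUM_BUCKETS * base + j * WARP + lane;
            if (i < NUM_BUCKETS * elemsPerBucket)
                out[i] = stage[i - NUM_BUCKETS * base];
        }
        __syncwarp();
    }
}
```

In a real kernel you would of course process the fetched value instead of just writing it out; the write to out only stands in for the per-thread work.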

Another advantage of shared memory over registers (but not over global memory) is the possibility of dynamic indexing.
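For example (again with made-up names): an array indexed by a value known only at run time cannot be held in registers, since registers are not addressable, so a per-thread local array would typically be spilled to local memory. A shared-memory array serves the dynamic index directly:

```
#include <cuda_runtime.h>

// Hypothetical example: a small lookup table with a runtime-computed
// index. Assumes blockDim.x >= 17.
__global__ void lookup(const int *idx, const float *table,
                       float *out, int n)
{
    __shared__ float lut[17];
    if (threadIdx.x < 17)
        lut[threadIdx.x] = table[threadIdx.x];  // fill once per block
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut[idx[i] % 17];  // index known only at run time
}
```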
