Suggestion for fermi: kill shared memory bank conflicts + coalescing

It would be fantastic if you could design the new hardware in a way that shared memory bank conflicts and global memory coalescing wouldn't need to be managed. I actually find them a bit complicated to deal with.


Thanks for the suggestions!

Fermi does include caches for global loads, so coalescing is less of an issue. Eliminating shared memory bank conflicts completely would be very expensive in hardware, but you can still get good performance even with some conflicts.

Here’s a trick I picked up at the GTC conference to avoid bank conflicts: always offset your shared memory lookups by your local work-item ID.

__local float mem[N];

size_t localIdx = get_local_id(1);

for (int i = 0; i < N; i++) {
    float val = mem[(i + localIdx) % N];
    // ... use val ...
}

What you lose to the modulo arithmetic you gain back by eliminating bank conflicts. Plus, the modulo is really cheap if N is a power of two: it can then be replaced by a bitwise AND.

// N must be a power of two
__local float mem[N];

size_t localIdx = get_local_id(1);

for (int i = 0; i < N; i++) {
    float val = mem[(i + localIdx) & (N - 1)];
    // ... use val ...
}

Hope this helps.

Thanks for the trick, coleb.

Btw, I think the shared/texture cache memory is too small for things like ray tracing.
Just some maths: I use 2 bboxes per tree node (pmin + pmax = 24 bytes). For a tree 60 levels deep that would be
60 * 24 = 1440 bytes fetched through textures per thread (I won’t even count the leaves here, just the branches).
Using 256 threads per block will then require 360 KB of texture cache for a ray reaching the deepest nodes (I assume fully incoherent rays).

I would also need 60 * (2 uints) * sizeof(uint) = 480 bytes of shared memory per thread to maintain a node stack.
256 threads = 120 KB for the tree’s stack in shared memory (and I need it in shared memory because global memory is very slow).

The G80 cache is 4-6 KB and the shared memory 16 KB (48 KB on Fermi)… but I would need 360 KB of texture cache / 120 KB of shared memory at a minimum… better 384 KB + 128 KB to round up (because the leaves also need some caching).
For a card with 30 multiprocessors (like the GTX 280) that comes to 12 MB of cache (same as an i7) plus 4 MB of shared memory. Am I asking for the impossible, then?

Pretty much yes.

When you design a chip, you have serious real-estate constraints. The size of the die is almost set in stone and the only thing you can work with is how to arrange the transistors. Either more FPUs, more control logic or more cache. Cache is huge, physically, so adding more cache will cost you compute throughput and will make the overall architecture almost equivalent to the one in i7, which will mean that it’s no longer useful as a GPU.

Global memory isn’t slow! It’s as fast as an ordinary CPU’s L1 cache in terms of bandwidth. In fact, the i7’s L1 cache has a peak read/write bandwidth of 50 GB/s, which is about what you get from an 8800 GTS and about half, if not a third, of what you get from a GTX 285. The issue is not bandwidth, it’s latency (which can be hidden very well automatically if you have enough computation per memory access). And also the fact that uncoalesced accesses require several separate transactions which, while each executing near peak, lower the effective bandwidth.

By the way, a CPU also benefits greatly from “coalesced” memory access. Accessing memory in aligned blocks is always faster than random access, no matter how big your caches are - it’s fundamental to how memories work. Random accesses will thrash the caches anyway in many cases. CPUs generally have lower peak memory bandwidth and compute throughput, so the relative difference between good and bad access patterns is less pronounced. Big caches try to hide the overhead of random access, but they only work within some locality; truly random pointer-chasing and jumping will kill performance.

If you’ve written high-performance CPU apps with SSE/MMX code, you’ve noticed that the constraints imposed by those technologies are as strict as with CUDA. And if you then consider all the precautions one has to take not to thrash the cache by making a random jump somewhere that evicts a whole cache line - which gets even worse with many cores - it becomes clear that writing high-throughput code for problems with complicated memory access patterns is hell on any architecture, be it i7, GPU or Cell.

Yup, the name of the game now is latency. Herb Sutter’s talk should be required viewing before playing with GPU computing:



Just found this one as well; it’s a good discussion of how CPUs are running up against the limits of everything they can do to hide latency: …modern-hardware

And if you really have a lot of time to burn, the canonical “What every programmer should know about memory”

Another trick I just used when I ran out of shared memory was to keep my data in shared memory as halves. This works especially well if the data is read-only, since you’re not repeatedly updating the values and compounding the precision loss. Note that the kernel arguments are stored in shared memory as well; one way to keep a long argument list from eating your shared memory is to pack the arguments into a struct and put it in constant memory. If latency is killing you, increase the number of work items in flight. If you’re hitting the register limit, try the -cl-nv-maxrregcount=n build option when building the kernel (note: this is a non-standard, NVIDIA-only OpenCL build flag).

There’s enough memory (and bandwidth) there - you just may have to jump through some hoops to use it.


Thanks for the resources there, very interesting.

I’ve noticed that Cliff Click, while talking about pipeline stalls, said that a lot of GPUs stall when waiting for a memory fetch to complete. That’s not true: CUDA cards don’t stall, and I’m pretty sure AMD’s don’t either. I thought it was worth pointing out so that people don’t get the wrong idea.

I find Herb Sutter to be an outstanding speaker!

Thanks for the post.