How to calculate shared memory bandwidth?

This article suggests the GTX 690 has only half the shared memory bandwidth of the GTX 460. How does it arrive at this number? Thanks

This is the ratio of shared memory bandwidth to FLOPS, for 32-bit accesses.

This is mainly due to the enormous increase in SP FLOPS on Kepler (compared to Fermi), which is targeted at 3D gaming. Notice that GK110 has the same ratio, but NVIDIA chose to increase the L2 cache size to try to mitigate this problem for GPGPU computing.

Thanks for your reply. Does that mean GK110's suitability for GPGPU is still questionable?

Every design choice is a tradeoff, and there is no reason the future has to be a superset of the present, so "suitability" depends on your problem. Kepler has less shared memory bandwidth per FLOP, so if your code is limited by shared memory bandwidth, then yes, Kepler will be worse. I don't know how many programs actually have that restriction, however. I've seen various programs that I've written run anywhere from 10% slower to 100% faster on the GTX 680 compared to the 580. The 10% slower case was a program with quite an erratic memory access pattern, so I suspect the reduction in L2 cache and global memory bandwidth hurt the most there. The 100% faster case was a program that was compute bound, and most of the computation was special functions. Here I think the very large number of CUDA cores and extra special function units helped.

(One reminder: It is really, really important that you benchmark different block sizes on Kepler. I made this mistake initially, and my compute-bound code ran slower than on Fermi. Re-optimizing led to a massive speedup in that case. It didn't help the memory-bound program, but that was to be expected.)
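As a sketch of what that block-size benchmarking can look like (the kernel and problem size here are placeholders, not code from this thread), a simple sweep timed with CUDA events:

```cuda
#include <cstdio>

__global__ void my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // stand-in workload
}

int main() {
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // On Kepler, the best block size often differs from Fermi's:
    // sweep and measure rather than reusing old launch parameters.
    for (int block = 64; block <= 1024; block *= 2) {
        int grid = (N + block - 1) / block;
        my_kernel<<<grid, block>>>(d, N);        // warm-up launch
        cudaEventRecord(start);
        my_kernel<<<grid, block>>>(d, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %4d: %.3f ms\n", block, ms);
    }
    cudaFree(d);
    return 0;
}
```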

The relative reduction in shared memory bandwidth in Kepler might be the reason for the addition of the shuffle instructions, which allow threads in a warp to exchange data without shared memory at all. (In effect, shuffle adds an optional extra layer in the memory hierarchy between registers and shared memory.) The instruction seems to be tailor-made for FFTs, which probably put a lot of pressure on shared memory.
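For illustration, here is a minimal warp-wide sum reduction using shuffle. This is my own sketch, using the pre-CUDA 9 `__shfl_xor` intrinsic that was current for Kepler (newer toolkits use `__shfl_xor_sync` with an extra mask argument); it requires compute capability 3.0 or higher:

```cuda
__device__ float warp_reduce_sum(float val) {
    // Butterfly exchange: each step halves the number of distinct values,
    // moving data register-to-register with no shared memory traffic.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_xor(val, offset);
    return val;  // every lane now holds the warp-wide sum
}
```

The same reduction on Fermi would need a shared memory array and `__syncthreads()` between steps, so the shuffle version both saves shared memory bandwidth and avoids synchronization.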

On page 13, it says shared memory bandwidth has been doubled to 256 B per core clock. What does that mean? Does that mean that, since the GFLOPS is tripled, the effective shared memory bandwidth per FLOP is still lower, and prudent use of the shuffle instruction is recommended?

If I understand the details in that document correctly, the shared memory bandwidth is only doubled if each thread does a 64-bit load from shared memory. Basically, a warp of threads on GK110 doing a 32-bit or a 64-bit load takes the same amount of time, so in the 64-bit case the bandwidth is twice as high as in the 32-bit case.
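As a sketch of how to take advantage of that (the kernel and names here are hypothetical), shared memory can be declared with 64-bit elements so each thread moves 8 bytes per access, and the host can opt in to Kepler's 8-byte bank mode:

```cuda
#include <cuda_runtime.h>

__global__ void scale_doubles(const double *in, double *out, int n) {
    extern __shared__ double tile[];        // 64-bit elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // one 8-byte load per thread
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * 2.0;
}

int main() {
    // Opt in to 8-byte shared memory banks on Kepler, so a warp of 64-bit
    // accesses is serviced bank-conflict-free (no effect on other chips).
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    // ... allocate buffers and launch, e.g.
    // scale_doubles<<<grid, block, block * sizeof(double)>>>(d_in, d_out, n);
    return 0;
}
```

With the 8-byte bank configuration, a warp loading one `double` per thread moves 256 B per access versus 128 B for 32-bit loads in the same number of cycles, which is where the doubled figure comes from.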