GK104 / GK110 shared memory bandwidth discussion

Hi,

While benchmarking different kernels on GK104, I’ve noticed an interesting pattern: my shared-memory-intensive applications (e.g. filters) don’t scale nearly as well as kernels that don’t need to share data on-chip, compared to the older GF100.

When you look at the new SMX this seems to make sense, since it has 6x more FPUs but only 2x more LD/ST units than a GF100 SM.

So I’m concluding that shared memory bandwidth relative to compute is down to about 1/3 of GF100 for both GK104 and GK110. Is this correct?
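
To make the ratio explicit, here is the arithmetic with the per-multiprocessor unit counts from NVIDIA's published specs (32 CUDA cores and 16 LD/ST units per GF100 SM, 192 cores and 32 LD/ST units per Kepler SMX). It only counts units and ignores clock differences and bank width, so treat it as a rough estimate:

$$
\frac{\text{LD/ST units per core (SMX)}}{\text{LD/ST units per core (GF100 SM)}}
= \frac{32/192}{16/32}
= \frac{1/6}{1/2}
= \frac{1}{3}
$$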

This brings an even stronger argument for using the new register shuffle instructions… :)
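
As a minimal sketch of what the shuffle route looks like, here is a warp-level sum that never touches shared memory. It assumes a full warp of 32 active threads and uses the `__shfl_down_sync` intrinsic from CUDA 9 and later (Kepler-era toolkits spelled it `__shfl_down`, without the mask). `sum32_on_gpu` is a hypothetical host helper added only for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sum the 32 values held by one warp using registers only.
__inline__ __device__ float warp_sum(float v) {
    // Each step adds the value from the lane `offset` positions higher.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // lane 0 now holds the sum of all 32 lanes
}

__global__ void warp_sum_kernel(const float* in, float* out) {
    float v = warp_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = v;
}

// Hypothetical helper: sums exactly 32 floats with a single warp.
float sum32_on_gpu(const std::vector<float>& h) {
    float *d_in = nullptr, *d_out = nullptr, result = 0.0f;
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), 32 * sizeof(float), cudaMemcpyHostToDevice);
    warp_sum_kernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `sum32_on_gpu` with a vector of 32 floats returns their sum; shuffles need compute capability 3.0 or higher.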

Kind regards,
JP

So I’ve continued to notice poor scaling for shared memory intensive applications.

From what I can read, one potential improvement is to work with 64-bit words, for example replacing float with float2, since Kepler can read a 64-bit word from shared memory as efficiently as a 32-bit word.

In other words, the shared memory banks have been widened to 8 bytes from the previous 4.

This however is not entirely feasible for all of my applications, but it certainly makes sense for doing double precision ( 8 bytes ) computations on the GK110.
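
A rough sketch of the float2 idea, assuming a fixed block size of 128 threads: every shared memory access below moves one 8-byte word. The call to `cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte)` requests 8-byte banks on Kepler; other architectures ignore it and newer toolkits may flag it as deprecated. The kernel and `swap_pairs_on_gpu` are hypothetical names for illustration only, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores one float2 (8 bytes) to shared memory and reads its
// neighbour's float2 back, so every shared access is a 64-bit word.
__global__ void swap_neighbour_pairs(const float2* in, float2* out) {
    __shared__ float2 tile[128];  // assumes blockDim.x == 128
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x ^ 1];  // neighbour within the same pair of threads
}

// Hypothetical helper; h.size() must be a multiple of 256 floats.
std::vector<float> swap_pairs_on_gpu(const std::vector<float>& h) {
    const size_t bytes = h.size() * sizeof(float);
    const int pairs = static_cast<int>(h.size() / 2);
    float2 *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h.data(), bytes, cudaMemcpyHostToDevice);
    // Ask for 8-byte shared memory banks (meaningful on Kepler only).
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    swap_neighbour_pairs<<<pairs / 128, 128>>>(d_in, d_out);
    std::vector<float> result(h.size());
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The same pattern applies to double, which is already 8 bytes wide.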

Thanks for exploring this. I had hoped we could adjust our shared mem dependencies to use shuffles, but if shuffles can only be done among 32 threads in a warp, we’re out of luck; in the most important cases, our workgroups range from 30 - 100 threads. PS: our code is almost the same for CUDA and OpenCL, and 28 nm GCN gave a nice boost over Fermi. Problem is, what if GCN v2 is better (for us) than GK110?

Well I’m secretly hoping that there is some “secret sauce” in GK110 which will give it some boost for the shared memory intensive applications.

I’m pretty excited about GK110 for the:

  • 255 registers per thread
  • Increased L2 cache size ( atomics should go way up )
  • GPUDirect & dynamic parallelism

I’ve actually found some “new” strategies to overcome the smem bandwidth issue.

  • Load from global to shared
  • Load from shared to registers ( register loop unrolling, template party! )
  • Compute: Maximize reuse in registers
  • Store/reduce result to shared
  • Store to global from shared

This applies to more applications than one might first think. And in some cases it means increasing the instruction-level parallelism of each thread.
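
The five steps above can be sketched as a 1D box filter, under assumptions that are mine rather than from the thread: zero padding at the borders, template parameters `RADIUS`, `OUT_PER_THREAD` and `THREADS`, and a hypothetical host helper `box_filter_on_gpu`. Each thread pulls a small window into registers once and computes several outputs from it, so it issues `OUT_PER_THREAD + 2*RADIUS` shared loads instead of `OUT_PER_THREAD * (2*RADIUS + 1)`.

```cuda
#include <cuda_runtime.h>
#include <vector>

template <int RADIUS, int OUT_PER_THREAD, int THREADS>
__global__ void box_filter_1d(const float* in, float* out, int n) {
    constexpr int TILE = THREADS * OUT_PER_THREAD;  // outputs per block
    __shared__ float s_in[TILE + 2 * RADIUS];
    __shared__ float s_out[TILE];
    const int base = blockIdx.x * TILE;

    // 1. Global -> shared, including the halo (zero padded at the borders).
    for (int i = threadIdx.x; i < TILE + 2 * RADIUS; i += THREADS) {
        const int g = base + i - RADIUS;
        s_in[i] = (g >= 0 && g < n) ? in[g] : 0.0f;
    }
    __syncthreads();

    // 2. Shared -> registers. An odd OUT_PER_THREAD keeps the per-thread
    //    stride coprime with the 32 banks assumed here.
    float r[OUT_PER_THREAD + 2 * RADIUS];
    const int first = threadIdx.x * OUT_PER_THREAD;
    #pragma unroll
    for (int k = 0; k < OUT_PER_THREAD + 2 * RADIUS; ++k)
        r[k] = s_in[first + k];

    // 3. Compute from registers, 4. stage the results in shared memory.
    #pragma unroll
    for (int o = 0; o < OUT_PER_THREAD; ++o) {
        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k <= 2 * RADIUS; ++k)
            acc += r[o + k];
        s_out[first + o] = acc;
    }
    __syncthreads();

    // 5. Shared -> global with consecutive threads writing consecutive words.
    for (int i = threadIdx.x; i < TILE; i += THREADS)
        if (base + i < n) out[base + i] = s_out[i];
}

// Hypothetical helper: radius-2 box sum, 3 outputs per thread, 128 threads.
std::vector<float> box_filter_on_gpu(const std::vector<float>& h) {
    constexpr int R = 2, OPT = 3, T = 128;
    const int n = static_cast<int>(h.size());
    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h.data(), bytes, cudaMemcpyHostToDevice);
    const int blocks = (n + T * OPT - 1) / (T * OPT);
    box_filter_1d<R, OPT, T><<<blocks, T>>>(d_in, d_out, n);
    std::vector<float> result(n);
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Raising `OUT_PER_THREAD` trades registers for fewer shared loads per output, which is where the extra instruction-level parallelism comes from.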

Hi,
We’ve noticed the same thing here. It suffices to run the convolutionSeparable sample from NVIDIA’s
SDK on Fermi and on K10 to see the problem (I used nvprof to see it).
Changing the thread block size on the K10 in our code helped a lot.

You can also look at the great lecture from Paulius Micikevicius (google
for his name and GTC2012-GPU-Performance-Analysis.pdf). Look at pages 79-96.

Eyal

Yes, already checked through it, might do so again :)

K10 is still GK104 so we might hope for some improvement in the case of the GK110 (K20/K20X).

I’ve had to increase the thread block size going from GF100 -> GK107/GK104 which makes sense given the new larger SM. Is this similar to your experience?

Nope. I’ve changed the configuration: instead of 8x8 or 16x4, I went with 32x2.
That gave a significant boost.
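
For reference, a small sketch of how the block shape is chosen at launch. Threads are numbered x-first within a block, so with 8x8 a warp spans four rows of 8 elements, while with 32x2 each warp covers 32 consecutive elements of one row. The kernel and the `scale2d_on_gpu` helper are hypothetical and only illustrate the launch configuration; they do not reproduce the speedup.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Row-major 2D kernel: x indexes along a row, y selects the row.
__global__ void scale2d(const float* in, float* out, int width, int height, float k) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = k * in[y * width + x];
}

// Hypothetical helper that runs the kernel with a caller-chosen block shape.
std::vector<float> scale2d_on_gpu(const std::vector<float>& h, int width,
                                  int height, float k, dim3 block) {
    const size_t bytes = h.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h.data(), bytes, cudaMemcpyHostToDevice);
    // Round the grid up so partial blocks at the edges are covered.
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_in, d_out, width, height, k);
    std::vector<float> result(h.size());
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Passing `dim3(32, 2)` or `dim3(8, 8)` gives the same output; only the mapping of warps onto rows changes.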

Ah I see… It was already recommended going from GT200->Fermi that you make your x-block dimension at least 32 for optimal coalescing (of course that didn’t apply to all applications).

Sadly the current Kepler GPUs are quite bandwidth bound. I wouldn’t mind a 250-300 GB/s card :-)