What’s new in Maxwell ‘sm_52’ (GTX 9xx)?
- The first CUDA difference noted on the NVIDIA blog is that shared memory has been bumped up to 96 KB. That's 2x Kepler and 50% more than Maxwell v1.
That’s a welcome change, since some people had kernels tuned for a shared-to-register ratio of 1.5 (bytes of shared memory per 32-bit register), i.e. the Fermi ratio, which allowed about 96 bytes per thread in a full-sized 63 register x 512 thread block.
With Kepler, Maxwell v1 and Maxwell v2 all having 64K 32-bit registers, Maxwell v2 returns to that ratio, and there are once again 24 32-bit words of shared memory per thread in a 64 register x 1024 thread block.
- The Maxwell Tuning Guide and the CUDA C Programming Guide note that similar to GK110B, GM204 can "opt-in to caching of global loads in its unified L1/Texture cache."
- There appears to be support for FP16 vector atomics operating on global memory. Expose this in CUDA, please!
- The GTX 980 is reported as having two asynchronous copy engines.
- There is also a new CUDA Toolkit with sm_52 support.
- New drivers: 343/344.xx. FYI, these drivers no longer support sm_1x devices. I had to remove a GT 240 (x1) this morning in order to boot Win7/x64.
- Boost clocks on the 980 look to be as high as we've seen on the 750 Ti. Some of the "golden" GTX 750 Tis boosted to 1320 MHz out of the box. Amazingly, there is an EVGA 980 listed with a guaranteed boost of 1342 MHz (!). And @cbuchner1's crypto link shows overclocks reaching 1520 MHz (!).
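
For anyone who wants to check the shared-to-register arithmetic from the first bullet, here is a rough sketch in Python. The per-SM figures are the ones quoted or implied in the post (48 KB shared for Fermi and Kepler, 64 KB for Maxwell v1, 96 KB for Maxwell v2, 64K registers for Kepler/Maxwell); the 32K register count for Fermi is my assumption, inferred from the 1.5 ratio and the 63 x 512 block. The table and function names are made up for illustration.

```python
# Per-SM resources as quoted or implied in the post (treat as assumptions).
# name: (shared memory in KB, number of 32-bit registers)
SM_RESOURCES = {
    "Fermi": (48, 32 * 1024),               # 32K registers is an assumption
    "Kepler": (48, 64 * 1024),
    "Maxwell v1 (sm_50)": (64, 64 * 1024),
    "Maxwell v2 (sm_52)": (96, 64 * 1024),
}


def shared_bytes_per_register(shared_kb, registers):
    """Bytes of shared memory available per 32-bit register."""
    return shared_kb * 1024 / registers


def shared_words_per_thread(shared_kb, threads):
    """32-bit words of shared memory per thread when `threads` share it all."""
    return shared_kb * 1024 // 4 // threads


for name, (shared_kb, regs) in SM_RESOURCES.items():
    ratio = shared_bytes_per_register(shared_kb, regs)
    print(f"{name}: {ratio:.2f} bytes of shared memory per register")
```

With these numbers, `shared_words_per_thread(48, 512)` (the Fermi case) and `shared_words_per_thread(96, 1024)` (the Maxwell v2 case) both come out to 24 words, i.e. 96 bytes per thread.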