What's new in Maxwell 'sm_52' (GTX 9xx)?

  • The first CUDA difference noted on the NVIDIA blog is that shared memory has been bumped up to 96 KB. That's 2x Kepler and 50% more than Maxwell v1.

    That’s a welcome change since some people had kernels tuned for a shared-to-register ratio of 1.5 – i.e. the Fermi ratio which allowed about 96 bytes per thread in a full-sized 63 register x 512 thread block.

    With Kepler/Maxwell-v1/Maxwell-v2 having 64K 32-bit registers, Maxwell-v2 returns to that ratio, and there are once again 24 32-bit words of shared mem per thread in a 64 register x 1024 thread block.

  • The Maxwell Tuning Guide and the CUDA C Programming Guide note that similar to GK110B, GM204 can "opt-in to caching of global loads in its unified L1/Texture cache."
  • There appears to be support for FP16 vector atomics operating on global memory. Expose this in CUDA, please!
  • The GTX 980 is reported as having two asynchronous copy engines.
  • There is also a new CUDA Toolkit with sm_52 support.
  • New drivers: 343/344.xx. FYI, these drivers no longer support sm_1x devices. I had to remove a GT 240 (x1) this morning in order to boot Win7/x64.
  • Boost clocks on the 980 look to be as high as we've seen on the 750 Ti. Some of the "golden" GTX 750 Ti's boosted to 1320 MHz out of the box. Amazingly there is an EVGA 980 listed with a guaranteed boost of 1342 MHz (!). And @cbuchner1's crypto link shows overclocks reaching 1520 MHz (!).
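The shared-memory-to-register arithmetic in the first bullet can be checked with a few lines of Python. This is only a sketch: the Fermi figures (48 KB of shared memory and 32K registers per SM) are assumptions that are not stated in the post, and the function name is made up for illustration.

```python
def shared_bytes_per_register(shared_kb, registers):
    """Bytes of shared memory available per 32-bit register on one SM."""
    return shared_kb * 1024 / registers

# Assumed Fermi figures (not stated in the post): 48 KB shared, 32K registers.
fermi_ratio = shared_bytes_per_register(48, 32 * 1024)
# Maxwell v2 figures from the post: 96 KB shared, 64K registers.
maxwell_v2_ratio = shared_bytes_per_register(96, 64 * 1024)

# Shared memory per thread when a single block occupies the SM.
fermi_bytes_per_thread = 48 * 1024 // 512               # 512-thread block
maxwell_v2_words_per_thread = 96 * 1024 // 1024 // 4    # 1024-thread block, 4-byte words

print(fermi_ratio, maxwell_v2_ratio)
print(fermi_bytes_per_thread, maxwell_v2_words_per_thread)
```

Under those assumptions both ratios come out the same, which is the point of the bullet: a kernel tuned for the Fermi balance fits Maxwell v2 again.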

Anything else?

Cool, I’ll be picking one up as soon as I can find one… Don’t see one listed on NewEgg yet. I have a ton of in-depth detail on Maxwell that I still need to write up. Been too busy tweaking further performance and features out of my assembler. My sgemm implementation now runs at over 98% efficiency, 3.7% over cublas. This is pretty much right at the synthetic level minus the small overhead of things like bar.syncs you need for real code. Anyway, with this new hardware that should translate into close to 200 Gflops over cublas or about 5.3 Tflops total.

Wow! You should take some power measurements too, if possible. It would be cool to see how hard your SGEMM is pushing the PCB. Either a Kill-a-Watt or the TDP sensor output in something like GPU-Z (it shows a percent of TDP on the 750 Ti).

Can someone please post the entire deviceQuery output for GM204?

Also, amazing architecture: second-generation Maxwell practically demolishes Big Kepler GK110 in compute while using a lot less power, and it’s not even Big Maxwell GM200 yet.

It is still using a 256-bit bus, which I am not loving. It has less bandwidth than a 780Ti by a fair margin (224 vs 336 GB/s). Even the 780 (not Ti) was running at 384 bits for 288 GB/s of bandwidth.

Hopefully, since this is GM204, we will see a GM200 in the first Tesla product with a wider 384-bit memory bus.
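For anyone wanting to see where those bandwidth figures come from, here is a rough sketch of the usual bus-width arithmetic. The per-pin GDDR5 data rates (7 Gbit/s for the 980 and 780 Ti, 6 Gbit/s for the 780) are assumptions that do not appear in this thread, and the function name is just illustrative.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8

# (bus width in bits, assumed per-pin data rate in Gbit/s)
cards = {
    "GTX 980":    (256, 7.0),
    "GTX 780 Ti": (384, 7.0),
    "GTX 780":    (384, 6.0),
}

bandwidth = {name: peak_bandwidth_gb_s(*spec) for name, spec in cards.items()}
for name, gb_s in bandwidth.items():
    print(f"{name}: {gb_s:.0f} GB/s")
```

With the same memory speed, a 384-bit bus would give 1.5x the bandwidth of the 256-bit one, which is why the bus width matters for bandwidth-bound kernels.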

I am still sitting on 3 GTX 780Ti cards from the crypto mining craze of last year. IMHO this upgraded Maxwell architecture does not really have any killer features that make me want to switch hardware right now. They also dropped some instructions from the CUDA cores to save die space and power, the video instructions in particular. So, I guess I’ll pass. These 780Ti’s are going to serve me well enough.

For those into crypto mining (still), this might be of interest:

GTX 980 crypto mining performance:

the table with the raw performance figures:

I am slightly curious as to why I cannot find any reviews from my usual suspects that talk about compute performance. This seems odd, as they usually have at least one mention of GPGPU. Has anyone else managed to find a compute-based review (aside from the crypto stuff above)? The cynic in me wonders whether the reviewers have been ‘advised’ not to include compute figures and to focus on the (admittedly very impressive) gaming performance.

I’ve seen Luxmark 2.0 figures in many benchmarks. I believe this measures OpenCL performance.

Anandtech has some compute numbers.

Thanks, that definitely is a bit surprising in a good way. It actually looks like there is an improvement in compute. I will still be wary of the memory bandwidth until I get my grubby hands on one for a good thrashing.

NewEgg does list the GTX 980 and GTX 970 now, but ‘sold out’ with an ETA of 9/23.

EVGA GeForce GTX 980 04G-P4-1980-KR 4GB GAMING, Silent Cooling Graphics Card - Newegg.com

Hopefully there will be a GTX 980ti released.

Regardless I am going to get one and run my atypical compute tests against the GTX 780ti.
Nvidia says that they got 2.7 Tflops for the nbody, while I was only able to get 2.1 Tflops with the GTX 780ti.

allanmac: thanks for the power measurement tip. I tried GPU-Z a while back and it was broken with Maxwell and I forgot all about it. Got the new version and it works fine. Looking at the clocks and TDP values during computation cleared up a few things for me. My fastest implementation runs at 1658 Gflops sustained and it’s able to do that at a 1320 MHz clock. TDP hovers between 98 and 99%.

However, using different instruction ordering patterns and different register reuse and bank access patterns was giving me mysterious results. But looking at the clock and TDP I can now be more sure of why. Less register reuse increases register bank bandwidth and drops the clock down to 1306, and the one with the different ordering but same amount of reuse kept the clock at 1320 but TDP dropped down to 96%. This means it’s stalling somewhere. I’m now pretty certain it’s the register bank conflicts between ffmas and ongoing memory ops. Memory ops hold on to their register values for at least 20 clocks (which is why you need write-after-read barriers for memory operands). So during that time it makes sense you could get additional bank conflicts. I’m not sure which gets prioritized in the event of a conflict but either way could slow things down.

Also, I looked at my op code flags for the ATOM op. It’s clear there are holes for future expansion so I gave them a try with the new cuobjdump and found the F16x2 flag at least:

ATOM: type
0x0002000000000000 .S32
0x0004000000000000 .U64
0x0006000000000000 .F32.FTZ.RN
0x0008000000000000 .F16x2.FTZ.RN
0x000a000000000000 .S64
0x0002000000000000 .64

You would think the F16x2 value would be using the “c” or “e” flag (“1” is used for the 64-bit addressing ‘E’ flag). I also tried the “a” flag since S64 is supposed to be considered illegal with ATOM.ADD (you can see the “2” flag is overloaded depending on the mode: CAS uses .64). But cuobjdump had no issue with it… maybe that support has been added as well. So, maxas supports F16x2 at least now (it’s checked in if you want to play with it).
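To make the flag layout concrete, here is a small illustrative decoder for the type field listed in the table. It is only a sketch and is not how maxas does it: the mask, the function name and the `mode` argument are hypothetical, and only the six table values come from the post.

```python
# Type-field values copied from the table in the post.
ATOM_TYPES = {
    0x0002000000000000: ".S32",
    0x0004000000000000: ".U64",
    0x0006000000000000: ".F32.FTZ.RN",
    0x0008000000000000: ".F16x2.FTZ.RN",
    0x000a000000000000: ".S64",
}

# Hypothetical mask: the upper three bits of that nibble. The lowest bit is
# left out because the post says "1" is the 64-bit addressing 'E' flag.
ATOM_TYPE_MASK = 0x000e000000000000


def decode_atom_type(opcode, mode="ADD"):
    """Return the type suffix for an ATOM opcode, or None if it is not in the table."""
    bits = opcode & ATOM_TYPE_MASK
    # The "2" value is overloaded: in CAS mode it reads as .64 rather than .S32.
    if mode == "CAS" and bits == 0x0002000000000000:
        return ".64"
    return ATOM_TYPES.get(bits)


print(decode_atom_type(0x0008000000000000))           # F16x2 entry
print(decode_atom_type(0x0002000000000000, "CAS"))    # overloaded "2" value
print(decode_atom_type(0x000c000000000000))           # unlisted "c" hole
```

The unlisted “c” and “e” values fall through to `None`, which matches the holes described above.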

New 980 arrives Monday. Eager to put it through its paces.

I wonder why the clock is dropping? Is the 750 Ti overheating?

Oh, the clock is probably dropping because you’re at 99% TDP. None of my benchmarks have managed to get beyond 70% TDP yet report 99% GPU and MEM so that’s quite an accomplishment to max out the TDP with a CUDA app. :)

If it’s actually heat and not wattage, then you could try installing something like EVGA Precision X and max out your fan rpm’s.

If you haven’t already, dumping all your metrics with “nvprof.exe -m all <sgemm.exe>” might reveal some more interesting stuff. It might take a while to capture all the metrics.

That’s cool that FP16x2 atomics are visible. Now I just wish that FP16 vector FMAs existed in the SMM (fma.sat.v2.f16).

I feel sorry for your GTX 980. It probably thinks it’s going to a quiet PC and will only play a few hours of video games each week.

I had Precision X installed but didn’t notice you could control the fan speed. I really hate “enthusiast” UI’s. But upping the fan got me to 1660 Gflops sustained. The slower configs ran a touch faster but don’t think they’re temperature bound. I’ve included comments in the code on this issue here:

and this might help too, though it’s a work in progress (texture load mapping is already outdated)…

Forgot about nvprof but it turns out not to give you much data beyond what you get from Nsight, which I’ve been leveraging heavily. My IPC issued/executed per SM is at 4.26 (out of a theoretical 4.29 with the level of dual issues in my code). Warp issue efficiency is at 99% and 15 out of 16 warps per SM are eligible on average.

F16x2 FMA’s would be cool to have. In fact, one of the driving factors behind wanting to implement my own sgemm was wanting to leverage the normalized float functionality of the texture loads. That way I can store my weight matrices with 16 or even 8 bit precision if I want. The other was needing to implement custom convolution kernels. The cuDNN lib that Nvidia just released is cool and all, but it’s still using MAGMA style sgemm :/

Ok, next up, I think it’s a couple of final features for the assembler, then I’ll document it all (I promise). I want fully automatic bank-conflict-avoiding register allocation for the non-fixed registers (I’m most of the way there for this). And I want a simple built-in compiler for C-like expressions. With C syntax you can write all your tedious memory offset code normally and have all the assembly handled for you. Then you can focus your assembly purely on the performance sections of the kernel. So with those two features combined, writing a kernel should be as painless as working in cuda c for the mundane stuff, but you get complete register control and the full power of sass for your performance code.

I used to be one of those gamers and I doubt I’ve clocked enough gpu compute hours to remotely approach the number of graphics ops executed over past years… but I’m working on it :)

Very nice! I had missed that the V2 had 96 KB of shared memory.

I would be really interested to hear about peak CUFFT performance for FP32. Has anyone performed any such benchies? Last time I had a look it appeared to be a bandwidth-bound problem, so perhaps the new larger L2 will help out massively for smaller FFT sizes?

@scottgray: 98% utilization on SGEMM is very impressive! What utilization did you get on Kepler?

I never owned a Kepler… I was about to pull the trigger on getting a 780Ti early this year, but then Nvidia surprised us by releasing a Maxwell card early. I mainly needed a card for development so having the raw number of cuda cores wasn’t that important…but getting to work on the newest architecture was. Happy I made that decision as now I’m well positioned to really leverage the performance of the beefier version.

Anyway, the cublas Kepler implementation is pretty solid, so if you’re curious about sgemm performance just try the cublas lib. I think I saw a bench claiming 3800 Gflops on a K40?

For Maxwell my code is only 3% faster than the cublas implementation… at least for large matrices. For smaller ones I can double the performance since I’m halving the block sizes. I’ll post some charts with benches once my new Maxwell arrives today. Maybe I’ll give the FFT lib a spin if it’s not too much trouble (haven’t used that lib before).

Ok, so the preliminary results are in. I’m not quite where I should be because my code is currently 32 bit and even an 8192 sized square matrix doesn’t quite fill 16 SMs (at 16 warps per SM which is close but not quite enough to hit the 98% level I was getting on the 750Ti).

But here are the results (Max128 is my custom sgemm implementation):

Gigabyte GeForce GTX 980

Max128 GFLOPS: 6218 (size: 8192, iterations: 100)
Cublas GFLOPS: 5987 (size: 8192, iterations: 100)

GPU Clock: 1640 (+400 over default!)
TDP: 112%
Temp: 72C
Volts: 1.225 (default)

I also tested this config at 5000 iterations and it seems to be stable. So all in all I’m pretty pleased. I had a good feeling this card would overclock well given my experience with the 750Ti.

Next up is to add some streams to my benchmark and see if I can’t hit the 6.5 or 6.6 Tflops I should be getting.
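As a rough cross-check of that 6.5 to 6.6 Tflops target, the sketch below scales the 750 Ti efficiency reported earlier (1658 Gflops at 1320 MHz) up to the 980 at 1640 MHz. The core counts (640 for the 750 Ti, 128 per SM for the 980) are assumptions not stated in the thread, and the helper name is made up.

```python
def peak_sgemm_gflops(cuda_cores, clock_mhz):
    """Peak single-precision rate, counting one FMA (2 flops) per core per clock."""
    return cuda_cores * 2 * clock_mhz / 1000.0

# GTX 750 Ti: assumed 640 CUDA cores; 1658 Gflops measured at 1320 MHz.
peak_750ti = peak_sgemm_gflops(640, 1320)
efficiency_750ti = 1658 / peak_750ti

# GTX 980: 16 SMs (from the post) x an assumed 128 cores per SM, at 1640 MHz.
peak_980 = peak_sgemm_gflops(16 * 128, 1640)
target_980 = efficiency_750ti * peak_980   # what 750 Ti-level efficiency would give
efficiency_980 = 6218 / peak_980           # Max128 result from the post

print(f"750 Ti: peak {peak_750ti:.1f} Gflops, efficiency {efficiency_750ti:.1%}")
print(f"980: peak {peak_980:.1f} Gflops, target {target_980:.0f} Gflops, "
      f"measured efficiency {efficiency_980:.1%}")
```

Under those assumptions the measured 6218 Gflops sits a few percent below the 750 Ti efficiency level, which is consistent with the matrix not quite filling all 16 SMs.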

Thanks for the report! Fantastic results!

Would love to get an idea of how the GTX 980 handles my permutation code, which is compute bound:


One version just permutes an array, and the other permutes, evaluates, scans, and reduces (which is the default implementation).

I cannot even imagine what the Ti version will be like.

1640 MHz and 6.2 TFLOPS!!!

CudaaduC: here are your results:

Starting GPU testing:

Testing full version.
GPU timing: 7.48 seconds.
GPU answer is: 1610

The evaluation had ( n!(4+2n+n^2)) steps, which is apx. 1239177139200 iterations.

[3]= 277,[1]= 438,[7]= 5,[9]= 127,[5]= 129,[2]= 19,[0]= 33,[12]= 3,[4]= 449,[6]= 40,[8]= 22,[10]= 61,[11]= 7,
GPU total=1610

Just compiled it and ran it as is… let me know if there are some params I need to pass, or keyboard input it’s expecting…
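The step count printed in that output can be reproduced from the formula it quotes. A minimal sketch, assuming n = 13 (inferred from the 13 entries [0] through [12] in the output, not stated explicitly); the function name is hypothetical.

```python
from math import factorial

def evaluation_steps(n):
    """Step count from the formula in the program output: n! * (4 + 2n + n^2)."""
    return factorial(n) * (4 + 2 * n + n * n)

n = 13  # assumption: inferred from the 13 indexed entries in the output
steps = evaluation_steps(n)

gpu_seconds = 7.48  # GPU timing reported in the output
steps_per_second = steps / gpu_seconds

print(steps)
print(f"{steps_per_second:.3e} steps per second")
```

That works out to roughly 1.66e11 evaluation steps per second for the full version on the overclocked 980, assuming the timing covers exactly those steps.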