What's new in Maxwell 'sm_52' (GTX 9xx) ?

allanmac · September 19, 2014, 3:30am

The first CUDA difference noted on the NVIDIA blog is that shared memory has been bumped up to 96 KB. That's 2x Kepler and 50% more than Maxwell v1.
That’s a welcome change since some people had kernels tuned for a shared-to-register ratio of 1.5 – i.e. the Fermi ratio which allowed about 96 bytes per thread in a full-sized 63 register x 512 thread block.

With Kepler/Maxwell-v1/Maxwell-v2 having 64K 32-bit registers, Maxwell-v2 returns to that ratio and there are once again 24 32-bit words of shared mem per thread in a 64 register x 1024 thread block.
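The per-thread arithmetic behind that ratio can be checked with a few lines of Python. This is only a sketch: the 48 KB Fermi shared memory figure is an assumption not stated in the post, the function name is illustrative, and "ratio of 1.5" is read here as bytes of shared memory per 32-bit register.

```python
def shared_bytes_per_thread(shared_kb, threads):
    """Shared memory per thread if `threads` threads split `shared_kb` KB evenly."""
    return shared_kb * 1024 / threads

# Fermi: 48 KB shared (assumed, not from the post), 63 registers x 512 threads
fermi_bytes = shared_bytes_per_thread(48, 512)
fermi_ratio = fermi_bytes / 63            # bytes of shared memory per register

# Maxwell v2 (sm_52): 96 KB shared, 64 registers x 1024 threads
maxwell_bytes = shared_bytes_per_thread(96, 1024)
maxwell_words = maxwell_bytes / 4         # 32-bit words per thread
maxwell_ratio = maxwell_bytes / 64

print(fermi_bytes, round(fermi_ratio, 2))
print(maxwell_bytes, maxwell_words, maxwell_ratio)
```

Both configurations come out at 96 bytes per thread. One caveat worth checking against the Maxwell Tuning Guide: the 96 KB is a per-SM figure, and a single thread block is, as far as is known, still limited to 48 KB, so the 1024 threads would have to come from more than one resident block.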

The Maxwell Tuning Guide and the CUDA C Programming Guide note that similar to GK110B, GM204 can "opt-in to caching of global loads in its unified L1/Texture cache."
There appears to be support for FP16 vector atomics operating on global memory. Expose this in CUDA, please!
The GTX 980 is reported as having two asynchronous copy engines.
There is also a new CUDA Toolkit with sm_52 support.
New drivers: 343/344.xx. FYI, these drivers no longer support sm_1x devices. I had to remove a GT 240 (x1) this morning in order to boot Win7/x64.
Boost clocks on the 980 look to be as high as we've seen on the 750 Ti. Some of the "golden" GTX 750 Ti's boosted to 1320 MHz out of the box. Amazingly there is an EVGA 980 listed with a guaranteed boost of 1342 MHz (!). And @cbuchner1's crypto link shows overclocks reaching 1520 MHz (!).

Anything else?

scottgray · September 19, 2014, 4:17am

Cool, I’ll be picking one up as soon as I can find one… Don’t see one listed on NewEgg yet. I have a ton of in-depth detail on Maxwell that I still need to write up. Been too busy tweaking further performance and features out of my assembler. My sgemm implementation now runs at over 98% efficiency, 3.7% over cublas. This is pretty much right at the synthetic level minus the small overhead of things like bar.syncs you need for real code. Anyway, with this new hardware that should translate into close to 200 Gflops over cublas or about 5.3 Tflops total.

allanmac · September 19, 2014, 4:26am

Wow! You should take some power measurements too, if possible. It would be cool to see how hard your SGEMM is pushing the PCB. Either a Kill-a-Watt or the TDP sensor output in something like GPU-Z (it shows a percent of TDP on the 750 Ti).

NVD · September 19, 2014, 8:28am

Can someone please post the entire deviceQuery output for GM204?

Also amazing architecture, Maxwell second generation practically demolishes Big Kepler GK110 while using a lot less power in compute and it’s not even Big Maxwell GM200 yet.

Ailleur · September 19, 2014, 12:20pm

It is still using a 256-bit bus, which I am not loving. It has less bandwidth than a 780Ti by a fair margin (224 vs 336 GB/s). Even the 780 (not Ti) was running at 384 bits for 288 GB/s of bandwidth.
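Those bandwidth figures follow from bus width times effective memory data rate. A minimal sketch, assuming effective GDDR5 data rates of 7 Gbps per pin for the GTX 980 and 780 Ti and 6 Gbps for the 780 (these rates are not given in the post, and the function name is illustrative):

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: width of the memory interface in bits
    data_rate_gbps: effective per-pin data rate in Gbit/s (assumed values)
    """
    return bus_width_bits * data_rate_gbps / 8  # 8 bits per byte

gtx_980 = peak_bandwidth_gbs(256, 7.0)
gtx_780_ti = peak_bandwidth_gbs(384, 7.0)
gtx_780 = peak_bandwidth_gbs(384, 6.0)

print(gtx_980, gtx_780_ti, gtx_780)
```

Under those assumed data rates the three cards come out at 224, 336 and 288 GB/s, so the gap is entirely due to the narrower bus.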

Hopefully, since this is GM204, we will see a GM200 in the first Tesla product with a wider 384-bit memory bus.

cbuchner1 · September 19, 2014, 12:26pm

I am still sitting on 3 GTX 780Ti from the crypto mining craze of last year. IMHO this upgraded Maxwell architecture does not really have any killer features that make me want to switch hardware right now. They also dropped some instructions from the CUDA cores to save die space and power - the video instructions in particular. So, I guess I’ll pass. These 780Ti’s are going to serve me well enough.

For those into crypto mining (still), this might be of interest:

GTX 980 crypto mining performance:
[url]http://cryptomining-blog.com/3503-crypto-mining-performance-of-the-new-nvidia-geforce-gtx-980/[/url]

the table with the raw performance figures:

Tiomat · September 19, 2014, 12:59pm

I am slightly curious as to why I cannot find any review on my usual suspects that talks about compute performance. This seems odd as usually they have at least one mention of GPGPU. Has anyone else managed to find a compute-based review (aside from the crypto stuff above)? The cynic in me wonders whether the reviewers have been ‘advised’ not to include compute figures and focus on the (admittedly very impressive) gaming performance.

cbuchner1 · September 19, 2014, 1:22pm

I’ve seen Luxmark 2.0 figures in many benchmarks. I believe this measures OpenCL performance.

Ailleur · September 19, 2014, 1:26pm

Anandtech has some compute numbers.
[url]http://anandtech.com/show/8526/nvidia-geforce-gtx-980-review/20[/url]

Tiomat · September 19, 2014, 2:08pm

Thanks, that definitely is a bit surprising in a good way. It actually looks like there is an improvement in compute. I will still be wary of the memory bandwidth until I get my grubby hands on one for a good thrashing.

CudaaduC · September 19, 2014, 7:01pm

NewEgg does list the GTX 980 and GTX 970 now, but ‘sold out’ with an ETA of 9/23.

EVGA GeForce GTX 980 04G-P4-1980-KR 4GB GAMING, Silent Cooling Graphics Card - Newegg.com

Hopefully there will be a GTX 980ti released.

Regardless I am going to get one and run my atypical compute tests against the GTX 780ti.
Nvidia says that they got 2.7 Tflops for the nbody, while I was only able to get 2.1 Tflops with the GTX 780ti.

scottgray · September 20, 2014, 10:40pm

allanmac: thanks for the power measurement tip. I tried GPU-Z a while back and it was broken with Maxwell and I forgot all about it. Got the new version and it works fine. Looking at the clocks and TDP values during computation cleared up a few things for me. My fastest implementation runs at 1658 Gflops sustained and it’s able to do that at a 1320 MHz clock. TDP hovers between 98 and 99%.
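The 98% efficiency figure quoted earlier in the thread can be sanity-checked from these numbers. A rough sketch, assuming the card is the GTX 750 Ti with 640 CUDA cores (the core count is not stated in the post) and counting one FMA as two flops per core per clock:

```python
def peak_sp_gflops(cuda_cores, clock_mhz):
    """Theoretical single-precision peak: one FMA (2 flops) per core per clock."""
    return cuda_cores * 2 * clock_mhz / 1000.0

peak_750ti = peak_sp_gflops(640, 1320)   # 640 cores is an assumption
efficiency = 1658 / peak_750ti           # sustained Gflops reported in the post

print(round(peak_750ti, 1), round(efficiency, 3))
```

With those assumptions the peak is about 1690 Gflops, so 1658 Gflops sustained is roughly 98% of theoretical, which is consistent with the efficiency claimed for the sgemm kernel.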

However, using different instruction ordering patterns and different register reuse and bank access patterns was giving me mysterious results. But looking at the clock and TDP I can now be more sure of why. Less register reuse increases register bank bandwidth and drops the clock down to 1306, and the one with the different ordering but same amount of reuse kept the clock at 1320 but TDP dropped down to 96%. This means it’s stalling somewhere. I’m now pretty certain it’s the register bank conflicts between ffmas and ongoing memory ops. Memory ops hold on to their register values for at least 20 clocks (which is why you need write-after-read barriers for memory operands). So during that time it makes sense you could get additional bank conflicts. I’m not sure which gets prioritized in the event of a conflict but either way could slow things down.

Also, I looked at my op code flags for the ATOM op. It’s clear there are holes for future expansion so I gave them a try with the new cuobjdump and found the F16x2 flag at least:

ATOM: type
0x0002000000000000 .S32
0x0004000000000000 .U64
0x0006000000000000 .F32.FTZ.RN
0x0008000000000000 .F16x2.FTZ.RN
0x000a000000000000 .S64
0x0002000000000000 .64

You would think the F16x4 value would be using the “c” or “e” flag (“1” is used for the 64 bit addressing ‘E’ flag). I also tried the “a” flag since S64 is supposed to be considered illegal with ATOM.ADD (you can see the “2” flag is overloaded depending on the mode: CAS uses .64). But cuobjdump had no issue with it… maybe that support has been added as well. So, maxas supports F16x2 at least now (it’s checked in if you want to play with it).
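To make the flag discussion concrete, here is a small Python sketch that decodes the type field using only the values listed above. The mask constants and the function name are assumptions inferred from those values, not an official encoding, and the rest of the opcode word is ignored.

```python
# Type flags exactly as listed in the post (0x2 is read as .64 in CAS mode).
ATOM_TYPE_FLAGS = {
    0x0002000000000000: ".S32",
    0x0004000000000000: ".U64",
    0x0006000000000000: ".F32.FTZ.RN",
    0x0008000000000000: ".F16x2.FTZ.RN",
    0x000a000000000000: ".S64",
}
ATOM_TYPE_MASK = 0x000e000000000000  # assumed: upper three bits of that nibble
ATOM_E_FLAG = 0x0001000000000000     # the "1" flag: 64-bit addressing 'E'

def decode_atom_type(opcode):
    """Return (type suffix, uses_64bit_addressing) for an ATOM opcode word."""
    suffix = ATOM_TYPE_FLAGS.get(opcode & ATOM_TYPE_MASK, "<unlisted>")
    return suffix, bool(opcode & ATOM_E_FLAG)

# Even nibble values that the list above leaves unassigned.
unused = [hex(n) for n in range(2, 16, 2) if (n << 48) not in ATOM_TYPE_FLAGS]

print(decode_atom_type(0x0009000000000000))
print(unused)
```

Under that reading the only unassigned type values are "c" and "e", which is where an F16x4 type would be expected to land if it exists.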

New 980 arrives Monday. Eager to put it through its paces.

allanmac · September 21, 2014, 12:24am

I wonder why the clock is dropping? Is the 750 Ti overheating?

Oh, the clock is probably dropping because you’re at 99% TDP. None of my benchmarks have managed to get beyond 70% TDP yet report 99% GPU and MEM so that’s quite an accomplishment to max out the TDP with a CUDA app. :)

If it’s actually heat and not wattage, then you could try installing something like EVGA Precision X and max out your fan rpm’s.

If you haven’t already, dumping all your metrics with “nvprof.exe -m all <sgemm.exe>” might reveal some more interesting stuff. It might take a while to capture all the metrics.

That’s cool that FP16x2 atomics are visible. Now I just wish that FP16 vector FMAs existed in the SMM (fma.sat.v2.f16).

I feel sorry for your GTX 980. It probably thinks it’s going to a quiet PC and will only play a few hours of video games each week.

scottgray · September 21, 2014, 3:10am

I had Precision X installed but didn’t notice you could control the fan speed. I really hate “enthusiast” UI’s. But upping the fan got me to 1660 Gflops sustained. The slower configs ran a touch faster but don’t think they’re temperature bound. I’ve included comments in the code on this issue here:

https://code.google.com/p/maxas/source/browse/sgemm/sgemm128.sass
and this might help too, though it’s a work in progress (texture load mapping is already outdated)…
https://code.google.com/p/maxas/wiki/sgemm

Forgot about nvprof but it turns out not to give you much data beyond what you get from Nsight, which I’ve been leveraging heavily. My IPC issued/executed per SM is at 4.26 (out of a theoretical 4.29 with the level of dual issues in my code). Warp issue efficiency is at 99% and 15 out of 16 warps per SM are eligible on average.

F16x2 FMA’s would be cool to have. In fact, one of the driving factors behind wanting to implement my own sgemm was wanting to leverage the normalized float functionality of the texture loads. That way I can store my weight matrices with 16 or even 8 bit precision if I want. The other was needing to implement custom convolution kernels. The cuDNN lib that Nvidia just released is cool and all, but it’s still using MAGMA style sgemm :/

Ok, next up, I think it’s a couple final features for the assembler, then I’ll document it all (I promise). I want fully automatic bank conflict avoiding register allocation for the non-fixed registers (I’m most of the way there for this). And I want a simple built in compiler for C like expressions. With C syntax you can write all your tedious memory offset code normally and have all the assembly handled for you. Then you can focus your assembly purely on the performance sections of the kernel. So with those two features combined writing a kernel should be as painless as working in cuda c for the mundane stuff, but you get complete register control and the full power of sass for your performance code.

I used to be one of those gamers and I doubt I’ve clocked enough gpu compute hours to remotely approach the number of graphics ops executed over past years… but I’m working on it :)

Jimmy_Pettersson · September 22, 2014, 1:57pm

Very nice! I had missed that the V2 had 96 KB of shared memory.

I would be really interested to hear about peak CUFFT performance for FP32. Has anyone run any such benchmarks? Last time I had a look it appeared to be a bandwidth-bound problem; perhaps the new larger L2 will help out massively for smaller FFT sizes?

@scottgray: 98% utilization on SGEMM is very impressive! What utilization did you get on Kepler?

scottgray · September 22, 2014, 3:55pm

I never owned a Kepler… I was about to pull the trigger on getting a 780Ti early this year, but then Nvidia surprised us by releasing a Maxwell card early. I mainly needed a card for development so having the raw number of cuda cores wasn’t that important…but getting to work on the newest architecture was. Happy I made that decision as now I’m well positioned to really leverage the performance of the beefier version.

Anyway, the cublas Kepler implementation is pretty solid, so if you’re curious about sgemm performance just try the cublas lib. I think I saw a bench claiming 3800 Gflops on a K40?

For Maxwell my code is only 3% faster than the cublas implementation… at least for large matrices. For smaller ones I can double the performance since I’m halving the block sizes. I’ll post some charts with benches once my new Maxwell arrives today. Maybe I’ll give the FFT lib a spin if it’s not too much trouble (haven’t used that lib before).

scottgray · September 23, 2014, 1:35am

Ok, so the preliminary results are in. I’m not quite where I should be because my code is currently 32 bit and even an 8192-sized square matrix doesn’t quite fill 16 SMs (at 16 warps per SM, which is close but not quite enough to hit the 98% level I was getting on the 750Ti).

But here are the results (Max128 is my custom sgemm implementation):

Gigabyte GeForce GTX 980

Max128 GFLOPS: 6218 (size: 8192, iterations: 100)
Cublas GFLOPS: 5987 (size: 8192, iterations: 100)

GPU Clock: 1640 (+400 over default!)
TDP: 112%
Temp: 72C
Volts: 1.225 (default)

I also tested this config at 5000 iterations and it seems to be stable. So all in all I’m pretty pleased. I had a good feeling this card would overclock well given my experience with the 750Ti.

Next up is to add some streams to my benchmark and see if I can’t hit the 6.5 or 6.6 Tflops I should be getting.
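The 6.5 to 6.6 Tflops target can be reproduced from the reported clock. A rough sketch, assuming the GTX 980 has 2048 CUDA cores (not stated in the post) and that the 98% efficiency seen on the 750Ti carries over:

```python
def peak_sp_gflops(cuda_cores, clock_mhz):
    """Theoretical single-precision peak: one FMA (2 flops) per core per clock."""
    return cuda_cores * 2 * clock_mhz / 1000.0

peak_980 = peak_sp_gflops(2048, 1640)   # 2048 cores is an assumption
achieved = 6218 / peak_980              # Max128 result as a fraction of peak
target = 0.98 * peak_980                # what 98% efficiency would give
over_cublas = 6218 / 5987               # Max128 relative to Cublas

print(round(peak_980, 2), round(achieved, 3), round(target), round(over_cublas, 3))
```

On these assumptions the card peaks at about 6717 Gflops at 1640 MHz, the measured 6218 Gflops is roughly 93% of that, and 98% would be about 6.58 Tflops. The custom kernel is a little under 4% ahead of Cublas in this run.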

CudaaduC · September 23, 2014, 1:57am

Thanks for the report! Fantastic results!

Would love to get an idea of how the GTX 980 handles my permutation code, which is compute bound:

https://github.com/OlegKonings/CUDA_permutations_large/blob/master/EXP3/EXP3/EXP3.cu

One version just permutes an array, and the other permutes, evaluates, scans, and reduces (which is the default implementation).

I cannot even imagine what the Ti version will be like.

allanmac · September 23, 2014, 2:09am

1640 MHz and 6.2 TFLOPS!!!

scottgray · September 23, 2014, 2:37am

CudaaduC: here are your results:

Capable!
Starting GPU testing:

Testing full version.
GPU timing: 7.48 seconds.
GPU answer is: 1610

The evaluation had ( n!(4+2n+n^2)) steps, which is apx. 1239177139200 iterations.

[3]= 277,[1]= 438,[7]= 5,[9]= 127,[5]= 129,[2]= 19,[0]= 33,[12]= 3,[4]= 449,[6]= 40,[8]= 22,[10]= 61,[11]= 7,
GPU total=1610

Just compiled it and ran it as is… let me know if there are some params I need to pass, or keyboard input it’s expecting…
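The iteration count in that output can be checked against the formula it prints. A quick sketch, assuming n = 13 (inferred from the 13 indices in the result line, not stated explicitly) and using the 7.48 second timing to estimate throughput:

```python
import math

def evaluation_steps(n):
    """Step count from the program output: n! * (4 + 2n + n^2)."""
    return math.factorial(n) * (4 + 2 * n + n * n)

n = 13                                   # assumption: 13 array elements
steps = evaluation_steps(n)
rate = steps / 7.48                      # iterations per second on the GTX 980

# Per-index counts copied from the output line, in the order printed.
counts = [277, 438, 5, 127, 129, 19, 33, 3, 449, 40, 22, 61, 7]
total = sum(counts)

print(steps, f"{rate:.3e}", total)
```

With n = 13 the formula gives exactly the 1239177139200 iterations reported, the per-index counts sum to the GPU total of 1610, and the run works out to roughly 1.66e11 iterations per second.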
