What's new in Maxwell 'sm_52' (GTX 9xx) ?

That is really fast. With the GTX 780ti that full version takes 9.3 seconds.

For that arch there is probably some tweaking that can be done to ‘blockSize’, which dictates the amount of work each thread in a block does, but already with those default parameters you have verified that I must move to Maxwell.


It would be interesting to see small 1D FFTs with many batches, e.g.:

N = 256-2048
NbBatches = 2048-16384

I can give it a spin on my Titan & K40 for reference as well. Which OS do you run on?

Probably a world record for a single chip configuration?

So TDP was 112% => 6218 GFLOPS / (165 W * 1.12) ~ 33.6 GFLOPS/W. I’m guessing an FPGA or custom ASIC could beat that efficiency for really small matrices but would get into more trouble for larger matrices with higher memory bandwidth requirements…?
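As a quick sanity check on that perf/W arithmetic (a throwaway script using the 165 W board power and 112% TDP reading quoted above):

```python
# perf/W estimate from the numbers above: 6218 GFLOPS at a
# TDP reading of 112% of the 165 W board power
gflops = 6218.0
watts = 165.0 * 1.12
print(round(gflops / watts, 1))  # -> 33.6
```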

As I said, I should be closer to 6.6 Tflops but something is holding me back (my earlier comment of not having enough blocks at N=8192 was a goof, wasn’t thinking in n^2 land). Even at the default clock I’m still only at 94.5% efficiency (cublas even less). On the 750Ti, I was consistently over 98%. I need to run some synthetic benches to see what’s up. Might just be an artifact of having more SMs? Maybe throwing more than one stream at it will expose the missing performance?

Jimmy, I’m on Win7 but if you send me something I can compile (command line is fine) I’d be happy to run it for you.

Oh, and one other interesting feature that I don’t think has been mentioned: ASYNC_ENGINE_COUNT=2 on this hardware. Not sure how that compares to, say, a 780Ti, but I seem to remember it used to be only Tesla cards that had two.

344.16 WHQL driver for GTX 980 & GTX 970, mostly the same as 344.11 WHQL driver except with 2 compatibility fixes.

Well, I give up for now. I cannot for the life of me write a synthetic benchmark that performs at more than 98% of theoretical. The clocks are being lost somewhere and I’m out of ideas to account for them. Maybe it’s just some overhead required by the hardware. I guess I’m not too concerned about it since the overclock-ability can more than make up for that slight performance hit.

Also, at the higher core clock the memory speed starts to become a bottleneck. Feeding 16 SMs over a 256-bit bus isn’t as easy as feeding 5 with 128 bits. Which is why my sgemm can’t hit the 96% level (2% from mysterious reasons and 2% difference from synthetic). I need to experiment with overclocking the memory more but was having trouble with that for some reason. At default clocks I am hitting the 96% level just fine. So it seems with sgemm at 1640 MHz, 6.2 Tflops is the limit unless I can remove that memory bottleneck.

Also telling is my 64 threads/block implementation. That one needs double the bandwidth compared to the 256 thread version. While it still does well at the small matrix sizes (because it can split them up into 4x as many blocks), at the larger sizes it is severely limited by bandwidth. Whereas on the 750Ti it was performing very nearly at speed of the 256 thread version on large matrices.

So if you’re on the fence about this hardware… you might want to hold out for the larger memory bus sized “Ti” version. Who knows when that’s coming out though.

I’d say the percentages you are seeing are par for the course. My last experience with writing a synthetic test (to establish the maximum floating-point throughput of an AMD K7) is admittedly 12+ years back. But I recall how I spent copious time trying to adapt the code to details of the instruction decode process (instruction alignment etc), the register file usage, pipeline depth etc. As I recall I was able to achieve about 96% or 97% of the theoretical throughput after several days of work.

In a well-balanced processor, once code is pushed close to the theoretical throughput limits it comes up against multiple hardware throughput limits simultaneously so that the tiniest noise caused by any number of hardware mechanisms (e.g. a loop closing branch) causes pipeline bubbles. I don’t know what your code looks like, but I could imagine that minor issues like branching, or the occasional I-cache miss could be a reason 100% cannot be reached. Have you looked at the profiler output to see what it claims about the remaining few % of time?

Regarding memory:

For an NxN matrix with N=8192 you would be required to input & output:

2*(N*N) * 4 bytes + (N*N) * 4 bytes ~ 0.8053 GB (2 matrices in, 1 out)

And you would be processing that in:

2*N^3 * 10^-9 GFLOP ~ 1099.5 GFLOP, and 1099.5 GFLOP / (6218 GFLOP/s) ~ 0.1768 seconds.


Hence you would need a throughput of 0.8053 GB / 0.1768 s ~ 4.55 GB/s.

Which would lead me to think that you’re still far from memory bound?

*disclaimer* It’s really late so my math could well be off here :-D *disclaimer*
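For what it’s worth, the estimate above can be sketched as a quick script (same assumptions: fp32 operands, 2 matrices in, 1 out, and the 6218 GFLOP/s figure measured earlier in the thread):

```python
# Back-of-the-envelope check: is SGEMM at N=8192 memory bound?
N = 8192
flops_per_s = 6218e9          # measured throughput from earlier in the thread

bytes_moved = 3 * N * N * 4   # 2 input matrices + 1 output, fp32
flops = 2 * N**3              # multiply-add counted as 2 FLOPs
seconds = flops / flops_per_s
gb_per_s = bytes_moved / seconds / 1e9

print(f"{bytes_moved / 1e9:.4f} GB in {seconds:.4f} s -> {gb_per_s:.2f} GB/s")
# -> 0.8053 GB in 0.1768 s -> 4.55 GB/s
```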

njuffa, I’m using my assembler to construct the synthetic benchmark so I don’t need to worry about dealing with ptxas scrambling my code. When I run it in Nsight or nvprof, the delays seem to go away. The unrolled loop size is well within the 8 KB instruction cache so no fetch latencies. I’m at exactly 4.0 IPC issued/executed. My warp efficiency is 99.93%. So maybe the delay is actually outside of the kernel? This same bench runs right at the theoretical on my 750Ti (or rather 6 Gflops shy). Maybe I’ll submit a bug to Nvidia…

Jimmy, I absolutely agree with you that sgemm is not memory bound. I didn’t even consider investigating that aspect of it till I ran out of ideas. However, there is no denying the analysis results I’m seeing. Memory dependencies are the only metric that increases at the higher core clock (where I start falling off my 96% efficiency levels). Also the 64 thread version is very telling as well. The only difference that matters from a compute standpoint is the fact that it has four tex.1d.v4 loads instead of the 256 thread’s two loads (it has a quarter the threads to load half the shared memory size).

The overall averaged numbers from these implementations are well within the performance limits of the device. That’s not to say, however, that there aren’t bursts of activity well beyond those limits. And the more I think about it, the more I’m inclined to investigate the order in which my blocks are being executed. I think just as I gain a lot of performance from ordering my ffma’s to maximize register bank caching, I can probably achieve the same effect with the L2 cache.

Has anyone here tried remapping their block indexes to achieve higher levels of L2 caching? I’m not as concerned about the texture/readonly cache since my vec4 loads are pulling in and consuming whole cache lines at a time, but I suppose that cache might be better leveraged as well.
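To make that question concrete, here is a toy Python sketch of the kind of remapping I mean (the `group` parameter and grouped ordering are hypothetical choices for illustration, not how the hardware schedules anything): count how many distinct A-row and B-column panels the first wave of concurrently resident blocks touches; fewer distinct panels means more potential L2 reuse.

```python
# Toy model: how many distinct A-row / B-column panels does the first
# "wave" of concurrently resident blocks touch under each block ordering?
def identity_order(bid, grid_m, grid_n):
    # default scheduling: plain linear block id, row-major over the grid
    return bid // grid_n, bid % grid_n

def grouped_order(bid, grid_m, grid_n, group=4):
    # walk the grid column-major within groups of `group` rows
    per_group = group * grid_n
    first_m = (bid // per_group) * group
    size_m = min(grid_m - first_m, group)
    return first_m + bid % size_m, (bid % per_group) // size_m

def panels_touched(order, wave=16, grid_m=32, grid_n=32):
    coords = [order(b, grid_m, grid_n) for b in range(wave)]
    rows = {m for m, _ in coords}
    cols = {n for _, n in coords}
    return len(rows) + len(cols)

print(panels_touched(identity_order))  # 1 row panel + 16 col panels = 17
print(panels_touched(grouped_order))   # 4 row panels + 4 col panels = 8
```

Fewer panels in flight is the property the remapping is after; whether the L2 replacement policy actually rewards it is another question entirely.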

I think the first thing I’ll try is switching over to 8-bit normalized floats. That should cut the bandwidth down by a factor of 4 and validate my analysis. Should be a rather simple change to the code.

Ok, 8-bit normalized floats did the trick, mostly. I’m still about 65 Gflops short of where I should be (6415 Gflops or so).

So let’s call it 6.35 Tflops.

Also my 64 thread version is humming along again at just shy of the 256 thread version: 6.29 Tflops (although at a significantly higher power draw because of the double memory bandwidth needs).

Next up is to analyze the block ordering and see if it’s consistent, and if so, if there’s a way to improve cache usage with a remapping. Even if this doesn’t give me a lot of extra flops, it should significantly lower TDP levels.

Oh, and I’m unable to affect memory clocks with either Precision X or Afterburner. Maybe it doesn’t take for cuda apps? Maybe my 750Ti is fouling things up? Any advice from the experts here on that?

My only idea is to try NVIDIA Inspector and see if it works with your card. I was able to set a GTX Titan to the P2 state instead of P0 and it gave me some overclocking improvements. See if this works for you:

I’m impressed at the overclock you were able to do on that card AND be stable, that’s pretty amazing. I couldn’t get GK110 past ~1250MHz without instabilities, although mine was with a DP code.

I wonder if at that amount of SP FLOPS it makes sense to code DP arithmetic with floats. Something like this, for example: http://andrewthall.org/papers/df64_qf128.pdf.

Thanks, that did the trick. It seems cuda apps use the P2 state memory clock. Which I guess makes some sense if cuda apps are anticipated to have more random access than your typical graphics streaming type loads. Though sgemm works just fine at higher clock speeds. I got the memory to a max of 8000MHz (+500). That tool doesn’t allow setting the P2 state beyond that for some reason. So here are the new numbers:

With 8 and 16 bit normalized floats I can hit 6.4 TFlops. However the conversion done by the texture units generates a small number of errors at the +400 clock. I have to back that down to about +380 to eliminate those errors. Gflops drops down to about 6250 at that clock.

Full sized floats run fine at +400 and at 6.3 TFlops. Memory is still the bottleneck there.

Native 16bit floats is probably the best option for speed but I haven’t tested that. It should perform at the 6.4 Tflop level.

I did try a block remapping and it ended up not helping, actually hurting L2 cache hit rates. Even though I was able to cut by 33% the total amount of memory accessed simultaneously from all 16 SMs, the default scheduling (which is just a simple block id ordering) is better. This is because two of the blocks in the default pattern have 16 degrees of reuse, with the rest only 2, and L2 is better able to pick up on those than it is if you increase the overall amount of reuse but distribute it in a less concentrated way.

However, there may well be cuda programs where a block remapping could benefit cache use; sgemm is just not one of them.

As for DP performance, I think you’re better off waiting for Maxwell Tesla class hardware. Introducing non FFMA ops into the loop really kills the performance. I didn’t read that paper but I’m guessing lots of non-ffma math is required for this technique?

The Kahan method has considerable overhead. A multiply converts to about 7 operations and an add into 16. But compared to the horrible 1:32 DP rate of GM204 it could indeed be a useful technique. Native DP is still better in precision and elegance though. I often use Kahan methods in the “accumulator” case of summing billions of regular FP or DP numbers where I worry about losing precision in the final sum. You can reduce that use down to just five operations.
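For reference, that accumulator-only case is classic compensated summation; a minimal Python sketch (the CUDA version is the same handful of operations per element inside the accumulation loop):

```python
def kahan_sum(values):
    # Compensated summation: carry the rounding error of each add
    # in `c` and feed it back into the next addition.
    total = 0.0
    c = 0.0
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y   # the low-order bits lost in total + y
        total = t
    return total

# 1.0 followed by many tiny terms: a naive left-to-right sum never
# moves off 1.0, while the compensated sum recovers the tail.
vals = [1.0] + [1e-16] * 1000
print(sum(vals), kahan_sum(vals))  # naive stays 1.0; Kahan ~1.0000000000001
```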

Study the double-double implementation for CUDA. The code is short and you can learn a lot about the technique from the clean and concise implementation. It’s also great for study since it makes the power of FMA obvious.
It’s easy to convert that code into a float-float version for GM204 since the same rounding and FMA codes are used for both float and double.
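The core building block of those double-double routines is the error-free TwoSum step; here is a quick Python sketch using doubles (the float-float version is the same six operations, just at fp32):

```python
def two_sum(a, b):
    # Knuth's error-free transformation: s + e == a + b exactly,
    # with s = fl(a + b) and e the rounding error of that add.
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

# The error term recovers the bits the plain add throws away:
s, e = two_sum(1.0, 1e-17)
print(s, e)  # -> 1.0 1e-17
```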

I see you’re working on 8-bit / 16-bit input data for your application. It might be interesting to modify your SGEMM kernel to utilize the XMAD instruction instead of the FFMA instruction. This is a 16-bit multiply, 32-bit accumulate instruction, and operates at the same rate as FFMA. I suspect that this would increase the power efficiency since XMAD should consume less power than FFMA.

Yes, the XMAD and FFMA both have a pipeline depth of 6 clocks, and both operate at full throughput. However, XMAD does its computation in multiple steps. Even with 16-bit source values, mad.lo and mad.wide compile into these three XMADs:

mad d, a, b, c;   (x below is a temporary register)

XMAD.MRG x, a, b.H1, RZ;
XMAD d, a, b, c;
XMAD.PSL.CBCC d, a.H1, x.H1, d;

I haven’t studied the input/output relations yet of each step so perhaps fewer are required if certain things are known about the sources. For example if one of the values is a 16 bit immediate then only two instructions are used:

XMAD d, a, 0xffff, c;
XMAD.PSL.CBCC d, a.H1, 0xffff, d;

Though I wouldn’t be surprised if that XMAD.PSL.CBCC isn’t required since if “a” is 16 bit, a.H1 will be zero (the upper half of the register vs H0 the lower half).
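Without claiming the exact per-flag semantics of each XMAD step, the underlying arithmetic is just the schoolbook split of a 32-bit product into 16x16 partial products; a quick Python model of mad.lo built from three 16-bit multiplies:

```python
MASK32 = 0xffffffff

def mad_lo_ref(a, b, c):
    # reference: low 32 bits of a * b + c
    return (a * b + c) & MASK32

def mad_lo_xmad(a, b, c):
    # Same result from three 16x16-bit multiplies, mirroring the
    # three-XMAD sequence (the aH*bH partial product only affects
    # bits above 32 and is dropped entirely).
    aL, aH = a & 0xffff, a >> 16
    bL, bH = b & 0xffff, b >> 16
    cross = (aL * bH + aH * bL) & 0xffff   # only low 16 bits survive the shift
    return (aL * bL + (cross << 16) + c) & MASK32

print(mad_lo_xmad(0x12345678, 0x9abcdef0, 0x11111111) ==
      mad_lo_ref(0x12345678, 0x9abcdef0, 0x11111111))  # True
```

This also shows why a known 16-bit operand helps: if aH and bH are zero, the cross term vanishes and a single 16x16 multiply-accumulate covers the whole thing.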

My interest in using the normalized floats was to test the sgemm implementation with less required bandwidth. No code changes are required to the kernel, just some simple changes in the host code. But I agree a pure XMAD implementation could be interesting, particularly if only one XMAD is really needed.

I just tweaked my synthetic benchmark and measured the power difference between pure FFMA and XMAD instructions:


So FFMA is actually the more power efficient instruction. Oh, and on GM204 Nvidia is still robbing us of 2% of our clocks either before or after kernel execution.

So much for my lower-power theory then.

When fetching the 2 16-bit operands for the XMAD, are you fetching from a packed 32-bit register, e.g., first half and second half? This might make a difference versus fetching from two unpacked 32-bit registers.

That synthetic benchmark is written in raw assembly (sass) and was just the XMAD instruction with no flags and on full 32-bit registers. Let me know if you want me to try something more specific (if you can give me the sass dump of what you want, I can run it in a large unrolled loop for enough iterations to determine accurate throughput and power measurements).

How about something like

XMAD a, b, b.H1, c

Which I read as meaning take the first half of register b (16-bit), multiply by the second half of register b (16-bit), and add it to c (32-bit) and accumulate to a (32-bit).

It would also be interesting to see what effect the .reuse suffix has on power. I assume it allows reusing the operand collectors for lower power?

I took a look at your maxas project, but it doesn’t look like you have included the 8-bit load variant of SGEMM in there yet. Is that on your to-do list? Also I’m curious: what’s the power difference between this and the fp32 variant? I’d guess the big reduction in memory traffic has a noticeable effect on power. How much more power does it consume than a simple FMA burn test?

Apologies for the mound of questions (please read it as enthusiasm)

Anybody have any more performance metrics from the GTX 980?

Been waiting for the EVGA overclocked version to go on sale, but it’s still ‘out-of-stock’ on Newegg. They also seem to be increasing in price.

I am curious how it performs on a more memory-bound problem.