What's new in Maxwell 'sm_52' (GTX 9xx)?

Ok, I posted some exhaustive benchmarks for sgemm here (scroll to bottom):


I didn’t realize I was actually performing 4.8% faster than cuBLAS at the 4096 matrix size, and as much as 25% faster at smaller matrix sizes. Read the rest of the lengthy article to find out how.

CudaaduC: there are memory-bound plots in these benchmarks. You can see how well the L2 is able to hide this up to about the 4096 matrix size. As for advice, if you can hold out a couple of months, GM200 is just around the corner with more cores and a 384-bit bus:


Maddy: indeed the reuse flag drops power levels, enough so that an implementation with higher reuse coverage can run at a higher sustained clock. I’ll have the normalized float code checked in shortly, though it’s pretty simple stuff. You just change the size of your memory allocation, initialize it appropriately, then pass in a different flag for the texture format. The kernel requires no changes.

I’ll see if I can play around with XMADs today. The first thing I want to understand is how all the flags work, so I know what’s possible with that instruction. I strongly suspect that Nvidia has optimized the hell out of FFMA, and XMAD isn’t likely to touch it efficiency-wise. On a related note, I confirmed that FFMA.FTZ consumes the same amount of power as FFMA and is hence of little value.


A really great README/blog for maxas, BTW.

Maddy, sorry to take so long on this one… some things came up that I needed to address. But yes, your XMAD example is correct. XMAD multiplies at 16 bits and accumulates at 32, and the multiply can work off the high or low portion of either operand. So let’s say you loaded two values from each of A and B and kept them packed in 32-bit registers (R1, R2). You could do a portion of a gemm-like computation like so:

XMAD R3, R1, R2, R3;
XMAD R4, R1.H1, R2, R4;
XMAD R5, R1, R2.H1, R5;
XMAD R6, R1.H1, R2.H1, R6;

No additional XMADs are needed. Then, after all the XMADs and just prior to writing out the results, you could round and rescale them back to the 16-bit size (assuming something like fixed-point math).
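To make the packed arithmetic concrete, here is a rough model of what those four XMADs compute (Python is only standing in for the hardware semantics here; the values are arbitrary):

```python
# Model the four XMADs above: R1 and R2 each hold two packed unsigned
# 16-bit values; each XMAD multiplies one 16-bit half from each operand
# and accumulates into a full 32-bit register.

def lo(r):   # low 16 bits (the default operand half)
    return r & 0xFFFF

def hi(r):   # high 16 bits (the .H1 operand half)
    return (r >> 16) & 0xFFFF

def xmad(a, b, c):   # 16-bit multiply, 32-bit accumulate
    return (a * b + c) & 0xFFFFFFFF

# Pack two 16-bit A values and two 16-bit B values per register.
R1 = (7 << 16) | 3       # A = [3, 7]
R2 = (11 << 16) | 5      # B = [5, 11]

R3 = xmad(lo(R1), lo(R2), 0)   # XMAD R3, R1,    R2,    R3;
R4 = xmad(hi(R1), lo(R2), 0)   # XMAD R4, R1.H1, R2,    R4;
R5 = xmad(lo(R1), hi(R2), 0)   # XMAD R5, R1,    R2.H1, R5;
R6 = xmad(hi(R1), hi(R2), 0)   # XMAD R6, R1.H1, R2.H1, R6;

print(R3, R4, R5, R6)   # 15 35 33 77
```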

Even though FFMA operates at a little less power, you’d save power overall by halving the bandwidth. And it would run just as fast (or faster, because of improved cache performance with half sizes).

Also keep in mind you’d have to use my assembler to write this code. There is no way from ptx to use this mode of XMAD, which is kind of scandalous considering how useful this computation could be.

If you wanted 8-bit math, you’d need some I2I conversions from 8 to 16 bits just after your loads, hitting your performance a bit, but not by a huge amount.
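Modeled in the same spirit, the 8-bit path is just a widening step ahead of the 16-bit multiply (the specific I2I variant named in the comment is my assumption for illustration, not something from the thread):

```python
# Sketch of the 8-bit path: widen each 8-bit value to 16 bits (what an
# I2I conversion such as I2I.U16.U8 might do -- the exact form is an
# assumption) before feeding the 16-bit XMAD multiply.

def i2i_u16_u8(x):
    return x & 0xFF          # widen u8 -> u16 (zero-extend)

def xmad(a, b, c):           # 16-bit multiply, 32-bit accumulate
    return (a * b + c) & 0xFFFFFFFF

acc = 0
for a, b in [(200, 100), (250, 4)]:   # u8 input pairs
    acc = xmad(i2i_u16_u8(a), i2i_u16_u8(b), acc)

print(acc)   # 21000
```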

Oh, and I’ve updated my sgemm document with some new details and added a plot to the large size matrices that’s interesting (higher performance at lower occupancy for bandwidth bound kernels)


And I’ve added my much delayed and promised document on Nvidia’s control codes (the secret sauce that lets you actually write functioning code in sass)



Nice write-up on the control codes. Skimming now!

BTW, it looks like you can generate an XMAD with a PTX “mad.wide.s16 s32,s16,s16,s32” instruction.

My test snippet spits out several XMADs though (didn’t study it – just pulled it out of my snippet dump).

Also note that the PTX “vmad” instruction looks very similar but can scale, saturate and add “+1”. Not sure if this is emulated in Maxwell.

Yes, those 3 XMADs are the problem. When using 16-bit multiply source operands (the ADD and destination operands can be 32-bit), the additional two instructions are effectively no-ops. Then there’s also no way to specify the high portion of the register (though maybe you could use a vector source in ptx).
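For reference, here is a model of what mad.wide.s16 requires that XMAD sequence to compute: d = a*b + c, with signed 16-bit multiply sources widened to a 32-bit product and a 32-bit addend (Python used only to pin down the wrap-around semantics):

```python
# Model of PTX mad.wide.s16: d = a*b + c, where a and b are s16, the
# product is widened to s32, and c/d are s32.  Illustrative semantics
# only -- not the SASS that ptxas actually emits.

def s16(x):   # wrap to signed 16-bit
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def s32(x):   # wrap to signed 32-bit
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mad_wide_s16(a, b, c):
    return s32(s16(a) * s16(b) + c)

print(mad_wide_s16(-300, 200, 100))   # -59900
print(mad_wide_s16(0x7FFF, 2, 0))     # 65534
print(mad_wide_s16(-1, -1, 0))        # 1
```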

Odd that ptx has all that fine grained control for vmad but not for mad as well. Or maybe vmad actually maps onto XMAD… the video instructions are one area I have not touched with the assembler.

Ok, I just added full VMAD opcode coverage to the assembler. Interestingly, it’s full throughput when both operands use h0 or h1 (which probably means it’s just an alias for XMAD in this mode), and it becomes half throughput when at least one operand is b0, b1, b2, or b3, or if you’re using the scaling factor (which makes sense, since shift ops are half throughput), or if you’re using plus1 or saturate.

It does allow you to do signed arithmetic, which XMAD doesn’t support at half sizes.

The 16 wide full throughput version runs at the same TDP as XMAD and slightly above FFMA. The 8 wide half throughput runs at less power… but half as fast.

I was hoping it would have some use in fixed point gemm and I guess the signed feature is nice, but the half throughput kills it for 8 bit precision. I think I’d rather use I2I there or maybe just floats.

One guy claims to have rewritten the Blake hash function in Maxwell assembly language, and he reports a 400% increase in hashing speed (for this hash function only). Too bad he’s keeping the software private for now. [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX]

If he used maxas I wouldn’t be surprised. Figured that community would be the first to jump on it, even in the somewhat raw state it’s still in. Writing software that’s easy for other people to use versus just yourself is an order of magnitude more work.

Scott, finally got a chance to read your control code writeup. Good stuff.

You know there is someone at NVIDIA who can’t sleep at night knowing that there is one last bit available in that 64-bit Maxwell control line (3 * 21 = 63). :)

Looking to find the time to use MAXAS!

I’m curious: in your opinion, how hopeless is CUDA C for writing a fast SGEMM? Something like 70% efficiency, as with PTX?

I was able to hit 75% with lots of voodoo optimizations to get the register count below 128. But ptxas tends to optimize away any shared memory double buffering, and its register allocation isn’t that intelligent when using vector loads. And obviously the kind of bank-conflict and reuse optimization I used isn’t possible in ptx.

Oh, and I consider ptx and cuda c to be pretty much equivalent. There is almost nothing to gain in using a pure ptx kernel. Nearly all compiler optimization is done by ptxas.


Re: Nearly all compiler optimization is done by ptxas.

This may hold for this particular code, but not in general. PTXAS is mainly responsible for machine-specific optimizations, and in particular register allocation and instruction scheduling. It also performs a limited number of general-purpose optimizations such as common sub-expression elimination and loop unrolling. Generally, the NVVM frontend performs the bulk of the machine-independent code transformations, and as it is based on LLVM it incorporates a state-of-the-art set of optimizations.

One challenge for the CUDA toolchain is that PTX serves as both a public programming interface (a portable assembly language) and as an intermediate toolchain representation. Traditionally, intermediate compiler representations are less public, more information-rich, and more easily modified. This means that the use of PTX, necessitated by the lack of binary compatibility between the hardware ISAs, can on occasion lead to some “friction losses” inside the compiler. Another challenge is that PTXAS needs to be adjusted for each new ISA, and some time is required for the new components to reach full maturity. Those challenges notwithstanding, I would claim that the performance stability of the CUDA toolchain is much improved compared to the early years of CUDA, and the most recent transition to Maxwell was handled very smoothly.

I was the author of early versions of CUBLAS, and an “efficiency” of 75% on GEMM for compiled CUDA code seems par for the course. It has been a very long time since I last tackled this problem on a CPU (ca. 2000, for the AMD Athlon processor), but if memory serves, compiled C/C++ code without the benefit of inline assembly language and/or special intrinsics achieves similar efficiencies on CPUs. If anybody has published numbers on this I would love to see them; surely CPU SIMD architectures and auto-vectorizing compilers have evolved in the past 14 years.

True, gemm is basically just a streaming calculation, mostly about optimizing scheduling and register allocation. From my explorations of the mapping from cuda c to ptx, it seemed mostly a simple one-to-one mapping. But I wasn’t looking at anything particularly complicated.

Scott, a question for you. I’m curious why you assert that using streams will increase overall utilization? Surely, there is nothing to be gained versus just having a bigger grid? In fact I would have thought it was always better to favour bigger grids over concurrent kernels since this minimizes CPU-GPU interactions (e.g., kernel launch overhead).

I haven’t actually investigated this, but I’m pretty sure that if you have a bunch of small matrices to multiply, it’s better to use several streams with a kernel designed for large matrices that can run at full throughput. Even though each matrix multiply won’t fill the gpu with work by itself, it will with multiple streams.

The alternative would be to work with a kernel that is designed for small matrices and can fill up the gpu with lots of blocks… but those blocks wouldn’t be capable of full throughput due to higher bandwidth needs and reduced register blocking and hence ILP.

Though with multiple streams you dilute your L2 cache considerably. So, I don’t really know.

There’s a good chance I may end up benchmarking this for work and I’ll let you know what I find.

I believe the optimal approach would encapsulate both of these: use a kernel that is designed for large matrices, but add batching support directly at the kernel level so you avoid the overhead of streams (kernel launches). I.e., in GEMM, only 2-dimensional tiling is considered (threadIdx and blockIdx) and the z dimension is unused. Trivially, blockIdx.z could indicate the batch number, and all 2-D tiling at the block level would remain unchanged. The only difference would be the addition of an offset, e.g., z*length_x*length_y, when indexing into the matrices in global memory.

The benefit of this approach over the stream approach would be especially apparent in the small-K regime, where kernel launch latency could become significant.
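A quick sketch of the indexing that batched variant implies (Python stands in for the kernel's address arithmetic; the sizes and names are illustrative only):

```python
# Sketch of the batched-GEMM indexing described above: keep the 2-D
# block tiling unchanged, and let the z block index select the batch.
# Each matrix occupies rows*cols contiguous elements in a flat buffer,
# so the only change versus the single-matrix kernel is a per-batch
# base offset (the z*length_x*length_y term).

rows, cols = 4, 4            # illustrative, tile-friendly sizes
batches = 3

def base_offset(block_z, rows, cols):
    # blockIdx.z picks the batch; all 2-D tiling offsets add to this.
    return block_z * rows * cols

# A flat buffer holding `batches` matrices back to back.
A = list(range(batches * rows * cols))

def elem(z, i, j):
    # Element (i, j) of batch z, row-major inside each matrix.
    return A[base_offset(z, rows, cols) + i * cols + j]

print(elem(0, 0, 0), elem(1, 0, 0), elem(2, 3, 3))   # 0 16 47
```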

I like that approach a lot. I think I may end up building exactly what you describe.

Great, looking forward to seeing the results :)

Very awesome work! What is the best way to cite your work? Would something like this do -

Gray, S. 2014. MaxAs: Assembler for Nvidia Maxwell Architecture, Google Code Archive - Long-term storage for Google Code Project Hosting.

Works for me…