So what's new about Maxwell?

This is the Maxwell facts thread. Let’s end the speculation and rumours. It’s just one hour before the press embargo is lifted. Let’s start collecting deviceQuery dumps, performance numbers, benchmarks, instruction throughput figures etc. all in one place.

Also: what hardware features are new? How can we make use of them?

Lights candles, spins prayer wheel: “shared.mem, shared.mem, …”

…and Tom’s Hardware has it live: GTX 750 Ti Review

EDIT: it’s cool when your own software is being used as a benchmarking tool by Tom’s Hardware. I am talking about cudaminer, of course. ;)

two more reputable sources:
Guru3D review
AnandTech review

And here’s the article on AnandTech: http://www.anandtech.com/show/7764/the-nvidia-geforce-gtx-750-ti-and-gtx-750-review-maxwell/

Presumably, Maxwell is the sm_50 architecture that was added to nvcc in the CUDA 6 release candidate. (I wonder what happened to sm_40?)

So there’s now 64 KB of shared memory per (full) SMM unit, which is a step up from the previous 48 KB. Does anyone know whether shared memory bandwidth has been improved as well?
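In the meantime, here’s a minimal host-side sketch of the interesting deviceQuery fields (compile with `nvcc query.cu`; the `sharedMemPerMultiprocessor` field may not exist in every toolkit’s `cudaDeviceProp` — drop that line if your headers don’t have it):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    std::printf("shared mem per block:  %u bytes\n", (unsigned)prop.sharedMemPerBlock);
    std::printf("shared mem per SM:     %u bytes\n", (unsigned)prop.sharedMemPerMultiprocessor);
    std::printf("32-bit regs per block: %d\n", prop.regsPerBlock);
    std::printf("L2 cache:              %d bytes\n", prop.l2CacheSize);
    return 0;
}
```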

EDIT: I’ve just ordered an MSI 750Ti card.

Christian

I’m curious to see how Maxwell improves interaction with CUDA 6 managed memory.

(BTW, I’ve submitted a pull request for managed memory support in PyCUDA. Hopefully that will be merged soon.)

Edit: I have an EVGA 750 Ti en route now. :) Interesting that I had to go direct to the manufacturer’s page to order, rather than the usual computer parts vendors I buy from.

We need a deviceQuery - best taken with the CUDA 6.0 RC SDK! Anyone?

Remember that __launch_bounds__() and __shared__ declarations are verified by the compiler at compile time.

NVCC -arch sm_50 and a test program reveal:

  • 64 KB of shared memory
  • 64K 32-bit registers
  • max threads per block remains at 1024
  • max blocks per SM is increased from 16 to 32
  • max threads per SM remains at 2048

So apart from the higher block limit, the only change is the increased shared memory per SMM - unless I did something wrong.
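For anyone who wants to reproduce the test program: something like this is enough (a sketch - the file name is a placeholder). Bump either the launch bounds or the shared allocation past the architecture’s limit and compilation fails:

```cuda
// probe.cu - compile with: nvcc -arch=sm_50 -c probe.cu
// Both the launch bounds and the static shared allocation are
// checked against the sm_50 limits at compile time.

__global__ void __launch_bounds__(1024)   // max threads per block for sm_50
probe(float *out)
{
    // 12K floats = 48 KB, the per-block shared memory limit;
    // declaring more than that is a compile-time error.
    __shared__ float buf[12 * 1024];

    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}
```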

Furthermore, dumping the SASS of kernels compiled with sm_50 reveals some new instructions like DEPBAR, SYNC and XMAD. No idea what they do.
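For anyone who wants to poke at the new instructions themselves, this is roughly how to dump the SASS (the file names here are placeholders):

```shell
# compile for Maxwell and disassemble the cubin
nvcc -arch=sm_50 -cubin -o kernel.cubin kernel.cu
cuobjdump -sass kernel.cubin

# or disassemble straight from a built executable
nvcc -arch=sm_50 -o app kernel.cu
cuobjdump -sass app
```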

Like @seibert, I ordered directly from EVGA and picked up the stubby 750 Ti SC.

one more mining-centric review on Crypto Mining Blog

and 300 kHash/s when overclocked to the limit: Crypto Mining Blog

I’d love to have some information about how to make use of the ARM processor that is supposed to be inside this Maxwell chip. Is it only used internally by the driver to offload some things (like dynamic parallelism) or is it also accessible to the programmer?

@cbuchner1, be honest, just how many 750Ti’s did you order this morning? :)

Maxwell’s L1 is no longer split with shared memory. From the SMM diagrams, it seems like L1 and the texture cache have been merged in some way.

I have also ordered an EVGA GTX 750 Ti SC.

The Titan Black also launched today, and it’s aimed at us CUDA guys.

I assume Dynamic Parallelism & Hyper-Q will be found in sm_50. The last page of the AnandTech article states that they are “baseline” features in Maxwell.

2MB of L2 is very interesting as well!

Now that I have to pay for CUDA devices out of my pocket, I very much appreciate that NVIDIA decided to lead the Maxwell architecture release with a low-end desktop card. :)

One. For review and development purposes.

A friend of mine will be getting 10. His first mining farm.

Because of some product expectations I have, I will hold off on buying single-GPU cards. I bought an Asus MARS recently, and I want that kind of device for mining, but definitely Maxwell-based. We need more hash power density - right up to the power limit that a single PCI Express card can provide (would that be 250 watts?)

Christian

Awesome! Hopefully we see GM10x follow-on cards very soon.

Does anyone know if the GTX 750 Ti has Dynamic Parallelism? At https://developer.nvidia.com/cuda-gpus it appears with compute capability 3.0…
I want to try the new architecture, but I need this feature.
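If the card does report compute capability 5.0 (or anything ≥ 3.5), a minimal sketch like this would confirm Dynamic Parallelism - the file name and launch shapes are just placeholders:

```cuda
// cdp.cu - requires compute capability >= 3.5, compile with:
//   nvcc -arch=sm_50 -rdc=true cdp.cu -lcudadevrt

#include <cstdio>

__global__ void child()
{
    std::printf("child block %d\n", blockIdx.x);
}

__global__ void parent()
{
    // Device-side kernel launch: this only links and runs if the
    // target supports Dynamic Parallelism.
    child<<<2, 1>>>();
    cudaDeviceSynchronize();  // device-side wait for the child grid
}

int main()
{
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```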

Not directly related to Maxwell, but I’m pleased to see improved code generation in CUDA 6.0. After recompiling my image processing codes, the instruction count dropped by 12% and kernel time by 22%!

One thing I’ve always been bothered by is the very inefficient array indexing code. Unlike x86, which can compute index * scale + offset + constOffset with a single load/store instruction, CUDA actually uses multiply and add instructions to do it (you can translate the array index into an induction variable, but that increases register use). 64-bit addressing makes it worse by doubling the number of instructions.

It took me a while to realize why my simple code had 2 multiplies for each memory load:

addressLow32 = IMAD(index, scale, pointerBaseLow32) // compute lower 32 bits of address
addressHigh32 = IMAD.hi(index, scale, pointerBaseHigh32) // compute upper 32 bits

With CUDA 6, the address generation is improved to:

addressLow32 = IMAD(index, scale, pointerBaseLow32)
sign = index < 0 ? 0xffffffff : 0
addressHigh32 = IADD.X(sign, pointerBaseHigh32)

which could be better for throughput, but makes inspecting assembly code even harder by littering it with more address calculation.

We should know as soon as someone gets one and prints the device caps. It’s likely sm_35 or the (new) sm_32. It is not the sm_37 buried in the CUDA 6.0 headers (which provides more shared memory than the 64K GM108 is known to have).

Even GK208 is sm_35.

One (small) clue is from the GM107 white paper, which says “our first-generation Maxwell GPUs offer the same API functionality as Kepler GPUs”. That doesn’t tell us anything really except it’s sm_3x.

Maxwell is Compute 5.0, I know that much from cudaminer screenshots that were sent to me.