So what's new about Maxwell?

“Does it fail if you dynamically allocate shared?”

Yes, unfortunately (it fails even if you don’t touch any memory). I also tried using cudaDeviceSetCacheConfig() with various parameters but it didn’t help.

Wow, I got my 750 Ti this afternoon and only ordered yesterday. A good benefit of living in the bay area.

File a bug report! Such a failure is indeed a bug: either the compiler is wrong in claiming that Maxwell supports 64K of shared memory, or the driver (?) is wrong in how it actually executes the code.

The NVidia response will also be interesting regardless.

Or, perhaps Mr. Juffa could help us with some sm_50 answers here? (Looking up expectantly…)

Register spills. But I don’t know what else would be cached. I’m surprised local memory isn’t in L1 though… the (rare) use of local memory is for small per-thread arrays, and the L1 cache seems like it’d be the best place for such data.

Most likely is that Maxwell is indeed simplifying things to keep the hardware as clean and focused as possible. It’s obviously paid off in GM107’s density and power.

Maybe it’s 64KB total across two or more blocks and 48KB max per block?

Inquiring minds want to know! :)

Correct: it is 48 KB per block and 64 KB per SM.
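A minimal sketch of what those limits mean in practice (hypothetical kernel; the 48 KB per-block cap covers static plus dynamic shared memory):

```cuda
// Hypothetical kernel using dynamic shared memory.
extern __shared__ float tile[];

__global__ void copy_scaled(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = 2.0f * tile[threadIdx.x];
    }
}

// 48 KB is the most a single block may request:
//   copy_scaled<<<grid, block, 48 * 1024>>>(d_out, d_in, n);  // OK on sm_50
//   copy_scaled<<<grid, block, 64 * 1024>>>(d_out, d_in, n);  // launch fails
// But two resident blocks of 32 KB each can occupy one SM together,
// since the per-SM total is 64 KB.
```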

Thanks for the confirmation.

What else can you say about sm_50’s differences from sm_35’s?

A tuning guide will be released soon.

cudaminer, that’s the name. Don’t forget that the currently fastest kernels were submissions by David Andersen and nVidia. It’s not just me who has authored this miner.

Ah, sorry, I got the name mixed up with “the competition” :-)

The website referred to it as “Christians CUDAMiner”.

Here the 60-watt 750 Ti mines litecoins on par with the 150-watt Radeon 265:

What about hardware access to the main system memory?
If I remember correctly, this was supposed to come with the Maxwell architecture, but I can’t find anything about it anywhere.

Hardware access to main system memory has been available for some time in the form of zero-copy; there are CUDA samples which demonstrate this. Perhaps you are referring to Unified Memory, which is a new feature in CUDA 6 (you can play with it using the CUDA 6 RC, available now to registered developers). It is supported on devices of cc 3.0 or newer, so that includes the GTX 750/750 Ti, which as already indicated are cc 5.0.
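For the zero-copy case, a minimal sketch along the lines of the simpleZeroCopy sample (hypothetical kernel and sizes; error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main(void) {
    const int n = 1024;
    int *h_data, *d_alias;

    // Allocate pinned, mapped host memory...
    cudaHostAlloc(&h_data, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = i;

    // ...and get the device-side pointer that aliases it. The GPU then
    // reads and writes host memory directly over the bus.
    cudaHostGetDevicePointer(&d_alias, h_data, 0);

    increment<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();   // no cudaMemcpy needed in either direction

    printf("h_data[0] = %d\n", h_data[0]);
    cudaFreeHost(h_data);
    return 0;
}
```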

The feature you are referring to is called “Managed Memory” or “Unified Memory”, depending on what you are reading. It is described in the CUDA 6.0 C Programming Guide, Appendix J. For now, you have to be a registered developer to get access to the CUDA 6 release candidate documentation.
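A minimal sketch of the managed-memory model described there (hypothetical example; cudaMallocManaged is the CUDA 6 entry point):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void square(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

int main(void) {
    const int n = 256;
    float *x;
    // One allocation, one pointer, valid on both host and device;
    // the runtime handles migration between them.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = (float)i;   // host writes directly
    square<<<1, n>>>(x, n);
    cudaDeviceSynchronize();   // required before the host touches x again
    printf("x[2] = %f\n", x[2]);

    cudaFree(x);
    return 0;
}
```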

I’m pulling together some observations and measurements related to managed memory to post to the forum soon. I’m hoping to wait until Saturday, after my GTX 750 Ti is delivered so I can compare the performance to Kepler.

From GTC 2013.

“Our next-generation GPU architecture will bring a number of hardware improvements to further increase performance and flexibility. On these GPUs, the system allocator will be unified, meaning any memory (whether allocated with malloc() or on the stack) can be shared between the CPU and GPU.”

Just got my 750Ti.

While prepping some initial benchmarks and demos, I happened to notice that CUDA 6.0RC has new code-generation options for NVCC in the SDK sample Makefiles. Note the new “compute_50” code generation option, and the “SMXX” entry. Fodder for speculation…

# CUDA code generation flags
ifneq ($(OS_ARCH),armv7l)
GENCODE_SM10    := -gencode arch=compute_10,code=sm_10
GENCODE_SM13    := -gencode arch=compute_13,code=sm_13
GENCODE_SM20    := -gencode arch=compute_20,code=sm_20
GENCODE_SM30    := -gencode arch=compute_30,code=sm_30
GENCODE_SM32    := -gencode arch=compute_32,code=sm_32
GENCODE_SM35    := -gencode arch=compute_35,code=sm_35
GENCODE_SM50    := -gencode arch=compute_50,code=sm_50
GENCODE_SMXX    := -gencode arch=compute_50,code=compute_50

Roman numerals. GENCODE_SMXX is actually GENCODE_SM20. :)

I bet it’s SM_55? SM_50 + Denver?

My best guess:

I don’t think this has any predictive qualities. The way I read this is that it is simply building a fat binary with machine code for all existing architectures, adding PTX for the latest shipping architecture so the object code can be JITed on future architectures. In other words, SMXX appears to be simply a placeholder for any unknown future architecture, and its value should always be the latest shipping compute_xx architecture.

Your explanation is so clear that it’s undoubtedly correct. And it makes me realize that for 4 years now I’ve been compiling my CUDA binaries without JIT future-proofing them; I somehow thought the PTX was always baked in. There’s always more to learn! Thanks, Juffa-sensei!

Depending on how you built your fat binaries, you may have actually been baking in N different PTX versions in addition to N different SASS versions.

I forget whether simply using -arch generates both SASS and PTX, and I do not have a machine with CUDA in front of me to try, but I think it does? The -gencode flag as used in the SDK Makefile provides the fine-grained control needed for the optimal fat binary: N versions of machine code plus one version of PTX (the latest) to future proof the binary.
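As a concrete sketch (mirroring the SDK Makefile above), a fat binary covering each shipping architecture plus forward-compatible PTX could be built with:

```
# SASS for each shipping architecture, plus compute_50 PTX that future
# drivers can JIT-compile for architectures that don't exist yet.
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_50,code=compute_50 \
     -o app app.cu
```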

I have been wrestling with the new CUDA 6.0 drivers, which seem to cause fatal instability with my MSI NF-980a motherboard’s onboard NForce GPU (an archaic motherboard whose great redeeming feature is supporting 4 compute GPUs without wasting one on display). Nonetheless, I am able to run command-line CUDA programs as long as I don’t use the onboard video.

A good initial observation, confirmed by all the reviews: the Maxwell GTX 750 Ti (superclocked EVGA) has an amazing idle power use of only 3 or 4 watts, measured from the wall. This is even lower than the reviews report, but that may be because there’s no display hooked up to it, and there’s probably 1 or even 2 watts of uncertainty anyway. Either way, its idle power is negligible, which is nice to know when you have a lot of GPU compute nodes. I’m using the standard Kill-A-Watt to measure power use.

I’ve only run one test so far (due to the fighting with the drivers crashing my machine). I used the standard CUDA nbody program’s benchmark mode. This is not the best CUDA benchmark, but it’s still very interesting. There’s good discussion about using nbody as a benchmark elsewhere, with caveats: it’s not tuned for a specific architecture, and it is not indicative of all CUDA performance.

I ran nbody with a variety of body counts, since GPU performance has often peaked for certain worksizes (see that linked thread for discussion). I ran it first on a K20c and then on the 750 Ti. (Both are in the same machine and running the same CUDA 6.0 compiled nbody executable.) These are the floating point throughputs nbody reports, in GFLOP/s, rounded to two significant figures.
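(For reference, and as an assumption about the exact invocation: the CUDA nbody sample exposes its benchmark mode via command-line flags, e.g.)

```
# -benchmark prints throughput; -numbodies sets the workload size
./nbody -benchmark -numbodies=65536
```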

Bodies       K20c      750 Ti    (GFLOP/s)

 16384       2300       830
 32768       2400       870
 65536       2300       890
131072       2500       890
262144       1900       890
524288       1800       880

The interesting trend to notice is that the Kepler K20c clearly has a “sweet spot” where the load seems to match the GPU’s configuration to deliver best throughput. The Maxwell 750 Ti is much more robust and handles a diverse workload much more consistently. This might be a clue that Maxwell’s scheduling (at block and/or warp level) is working with fewer hiccups or stalls, though that’s still just speculation.

Wall power use was unfortunately meaningless, since nbody uses CPU polling and the CPU therefore draws a lot of power. The wall power increase over idle was approximately 375 watts for the K20 and 135 watts for the 750 Ti. These numbers include the busy CPU, so they don’t tell us what the GPU alone draws.

One interesting power clue, though: the K20’s power use was clearly irregular over time; it would swing 100+ watts above or below its average every second while computing. The 750 Ti had a steady power use with a variation of only ± 5 watts. Again speculating: is this an artifact of the K20 stalling on some parts of the workload, while the 750 Ti works steadily? The power was measured on the largest workload of 524288 bodies (at which the K20 was clearly no longer working at peak throughput).

I don’t have more data yet, mostly due to the fact that my own applications are hard to run with the new driver bug on that motherboard. I’ll try to report more later.

My tentative initial feeling is that Maxwell, even in this first generation, is a polished CUDA machine as long as you don’t need DP throughput.

What is the total size of all of your bodies? Do you see a drop-off once all the bodies don’t fit in the new huge L2 cache?

I’m guessing the K20 drops off earlier as things don’t fit in on-chip or L2 memories?