Maxwell: The Most Advanced CUDA GPU Ever Made

Originally published at:

Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization of the Maxwell architecture. The GeForce GTX 980 and 970 GPUs introduced today are the most advanced gaming and graphics GPUs ever made. But of course they also make fantastic CUDA development GPUs, with full…

I am wondering if the warp scheduler still schedules ready instructions with zero overhead.

And if the L1 cache is similar to Kepler's, where it is reserved only for local memory accesses such as register spills, not for global data accesses.

I'm guessing the FP32 to FP64 ratio for upcoming cards like GM200 for Tesla/Quadro/Titan will be back to 1:2?

A question on compiling the existing code for Maxwell: if I use CUDA libraries, like cuBLAS or cuFFT, should I switch to newer versions of these?

Sorry, I can't comment on unannounced products.

In general it is in your best interest to upgrade to the latest versions of CUDA libraries. Every release contains new improvements and performance optimizations, as well as new features.
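One reason upgrading is usually painless is that the cuBLAS v2 API has been stable across CUDA releases, so moving to a newer library is typically just a recompile and relink. A minimal sketch (no error checking or data initialization, matrix size chosen arbitrarily):

```cuda
// Minimal cuBLAS SGEMM sketch; the v2 API shown here is unchanged across
// recent CUDA releases, so upgrading the library is usually just a relink.
// Build with: nvcc sgemm.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 256;                       // square matrices for simplicity
    float *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C (column-major storage, no transpose)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```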

That's fine, I understand :-)

Very impressed by the performance utilization we're getting out of these cards!

Some users are reporting up to 98% utilization on SGEMM, or 6.35 TFLOP/s on an overclocked GTX 980. Probably some kind of record!

The L1 and texture caches are now unified, and local memory spills are handled by the L2 instead.

"Local loads also are cached in L2 only, which could increase the cost of register spilling

if L1 local load hit rates were high with Kepler. The balance of occupancy versus

spilling should therefore be reevaluated to ensure best performance. Especially given

the improvements to arithmetic latencies, code built for Maxwell may benefit from
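Since local spills now land in L2, re-evaluating occupancy versus spilling mostly comes down to the register-limiting knobs. A sketch of the idea (the kernel and its bounds are illustrative, not from the post):

```cuda
// Illustrative kernel for the occupancy-vs-spilling trade-off: the
// __launch_bounds__ qualifier caps registers per thread, which can raise
// occupancy but may force spills (serviced by L2 only on Maxwell).
__global__ void
__launch_bounds__(256, 4)   // max 256 threads/block, at least 4 blocks/SM
scale_kernel(float *out, const float *in, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}
// Compile with "-Xptxas -v" to see register and spill counts, e.g.:
//   nvcc -arch=sm_52 -Xptxas -v kernel.cu
// Compare timings with and without the bound (or with -maxrregcount=N)
// before settling on a configuration; Maxwell's lower arithmetic latencies
// may favor fewer resident warps with more registers each.
```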


Looks like SM 5.2 also works on the 750 Ti, but it gives similar performance there, unlike on the 970 (compared to SM 5.0).

Any chance sample code with VXGI implementation will be available for all developers?

I'm not sure what you mean. I suspect you mean you are compiling with -arch sm_52 but running on a 750 Ti, which is sm_50. All that means is that your program will JIT the PTX stored in the binary to sm_50 before running, so naturally performance should be similar to compiling with sm_50 explicitly.
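To sidestep the JIT step entirely, you can embed native SASS for both architectures in one fat binary and verify which device you are on at runtime. A sketch (file and program names are hypothetical):

```cuda
// Sketch: query the device's compute capability to confirm which embedded
// code the runtime will select for it.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s, sm_%d%d\n", prop.name, prop.major, prop.minor);
    // A fat binary built with
    //   nvcc -gencode arch=compute_50,code=sm_50 \
    //        -gencode arch=compute_52,code=sm_52 \
    //        app.cu -o app
    // ships native SASS for both sm_50 (750 Ti) and sm_52 (970/980).
    // The runtime picks the exact SASS match and JIT-compiles embedded
    // PTX only when no SASS in the binary matches the device.
    return 0;
}
```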

See this post:

I'm not sure; when I compile with both 50 and 52, the 750 Ti seems to use the 52 variant, which is a bit slower on it... I've already seen the compile lines when the JIT is used, and that's not the case here.

Correct me if I am wrong, but considering that each quadrant has only 8 LD/ST units, I think a memory operation (one warp) would take 4 cycles to complete, wouldn't it? How does that fit with scheduling and dispatching?

Hi, my interest is in massively parallel computing. Do you know if a Maxwell Tesla (compute-only) card will appear any time soon? Do you advise buying a current Tesla K40, or waiting for a Maxwell Tesla card and getting by with one or several GTX 980s in the meantime?

Thanks in advance for your answer!!

Hi Mark, I was just wondering how I can enable the local L1 cache on second-generation Maxwell (sm_52). I want to use the L1 cache to improve performance when registers spill to local memory; is that possible? All I can find on the internet is -Xptxas -dlcm=ca, but that is for the global L1 cache. What is the specifier for the local L1 cache? Thank you very much!

Hi YTROCK, on Maxwell locals are cached in L2 only. You can query whether a given GPU supports local caching in L1 using the localL1CacheSupported device attribute. Full details here:
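The attribute Mark mentions can be checked from the runtime API; a minimal sketch:

```cuda
// Sketch: query whether the current device caches local memory in L1,
// via the cudaDevAttrLocalL1CacheSupported device attribute.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrLocalL1CacheSupported, 0);
    printf("Local L1 caching supported: %s\n", supported ? "yes" : "no");
    // On second-generation Maxwell (sm_52) this reports "no": local loads
    // are cached in L2 only, and no compiler flag redirects them to L1.
    return 0;
}
```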

Got it, thank you very much!

The cache hierarchy was changed; I want to know the size of the L1 cache in Maxwell. I would appreciate a reply. Thank you very much.

Hi all, I have just bought an NVIDIA Jetson TX1. It has a Maxwell GPU with 256 cores. Does anybody know where I can find the documentation for this specific Maxwell GPU with 256 cores? Thank you in advance.

Hi Hong Quan, yes, these GPUs should support MPS; however, keep in mind that MPS requires 64-bit Linux. Also, some of the GPUs you mention are laptop GPUs with only a small number of SMs, so they might not be well suited to MPS. Finally, you can't run MPS while the X server is running, which may make it difficult to use on a laptop.