Paper on performance difference between Kepler Titan Black and Maxwell Titan X

Thought this was an interesting paper, mainly because of the sizable gap between the expected performance difference and the difference actually measured:

https://docs.google.com/viewer?url=http://arxiv.org/pdf/1511.00088v1

From page 5 of the paper:

…the peak DP performance for TX is 1/7 of that for TB theoretically.
Hence, theoretically we expect that the NPR code runs 7 times faster on a TB GPU. In Fig. 3, we
present the DP performance of the NPR code on TB and TX GPUs. To our big surprise, it turns
out that the TX GPU outperform the TB GPU by 5.0(6)%.
What is the reason for this unexpected surprise? The answer is that our NPR code is dominated
by the data transfer between the GPU global memory and the CUDA cores. Hence, even though
the TB GPU has 7 times more DP calculation power than TX, the TB GPU cannot feed the data
into the CUDA cores as fast. The bottle neck is on the data transfer in the case of the NPR code.

That’s entirely possible for a memory-bound application, although I wonder whether they double-checked to make sure that they are operating the Titan Black in “DP-heavy” mode rather than “DP-lite” mode, which is how it comes up by default, as far as I know (I have never used Titan Black).
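If anyone wants to rule that out, a quick way is to measure sustained DP FMA throughput directly and compare it against the card's quoted DP peak. Below is a minimal sketch of such a microbenchmark (not from the paper; the kernel name, launch configuration, and iteration counts are all made up for illustration). On a Titan Black in DP-heavy mode the reported number should land in the same ballpark as the quoted DP peak; in the default mode it should come out several times lower.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Chained double-precision FMAs; with ~256K threads in flight the
    // dependency latency is hidden and the DP units stay saturated.
    __global__ void dp_fma_kernel(double *out, int iters)
    {
        double a = 1.0 + threadIdx.x * 1e-9;
        const double b = 1.000001, c = 0.999999;
        for (int i = 0; i < iters; ++i) {
            a = fma(a, b, c); a = fma(a, c, b);
            a = fma(a, b, c); a = fma(a, c, b);
            a = fma(a, b, c); a = fma(a, c, b);
            a = fma(a, b, c); a = fma(a, c, b);
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // keep the result live
    }

    int main()
    {
        const int blocks = 1024, threads = 256, iters = 100000;
        double *d_out;
        cudaMalloc(&d_out, (size_t)blocks * threads * sizeof(double));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        dp_fma_kernel<<<blocks, threads>>>(d_out, iters);   // warm-up
        cudaEventRecord(start);
        dp_fma_kernel<<<blocks, threads>>>(d_out, iters);   // timed run
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // 8 FMAs = 16 flops per loop iteration per thread
        double flops = 16.0 * iters * (double)blocks * threads;
        printf("sustained DP throughput: %.1f GFLOPS\n", flops / (ms * 1e6));

        cudaFree(d_out);
        return 0;
    }

Build with something like nvcc -arch=sm_35 (sm_52 for the Titan X) and run it on each card before and after changing the DP setting.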

Similar effects can be seen in compute-bound code, where, for example, the throughput of DP math functions doesn’t track the difference in DP peak FLOPS closely, because about a third to half of the instructions used in the math function code are non-DP instructions (e.g. SFU, integer-ALU, branches, constant loads).
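For anyone curious what that instruction mix looks like: compile a kernel that is dominated by a DP math function and dump its SASS with cuobjdump. The sketch below is a made-up example (names and sizes are arbitrary), not code from any particular application.

    #include <cuda_runtime.h>

    // Kernel dominated by a double-precision math function. Compile with
    //   nvcc -arch=sm_35 -c dp_exp.cu && cuobjdump -sass dp_exp.o
    // and count how many of the generated instructions are DP arithmetic
    // versus integer/SFU/branch/constant-load work.
    __global__ void dp_exp_kernel(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = exp(in[i]);   // DP exp from the CUDA math library
    }

    int main()
    {
        const int n = 1 << 20;
        double *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(double));
        cudaMalloc(&d_out, n * sizeof(double));
        cudaMemset(d_in, 0, n * sizeof(double));   // exp(0) = 1, keeps results finite

        dp_exp_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }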

In a nutshell, scaling between devices depends heavily on where the bottlenecks are in the code.

Yeah, the GPU operation modes can lead to nasty surprises. I’ve seen average GPU clocks ~20% lower than they should be due to a combination of GOM + application clock settings (officially not supported until the 352.xx drivers, though a “patched” NVML did enable it in the past).

Luckily, this silliness of crippling GeForce cards through software just because they’re not Quadro or Tesla (i.e. they do not cost thousands of dollars) seems to be coming to an end.

It appears you are advocating that DP capabilities be minimized on all consumer GPUs as unnecessary cost, and that clear market differentiation be baked right into the hardware to avoid “silliness”. From what I can see happening with Maxwell, it seems NVIDIA largely agrees …

No, I’m not sure why my comment read as me advocating even stronger market differentiation. I’m not a fan of the future dominated by strong differentiation between consumer and professional hardware that Maxwell is hinting at, but I understand that it’s a complex equation, with factors ranging from how to use silicon efficiently to which market pays the most.

However, I was referring to drivers and monitoring/management software having become a bit more sane recently. Features and capabilities that are clearly common between cards from the consumer and professional lines (e.g. GK110 in a GTX TITAN and a K20) were disabled in software - or simply not even considered when developing the software stack. For instance, the GTX TITAN was advertised as the ideal developer card for those who target the GK110 Teslas, but a number of factors showed just how little thought went into ensuring that this was not just a marketing statement:

  • the inability to make the card's clocking behavior consistent (without GOM and application clocks set to the maximum, the card runs slower and less consistently)
  • the lack of monitoring capability in NVML (until the 352-series drivers most nvidia-smi features were blocked for all GeForce cards; see the query sketch after this list)
  • the lack of a sane way to let its cooling function properly (the fan never goes above 60% and the card quickly throttles unless one runs an X server and uses the …).
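To make the NVML point concrete, here is a small, hypothetical query sketch (nvml.h ships with the GDK / recent CUDA toolkits; link against -lnvidia-ml). On GeForce boards and pre-352 drivers several of these calls simply return NVML_ERROR_NOT_SUPPORTED, which is exactly the limitation being complained about.

    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        unsigned int temp, fan, smClk, memClk;
        nvmlGpuOperationMode_t gomCurrent, gomPending;

        if (nvmlInit() != NVML_SUCCESS) return 1;
        if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

        if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
            printf("temperature: %u C\n", temp);
        if (nvmlDeviceGetFanSpeed(dev, &fan) == NVML_SUCCESS)
            printf("fan: %u %%\n", fan);
        if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClk) == NVML_SUCCESS &&
            nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClk) == NVML_SUCCESS)
            printf("current clocks: SM %u MHz, mem %u MHz\n", smClk, memClk);
        if (nvmlDeviceGetGpuOperationMode(dev, &gomCurrent, &gomPending) == NVML_SUCCESS)
            printf("GOM: current %d, pending %d\n", (int)gomCurrent, (int)gomPending);
        else
            printf("GOM query not supported on this board/driver\n");

        nvmlShutdown();
        return 0;
    }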

As soon as we have dynamic clocking based on a GPU’s temperature and power draw, clock behavior cannot be made consistent across cards: GPUs of the same type may operate at different temperatures (in a rack, the machines at the top generally tend to be warmer than those at the bottom), and manufacturing tolerances lead to variations in power draw. These manufacturing variations have increased as silicon feature sizes have shrunk: etching away +/- 2 atom layers matters less when a feature comprises 50 atom layers than when it comprises 10.

Application clocks can address this partially by dialing in a fixed clock for all GPUs, but there is still a possibility of hitting power or thermal limits that force throttling if the chosen application clock is high. Running all cards at default clocks, without any clock boosting, is one way of ensuring that all cards run consistently, without throttling, under all workloads, as long as vendor-specified environmental operating conditions are met.
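For reference, a hypothetical sketch of pinning application clocks through NVML (equivalent in spirit to nvidia-smi -ac). The 2600/758 MHz pair is just a placeholder; only clock pairs the driver reports as supported are accepted, the call requires administrative privileges, and on GeForce it only became officially available with the newer drivers mentioned earlier. After setting the clocks one can also ask the driver whether power or thermal limits are still pulling the card down:

    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        if (nvmlInit() != NVML_SUCCESS) return 1;
        if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

        // Memory clock first, then graphics clock, both in MHz (placeholder values;
        // query nvmlDeviceGetSupportedMemoryClocks/GraphicsClocks for valid pairs).
        nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, 2600, 758);
        if (r != NVML_SUCCESS)
            fprintf(stderr, "setting application clocks failed: %s\n", nvmlErrorString(r));

        // Check whether the card is currently being pulled below the requested clocks.
        unsigned long long reasons = 0;
        if (nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons) == NVML_SUCCESS) {
            if (reasons & nvmlClocksThrottleReasonSwPowerCap)
                printf("throttled: software power cap\n");
            if (reasons & nvmlClocksThrottleReasonHwSlowdown)
                printf("throttled: hardware slowdown (thermal or power brake)\n");
        }

        nvmlShutdown();
        return 0;
    }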

Market differentiation is one way to recoup the incremental cost of offering “compute”. As discussed in a forum thread just the other day, there are GPU hardware features that are needed exclusively or predominantly for “compute”, such as shared memory or double-precision units. That is added hardware cost. Then there is a large NRE cost for software. Other processor companies ask their customers to pay considerable sums for their compilers, tools, and libraries, which is a competing approach to defraying that cost.

Apparently, market differentiation rubs some people the wrong way when it is achieved with the help of software; they speak of “crippling”. The logical conclusion, it seems to me, is that once overall volume can justify the cost of doing so, it is preferable to bake the market differentiation into the hardware. While the end effect on customers is largely the same (more features cost more money), it fixes the PR issue. This may be what we are observing with Maxwell; at least, I tend to think that is the case.