Tesla K40 vs. Quadro M6000 vs. GeForce Titan X

Hi all,

I am new to this forum and after searching its topics (and before that, the web), I could not find an answer to my questions. So, I decided to create this new thread.

I am trying to evaluate which is the best graphics card for CUDA programming when only single-precision floating-point operations are important.

First, I will list the features and the pros & cons I found; then I will ask the questions I am in doubt about. I hope this is the proper forum for these questions and, more importantly, that someone can reliably answer them ;)

The pros (+) and cons (-) I found out about Tesla K40, Quadro M6000, GeForce Titan X are as follows:

Tesla K40 - Features: 2880 cores, 12 GB RAM, 288 GB/s, 384-bit bus, Boost Clock: 810-875 MHz, Core Clock: 745 MHz, 6 GHz GDDR5, 4.29 TFLOPS SP

Pros:

  • pure compute capabilities (no video output)
  • reliability (produced and certified by NVIDIA; long-term warranty; strenuous long-duration, zero-error-tolerance testing)
  • Memory error protection (ECC)
  • decent double precision performance (1.66 TFLOPS)
  • Hyper-Q
  • two DMA engines for bi-directional data copying (while a kernel is operating)
  • TCC driver for Windows

Cons:

  • “old” Kepler GPU architecture (a Tesla-Maxwell card will most likely not appear due to Maxwell's poor DP performance)
  • around 5000€

Quadro M6000 - Features: 3072 cores, 12 GB RAM, 317 GB/s, 384-bit bus, Boost Clock: 1140 MHz, Core Clock: 988 MHz, 6.6 GHz GDDR5, 6.07 TFLOPS SP

Pros:

  • reliability (produced and certified by NVIDIA; uncertain whether testing and warranty are equal to the Tesla's)
  • Memory error protection (ECC)
  • two DMA engines for bi-directional data copying (while a kernel is operating)
  • latest Maxwell-2 GPU architecture
  • highly tuned (video?) driver for professional applications
  • TCC driver for Windows

Cons:

  • poor double precision performance (0.19 TFLOPS)
  • around 6000 €

GeForce Titan X - Features: 3072 cores, 12 GB RAM, 336 GB/s, 384-bit bus, Boost Clock: 1075 MHz, Core Clock: 1000 MHz, 7 GHz GDDR5, 6.2 TFLOPS SP

Pros:

  • two DMA engines for bi-directional data copying (while a kernel is operating)
  • latest Maxwell-2 GPU architecture
  • 1250 €

Cons:

  • produced by 3rd party companies, not by NVIDIA
  • poor double precision performance (0.192 TFlops)
  • WDDM driver (Windows Display Driver Model) – might be not a problem when used as a secondary card without display?

Questions:

  1. Is it true that the Quadro M6000 indeed has “two DMA engines for bi-directional data copying (while a kernel is operating)”? I have not found reliable information about that.

  2. As I understand it, Hyper-Q is important when multiple CPUs/processes access the same single GPU. Is there something similar for Maxwell GPUs? Is it true that it is completely unrelated to SLI?
     (Was: As I understand it, Hyper-Q is important when using multiple GPUs in a CUDA compute environment. Is there something similar for Maxwell GPUs? Is it in any way related to SLI?)

  3. Is it true that, on Windows, the WDDM problems disappear for a GeForce (Titan X) card when it is used as a secondary card with no monitor connected? Can the Titan X be used with the TCC driver in such a case?

  4. When only single precision floating point operations are important, in hand-written CUDA programs or with CUDA libraries like Thrust, cuBLAS, etc., does a Maxwell card like the Quadro M6000 or GeForce Titan X achieve more performance than the Kepler-based Tesla K40? (I am aware that there is also a Tesla K80, which basically is a Tesla K40 with two GPUs. However, for a start, I just want to program on one GPU. A rough cuBLAS timing sketch follows after this list to illustrate the kind of workload I mean.)

  5. Is it possible at all to run a Quadro M6000 or GeForce Titan X in a server (not a desktop workstation) without a monitor? For development, it might make sense to start with a Titan X and later switch to a Tesla/Quadro when the product is mature enough.
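As an example of the kind of workload I have in mind for question 4, here is a rough single-precision SGEMM timing sketch with cuBLAS (matrix size and timing method are arbitrary; this is meant as an illustration, not a rigorous benchmark; link against cuBLAS):

```cpp
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 4096;                       // arbitrary benchmark size
    const size_t bytes = (size_t)n * n * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);                 // contents do not matter for a throughput test
    cudaMemset(dB, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up so the timed run excludes library/context initialization.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // SGEMM performs roughly 2*n^3 floating-point operations.
    printf("SGEMM %d x %d: %.1f GFLOPS\n", n, n, 2.0 * n * n * n / (ms * 1e6));

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```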

Thanks in advance for your help!

EDIT: Applied some corrections to question 2) and to pros/cons of Titan X due to helpful responses in this thread.

Maxwell 2 GeForce cards have 2 DMA engines.

https://devtalk.nvidia.com/default/topic/776043/cuda-programming-and-performance/whats-new-in-maxwell-sm_52-gtx-9xx-/post/4317593/#4317593

“Oh and one other interesting feature that I don’t think has been mentioned: ASYNC_ENGINE_COUNT=2 on this hardware. Not sure how that compares to say a 780Ti, but I think I remember it used to being only Tesla cards having two.”
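For reference, the easiest way to verify this on a given board is to query the device properties; the sketch below prints asyncEngineCount and shows the copy/compute overlap pattern the two engines enable (buffer names and sizes are made up, error checking omitted):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel, only there so something is executing while copies are in flight.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // asyncEngineCount == 2 means copies in both directions can overlap kernel
    // execution; 1 means only one copy direction at a time can overlap.
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);

    const int n = 1 << 20;
    float *h_in, *h_out, *d_a, *d_b;
    cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocDefault);  // pinned host memory is
    cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocDefault);  // required for true async copies
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));   // stands in for results from an earlier iteration

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // With two copy engines, the D2H copy in s2 can run at the same time as the
    // H2D copy and the kernel in s1.
    cudaMemcpyAsync(d_a, h_in, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n, 2.0f);
    cudaMemcpyAsync(h_out, d_b, n * sizeof(float), cudaMemcpyDeviceToHost, s2);

    cudaDeviceSynchronize();
    // cleanup omitted for brevity
    return 0;
}
```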

Hyper-Q is fully implemented on Maxwell as well. It has the same 32 queues as Kepler GK110/GK210.

http://www.anandtech.com/show/7764/the-nvidia-geforce-gtx-750-ti-and-gtx-750-review-maxwell/24

“The increased efficiency it affords improves performance alongside the other IPC improvements NVIDIA has worked in, plus it means that some of GK110’s more exotic features such as dynamic parallelism and HyperQ are now a baseline feature.”
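To illustrate what that means within a single process: kernels launched into different streams are no longer falsely serialized behind one another. A minimal sketch (kernel and sizes invented just for illustration):

```cpp
#include <cuda_runtime.h>

// Small, independent kernel; several instances can occupy the GPU at once
// when launched into different streams.
__global__ void busywork(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 1000; ++k)
            v = v * 1.000001f + 0.5f;
        x[i] = v;
    }
}

int main()
{
    const int nstreams = 8, n = 1 << 16;
    cudaStream_t streams[nstreams];
    float *buf[nstreams];

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
    }

    // With Hyper-Q (GK110/GK210 and Maxwell), these launches feed separate
    // hardware queues, so the independent kernels can execute concurrently
    // instead of being falsely serialized behind one another.
    for (int s = 0; s < nstreams; ++s)
        busywork<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);

    cudaDeviceSynchronize();
    for (int s = 0; s < nstreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}
```

As far as I know, feeding those queues from multiple host processes additionally requires the Multi-Process Service (MPS), and none of this has anything to do with SLI.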

From what I have seen (prior to Maxwell), the WDDM problems do not disappear even if the card is not being used for any display. I could not get the non-WDDM driver working (you are not really supposed to be able to), and for my project that was a massive issue.

IMO there is too much hype about the issues with the WDDM driver for Windows. Even with a single GPU (a GTX 780 Ti, for example) connected to the display, I have always been able to get better overall performance than in any distro of Linux, with less hassle.

From what I hear the only issue is when one has lots of really really small kernels, but if your kernel takes over 1 ms then it is not much of an issue.

Maybe I have been lucky, but at my work we mostly use Windows and have simulations running 24/7 with the WDDM with zero problems. I also think the Nvidia graphics drivers are better for Windows in terms of performance and ease of use.

Not a popular opinion I know, but after 3 years of working in both environments that is my conclusion.

In regards to the OP's question, IMO two GTX 980 GPUs (EVGA ACX Superclocked) will beat the three single GPUs you mentioned in 32-bit compute by a large margin.
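That assumes, of course, that the work is split across the boards explicitly; the usual pattern is one cudaSetDevice() call per chunk, roughly like this sketch (placeholder kernel and sizes):

```cpp
#include <cuda_runtime.h>

__global__ void process(float *x, int n)        // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);                  // e.g. 2 on a dual GTX 980 box

    const int ntotal = 1 << 24;
    const int chunk  = ntotal / ndev;           // assume it divides evenly here
    float *dbuf[16] = { 0 };

    // Issue work to each device; kernel launches are asynchronous, so the
    // devices work on their chunks in parallel.
    for (int d = 0; d < ndev && d < 16; ++d) {
        cudaSetDevice(d);                       // subsequent calls target device d
        cudaMalloc(&dbuf[d], chunk * sizeof(float));
        process<<<(chunk + 255) / 256, 256>>>(dbuf[d], chunk);
    }

    // Wait for all devices and clean up.
    for (int d = 0; d < ndev && d < 16; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(dbuf[d]);
    }
    return 0;
}
```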

From what I have seen, performance complaints related to the use of the WDDM driver have centered on two issues: (1) Larger overall overhead when an app uses many small kernels (2) Performance “jitter” due to the launch batching the CUDA driver performs to lower the average overhead of WDDM; this can affect apps with “soft real-time” requirements, for lack of a better word.

The rapid increase in GPU performance has caused the number of apps that fall into category (1) to increase as kernel run times have dropped. NVIDIA seems to have counteracted that to some degree by various optimizations in the driver stack over the past three years or so, but benchmark comparisons with non-WDDM platforms for each release show that there is still significant additional overhead on WDDM.
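That overhead is easy to get a feel for on a given machine by timing a burst of empty kernel launches, along the lines of this sketch (nothing WDDM-specific in the code itself):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty() {}        // does nothing; we only time the launch path

int main()
{
    const int launches = 10000;

    empty<<<1, 1>>>();            // warm up, create the context
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i)
        empty<<<1, 1>>>();
    cudaDeviceSynchronize();      // wait until all launches have drained
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average cost per launch: %.2f us\n", us / launches);
    return 0;
}
```

Under WDDM, the launch batching can make the per-launch average look acceptable even when individual launches show considerable jitter.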

I am intrigued by the remark that overall CUDA application performance supposedly is worse on Linux platforms. In nine years of CUDA use, I have never observed that (64-bit RHEL vs 64-bit WinXP followed by 64-bit Win 7). Have these differences been demonstrated on identical system platforms (e.g. dual boot) and conclusively linked to the GPU-compute portion of those applications? Is it possible that the performance differences are due to other application components?

In terms of using consumer cards for heavy-duty computing work, I would suggest proceeding with caution when it comes to “vendor overclocked” models that run at faster than NVIDIA specified clocks. You may want to qualify such hardware using an extended burn-in test to make sure correct results are delivered under the heaviest anticipated load.

I have not tested rigorously in a methodical manner, but at a minimum, for most of our applications (larger kernels which take more than 1 ms to complete), WDDM does not perform worse than the same system running Ubuntu. I attribute this to the drivers being a bit better for Windows due to the large gaming market, but that is just a theory.

For critical work where accuracy matters, or for commercial products, we use the Teslas with ECC. For research Monte Carlo simulations or image reconstruction/processing, the consumer GPUs have been reliable so far, with no problems as of yet. At least we can cover some ground with the consumer GPUs, narrow the problem space, then validate with the Teslas. So far we get the same results, but we will not count on it and will continue to test and verify.

I agree that WDDM should generally not be a problem when applications avoid launching myriad “tiny” kernels and when some execution time jitter due to launch batching is tolerable. The other limitation, not exclusive to WDDM, is the GUI watchdog timer, but there are registry hacks to change the timeout.

I am running a Quadro on my Windows 7 workstation at home and have yet to run into any issues because of WDDM. My exposure to WDDM issues is primarily through posts by CUDA users on Windows (other than Windows XP, which has a low-overhead driver model) in these forums.

Thanks all for your helpful responses!

Responding to the comments about WDDM and small kernels, that was exactly the scenario I was in. Tiny kernel (basic math on a 2D matrix), followed by an FFT, repeated about 20 times per image pass, for 100,000 passes. Going from Win 7 to Ubuntu literally doubled the speed, as the setup time of the kernels was simply longer than their execution, so after a few batches the app was just waiting on the driver to finish setup. I was hoping that dynamic parallelism would have led to a device-launchable FFT of some form, but unfortunately it never appeared. I fully agree that getting CUDA working on Windows is almost trivial, whereas on Linux (as with everything Linux) it requires a bit more fiddling.
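For what it is worth, one workaround for that launch-bound pattern (memory permitting) is to batch many images per round trip: one cufftPlanMany() plan covering the whole batch plus a single elementwise kernel over all images. A rough sketch; the image size, batch count, and the prep kernel are invented for illustration:

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

// Placeholder for the "basic math on a 2D matrix" step, applied to a whole
// batch of images in one launch instead of one launch per image.
__global__ void prep(cufftComplex *data, int total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total) {
        data[i].x *= 0.5f;   // illustrative scaling only
        data[i].y *= 0.5f;
    }
}

int main()
{
    const int nx = 512, ny = 512;     // assumed image size
    const int batch = 64;             // how many images per round trip
    const int total = nx * ny * batch;

    cufftComplex *d_data;
    cudaMalloc(&d_data, total * sizeof(cufftComplex));
    // (fill d_data with the batch of images here)

    // One plan that transforms all `batch` 2D images in a single call.
    cufftHandle plan;
    int dims[2] = { ny, nx };
    cufftPlanMany(&plan, 2, dims,
                  NULL, 1, nx * ny,   // default (contiguous) input layout
                  NULL, 1, nx * ny,   // default (contiguous) output layout
                  CUFFT_C2C, batch);

    // Two launches per round instead of two launches per image.
    prep<<<(total + 255) / 256, 256>>>(d_data, total);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaDeviceSynchronize();
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

Whether that helps depends on how much of the pipeline really is independent per image, but it cuts the number of launches by roughly the batch factor.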

From what I see in these forums, a good number of problems getting CUDA up and running on Linux seem to have to do with the installers used with some distros. I notice that txbob recommends installation via .run file in those cases. I have used .run file installation of CUDA exclusively for many CUDA installations over the past seven or so years, and found it to be easy and painless.

The second class of potential installation issues on Linux seems to have to do with the barriers incorporated into some distros that make it more difficult to install “evil proprietary” drivers. As I see it, those issues are owed to ideology, not technology. I noted that CUDA users have several choices when it comes to supported Linux distros.

As a largely OS-agnostic user of both Windows and Linux platforms for many years, I find both platforms equally easy to use with CUDA. They just differ as to the nature of minor annoyances.

Don't misunderstand me, I am not saying getting CUDA working on Ubuntu was hard, just more so than on Windows, simply because of the nature of the OS. Windows literally is just write and go, whereas Ubuntu means fighting with Nouveau, making sure you don't break OpenGL, managing the X server, etc. Unfortunately that is outside of what NVIDIA can fix.

I will still do my weekly prayer for device launchable cuFFT ;)

When executing the command ‘nvidia-smi.exe -h’, I get the following output:

"NVIDIA System Management Interface – v347.88

Supported products:

  • Full Support
    • All Tesla products, starting with the Fermi architecture
    • All Quadro products, starting with the Fermi architecture
    • All GRID products, starting with the Kepler architecture
    • GeForce Titan products, starting with the Kepler architecture
  • Limited Support
    • All Geforce products, starting with the Fermi architecture
      nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] …"

The point here is: “Full Support” for “GeForce Titan products”. Does that mean the Titan can be used in TCC driver mode on Windows?

Can anybody definitely confirm (or the opposite) that a Titan X can be run in TCC Mode?

Thanks!

The Titan X cannot (officially) be run in TCC mode, or at least it naturally uses the WDDM in Windows.
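For what it is worth, the driver model the CUDA runtime actually sees can be checked from code; cudaDeviceProp has a tccDriver field (minimal sketch):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // prop.tccDriver is 1 when the device is running under the TCC
        // driver and 0 otherwise (e.g. WDDM on Windows).
        printf("GPU %d (%s): tccDriver = %d\n", d, prop.name, prop.tccDriver);
    }
    return 0;
}
```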

Once the kernel is actually launched, which OS is used, or TCC vs. WDDM, does not seem to matter. If you have a lesser GPU driving the display and a better one in a different slot with no video out, that second GPU can be used for compute in Windows without hanging the machine or causing any issues. CUDA generally will enumerate the better GPU as GPU #0 and use that for compute.
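If you would rather not rely on the default enumeration order, you can also pick the device explicitly; the sketch below selects the device with the most multiprocessors, which on such a setup is typically the compute card (that selection criterion is just one reasonable choice):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Select the CUDA device with the most multiprocessors; on a mixed
// display/compute setup that is typically the dedicated compute card.
int pick_biggest_device()
{
    int ndev = 0, best = 0, best_sms = -1;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (prop.multiProcessorCount > best_sms) {
            best_sms = prop.multiProcessorCount;
            best = d;
        }
    }
    cudaSetDevice(best);
    return best;
}

int main()
{
    printf("using device %d\n", pick_biggest_device());
    return 0;
}
```

Alternatively, the CUDA_VISIBLE_DEVICES environment variable can be used to hide the display card from the application entirely.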