[Multiple GPUs / Processes] CUDA Memory De/Allocation Slow

Well, there might be an economic incentive for Microsoft to change (or at least adapt) the design goals of their default driver model.

The deficiencies of their WDDM model (long kernel launch times, cudaMalloc slower by an order of magnitude with recent drivers) push even more customers / companies in the media & entertainment industry (editing, post-production, color correction, restoration, …) to Linux.
In that industry, GPU acceleration is standard nowadays, and you get much better CUDA performance on Linux. But maybe Microsoft is just fine with that :-)

And, as I also mentioned in my bug report to NVIDIA regarding slow cudaMalloc: if you are using a post-production suite like Nuke, you have to keep at least one GPU in WDDM mode because you need it for the OpenGL rendering in the suite. So, imagine how happy a customer is when he buys a Quadro P6000 for 5,000 euro and then notices that GPU acceleration does not bring the expected benefits.

I suppose NVIDIA has good connections to Microsoft, so maybe they should simply tell them to try to reduce those performance deficiencies in a future WDDM version 3.0.

Yes, I understood this, but WDDM+TCC means for me that the WDDM one is useless for computations :( This can be seen in the first and second profiler screenshots: utilization is under 10%, so the far-from-cheap but impressive power of the P6000 is simply wasted.
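For anyone following along, the WDDM/TCC split described above is configured per GPU with nvidia-smi. A sketch (my assumptions: a TCC-capable board such as a Tesla or Quadro, an elevated command prompt on Windows, and GPU index 1 as the compute card; a reboot is needed for the change to apply):

```shell
# List the GPUs in the system and their indices:
nvidia-smi -L

# Show the current driver model (WDDM or TCC) of each GPU:
nvidia-smi -q | findstr "Driver Model"

# Put GPU 1 into TCC mode for compute, leaving GPU 0 in WDDM
# so it can still drive the display / OpenGL. Reboot afterwards.
nvidia-smi -i 1 -dm TCC
```

Note that a GPU in TCC mode no longer appears to the OS as a display adapter, which is exactly why the suite's OpenGL rendering forces at least one card to stay in WDDM mode.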

Ok, I will report this bug in the near future.

But in WDDM 1.x (up to driver 347) everything seems to work just fine, so this is a regression. And why not collaborate with Microsoft to make the world better for NVIDIA GPUs on the Windows platform, and for Windows in the case of GPU computations?

It seems that this way NVIDIA would save the resources and effort of supporting two driver models.

The presumption, I guess, is that NVIDIA (intentionally!) doesn't collaborate with Microsoft. Or that they do collaborate with Microsoft but simply forgot to discuss driver efficiency.

I guess the presumption is that if NVIDIA simply told Microsoft to do something, it would produce the desired benefit/outcome.

It’s not possible to recap all the challenges or competing technological goals and desires of NVIDIA and Microsoft, or to review in a public forum the nature of communication between NVIDIA and Microsoft. My suggestion would be to file bugs for desired changes.

But hey! This is one of the most popular OSes in the world! And driver efficiency seems to be quite important for a GPU company :-) I am really confused :-D

Yes, I will file this bug :(

When companies partner, like NVIDIA and Microsoft in graphics or AMD and Microsoft in the CPU field (“Designed for Windows” used to be printed on all AMD CPUs), they tend to discuss many issues on an ongoing basis.

Partners do not normally comment publicly on the nature or intensity of these discussions; those are confidential business matters. It is logical to assume that when two for-profit companies partner, they still pursue their own economic goals to the maximum extent possible. It is also logical to assume that the forcefulness with which they are able to pursue those goals is positively correlated with each company’s economic strength.

Microsoft made changes between WDDM 1.x and WDDM 2.0. Only Microsoft can answer the question why they made those changes. I do not know enough about the details to even speculate. Microsoft is a giant company with long-term economic success. That is a pretty solid indication they know what is in their best interest.

Hi, Dec. 4, 2017

I would like to add to this thread. We have a Supermicro GPU server which we use for running in-house HPC applications. The server runs the Win7 OS.

We have 8 Nvidia GPUs, consisting of a P100, 3x K40c, 2x K20x, and 2x K20c cards.
We noticed that when we replace one of the K20c cards with an Nvidia K80 card there is an enormous slowdown of the code, similar to what is being reported by Adam.

The slowdown occurs when there are multiple jobs running with certain memory-use attributes; i.e., for very low or very high memory use the slowdown is not as pronounced. However, when the memory use is intermediate (i.e., array sizes (complex DP) are in the 131072-262144 range), the slowdown is profound: it can be more than a factor of 100. Needless to say, we are not able to use the server with the K80 card.
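For context (my own arithmetic, not from the report above): a complex double-precision element is 16 bytes (two 8-byte doubles), so arrays in the quoted 131072-262144 element range occupy 2-4 MiB each. In other words, the slowdown hits per-job allocations of a very ordinary size:

```python
# Footprint of a complex double-precision (complex DP) array:
# 16 bytes per element = two IEEE-754 doubles (real + imaginary part).
BYTES_PER_COMPLEX_DP = 16

def footprint_mib(n_elements: int) -> float:
    """Size in MiB of an n-element complex DP array."""
    return n_elements * BYTES_PER_COMPLEX_DP / (1024 * 1024)

print(footprint_mib(131072))  # 2.0  (MiB, lower end of the problem range)
print(footprint_mib(262144))  # 4.0  (MiB, upper end of the problem range)
```

So the pathological case is not huge allocations exhausting the cards, but many concurrent jobs each doing megabyte-scale de/allocations, which is consistent with a driver-side allocation-overhead problem rather than a capacity problem.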

We tried various drivers and compiler versions, all to no avail. At this point we are resigned to returning the K80 card. But perhaps someone out there can share their experience and/or a solution?