TF and PyTorch are slower on Windows than on Linux

I have opened 2 bugs, which have been confirmed.

I think it’s because of WDDM. Of course, I’m having difficulty finding information on which cards support TCC. If I go buy a cheap P400, will that support it? It pisses me off that I can’t do it on my $1,400 2080 Ti.

A single search for “nvidia cards that support TCC” brings up:

Does it, though? Did you actually read anything you linked? The laptop one is completely unrelated, and in the Stack Overflow post they say Quadro cards, which is why I asked about the P400. But who knows?

Most companies publish a compatibility matrix for their products. I shouldn’t have to rely on random people on Stack Overflow.

Did you read the first post? It tells you exactly which class of cards supports TCC.
I can see people are increasingly ignoring childish posts such as this one and your other one ([url][/url]), plus some other cross-posts I’ve seen here and there about this P400.
I think I should just do the same…

I posted a legitimate bug in this one and expressed my frustration with continued anti-consumer practices. It’s not childish; there’s literally no other venue to express frustration at arbitrary decisions. The fact that people can flash custom firmware onto older cards and modify drivers for newer ones clearly shows it’s not a hardware limitation.

“TCC mode should be available for Tesla GPUs, most Quadro desktop GPUs, and GeForce Titan family”

I’m sorry, but I don’t want to spend my money testing a bug theory based on some guy’s forum post where he says it should work. How about some official documentation?

There isn’t a chart anywhere published by NVIDIA that details which cards support TCC and which don’t.

If you want to see a change to CUDA, whether that be performance, behavior, or documentation, I suggest filing a bug. The directions are linked at the top of this forum in a sticky post.

To set expectations: NVIDIA works on things according to its own priorities. Not all filed bugs get pursued to the same extent.

As a rough rule, you can usually expect all Quadro cards of a generation equal to or higher than the Quadro 4000 (so, for the current ‘Turing’ generation, the Quadro RTX 4000 / 5000 / 6000 / 8000) to support TCC. That is at least our experience so far. I don’t think the Quadro P400 supports TCC.
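Rather than buying a card to test the theory, you can ask the driver directly. This is a sketch assuming a reasonably recent driver with `nvidia-smi` on the PATH; on Windows the `driver_model.current` query field reports WDDM or TCC for each GPU:

```shell
# Query the current driver model for each GPU. On Windows this prints WDDM
# or TCC; on Linux the field is N/A because the distinction doesn't exist.
if command -v nvidia-smi >/dev/null 2>&1; then
    model=$(nvidia-smi --query-gpu=name,driver_model.current --format=csv,noheader)
else
    model="unavailable (nvidia-smi not found on this machine)"
fi
echo "Driver model: $model"

# To attempt switching GPU 0 to TCC (requires admin rights, and errors out
# on cards where the driver does not enable TCC):
#   nvidia-smi -i 0 -dm TCC
```

If the `-dm TCC` command errors out, that is the quickest confirmation that a given card is locked to WDDM.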

The decision as to which cards get TCC is, I suppose, more one of market segmentation (and testing, support & qualification) than of hardware limitations. The same applies to double-precision capabilities, etc. Any company is of course free to segment the market as it sees fit; Intel does the same with Core / Xeon, for example.

It seems that PyTorch is not well optimized for the specifics of WDDM (which rewards avoiding short kernel launches and repeated memory allocations). Regarding YOLOv3, it might be better to switch to the darknet framework (GitHub - pjreddie/darknet: Convolutional Neural Networks), which provides fast inference (using cuDNN internally) on Windows as well.
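The allocation half of that advice is easy to illustrate. The sketch below uses NumPy rather than PyTorch (the function names are mine, not any framework’s API), but the pattern is the same one that reportedly helps under WDDM: allocate working buffers once and write into them, instead of creating a fresh array on every step:

```python
import numpy as np

def step_naive(x):
    # Allocates a brand-new output array on every call; on a GPU under
    # WDDM, the analogous pattern triggers OS-managed allocations.
    return x * 2.0 + 1.0

def make_step_preallocated(shape):
    out = np.empty(shape)  # allocated once, reused on every call
    def step(x):
        np.multiply(x, 2.0, out=out)  # write into the reused buffer
        np.add(out, 1.0, out=out)
        return out
    return step

x = np.ones((4, 4))
step = make_step_preallocated(x.shape)
# Both variants compute the same result; only the allocation behavior differs.
assert np.array_equal(step_naive(x), step(x))
```

In PyTorch the analogous trick is reusing tensors (for example via the `out=` argument that many operations accept) and batching work into fewer, larger kernel launches.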

My own observations match (or at least: do not contradict) HannesF99’s rule of thumb as far as TCC support is concerned. As with any rule of thumb, there can be no guarantees.

It might help all CUDA users stuck with WDDM (for whatever reason) if as many people as possible complained to Microsoft about the poor performance of the WDDM driver model, for example by pointing out that it is not performance-competitive with the Linux driver model.