Discrepancy Between Performance in Windows Visual Studio vs. Linux: 130+ GFLOPS vs. 4+ GFLOPS

I have two FFT codes that perform extremely differently on the following setups:

Windows 7 64 Bit
NVIDIA Geforce 560 TI
Visual Studio w/ Parallel NSight
FFT Performance: 4+ GFLOPS

Linux 64 Bit
NVIDIA Tesla C2050
FFT Performance: 120+ GFLOPS

Both codes are identical and are compiled similarly. Why the significant difference in GFLOPS? I understand the Tesla C2050 is a workstation/performance-oriented card, but I should still be able to get decent GFLOPS on the GeForce card. The files are attached; for Windows you can open the Visual Studio solution. For Linux, you can build using the attached Makefile.
FFT_LITE_LINUX.zip (27.6 KB)
FFT_LITE.zip (2.54 MB)

There is a factor of 4 in double-precision performance between gaming cards and Tesla cards, and that difference is per core. There is also a difference in core count. The 560 Ti is a compute capability 2.1 part with 384 cores, so in practice it loses a factor of 1.5: if the FFT is not optimized for 2.1, only 256 of its cores will be used. The Tesla card has 448 cores.

FFT is both compute-bound and memory-bound, so memory bandwidth also matters, but I do not know how large the difference between the two cards is.

Of course, the differences in core count and double-precision units would account for maybe up to a factor of 10. I did not check the core clock speed, which might be higher on the 560. The rest must come from memory and the Windows side.
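To put numbers on that estimate, here is a rough sketch of the arithmetic. It only multiplies the figures quoted in this thread (the factor of 4 per core, 448 Tesla cores, 256 GeForce cores in use, and the reported 120 and 4 GFLOPS); the function names are made up for illustration, and this is a back-of-the-envelope check, not a performance model.

```cpp
#include <cstdio>

// Figures quoted in the posts above; treat them as estimates.
constexpr double kDpPerCoreRatio   = 4.0;    // Tesla vs. GeForce double precision, per core
constexpr double kTeslaCores       = 448.0;  // Tesla C2050
constexpr double kGeforceCoresUsed = 256.0;  // 560 Ti cores used if code is not tuned for 2.1

// Hypothetical helper: slowdown explained by hardware differences alone.
constexpr double hardware_slowdown() {
    return kDpPerCoreRatio * (kTeslaCores / kGeforceCoresUsed);
}

// Hypothetical helper: slowdown actually reported in the original post.
constexpr double observed_slowdown() {
    return 120.0 / 4.0;
}

// Hypothetical helper: prints both figures side by side.
inline void print_slowdown_summary() {
    std::printf("hardware estimate: %.1fx, observed: %.1fx\n",
                hardware_slowdown(), observed_slowdown());
}
```

Calling `print_slowdown_summary()` makes the gap visible: the hardware figures give roughly a factor of 7, while the reported numbers differ by a factor of 30, so something other than the card has to account for the remainder.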

Have you tried swapping the cards, putting the Tesla in the Windows machine and the GeForce in the Linux machine? That may be a bit of a hassle, but it would make for a good comparison. On Windows, the Tesla card can also run with a driver mode (TCC) that avoids the issues related to WDDM.

I installed CUDA on Ubuntu 11.10 with the Geforce 560 Ti card. I am now getting 107+ GFLOPS for the FFT_LITE project that I prototyped.

I’m not sure what the discrepancy is, but there may be something faulty in my Windows solution that’s preventing me from getting the GFLOPS I expect. Thanks for the advice.

EDIT: After further investigation, I compiled from the Windows command line. The nvcc-compiled FFT_LITE ran at the 106+ GFLOPS I was expecting. There is probably an optimization phase in Visual Studio that is hindering my Visual Studio build.

TL;DR: Something is wrong with my Visual Studio configuration, causing ~4 GFLOPS when the true performance is 106 GFLOPS. Running on Linux, or compiling at the Windows command prompt with nvcc, showed the code's true performance of 106 GFLOPS.

I managed to get 96 GFLOPS (up from 4.7 GFLOPS) when compiling with Visual Studio by telling it not to include GPU debug information.

Right-click kernel.cu and open its properties, select Configuration Properties | CUDA C/C++ | Device in the left pane, and in the right pane set “Generate GPU Debug Information” to “No”.
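That setting corresponds to nvcc's `-G` (device debug) flag, which turns off optimization of device code. As a guard against benchmarking such a build by accident, here is a minimal sketch. It assumes the `__CUDACC_DEBUG__` macro that nvcc documents as defined when `-G` is in effect; whether older toolkits define it is an assumption, and both helper names are hypothetical.

```cpp
#include <cstdio>

// Hypothetical helper: true if this file was compiled by nvcc with -G.
// Assumes nvcc defines __CUDACC_DEBUG__ in device-debug mode; with a plain
// C++ compiler (or nvcc without -G) the macro is absent and this returns false.
inline bool built_with_device_debug() {
#ifdef __CUDACC_DEBUG__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: call once before timing kernels.
// Returns true (and prints a warning) if the GFLOPS figures would be
// measured on unoptimized device code.
inline bool warn_if_device_debug_build() {
    if (built_with_device_debug()) {
        std::fprintf(stderr,
                     "warning: built with GPU debug information (-G); "
                     "device code is unoptimized, timings are not representative\n");
        return true;
    }
    return false;
}
```

Calling `warn_if_device_debug_build()` at the start of the benchmark would flag the Visual Studio Debug configuration described above, while a release build or a plain `nvcc` command-line build stays silent.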