GeForce GTX 580 giving NaNs while Tesla C2050 gives correct output.

I’m working on a project using CUDA 3.2, developed on Windows 7 with Visual Studio. I am having a weird problem involving a GeForce GTX 580 and a Tesla C2050 (each in a different machine): the Tesla GPU returns correct floating-point numbers to the host code, while the GTX 580 returns a series of NaNs (Not a Number).

Note:

  • The GPUs are being used for parallel computing rather than for graphics.
  • The code on each machine is identical, since the project folder was simply copied over.
  • The code did not produce correct floating-point results until it was moved from the GTX 580 machine to the Tesla machine.

Does anyone know how this problem could be caused by the difference in graphics cards?

Hello,

Some GTX cards are factory-overclocked, which can increase the rate of random memory bit flips; it depends on the board manufacturer. The Tesla card has ECC memory, which detects and corrects such flips. I recommend running the debugger; maybe there is an error in the code.

(1) Does the code check the status of every CUDA API call, and every kernel launch?
(2) Does the code run cleanly through cuda-memcheck?
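As a sketch of item (1), here is one common pattern for checking every API call and kernel launch. The macro name `CUDA_CHECK` and the kernel `scale` are invented for this example; note that on CUDA 3.2 the synchronization call is `cudaThreadSynchronize()` rather than the later `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if any CUDA API call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // bounds guard for the partially filled last block
        x[i] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_x, n * sizeof(float))); // catches failed allocations
    CUDA_CHECK(cudaMemset(d_x, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    CUDA_CHECK(cudaGetLastError());          // launch errors (bad configuration)
    CUDA_CHECK(cudaThreadSynchronize());     // execution errors (e.g. watchdog kill)

    CUDA_CHECK(cudaFree(d_x));
    return 0;
}
```

With checks like these, a kernel terminated by the watchdog timer shows up as an explicit "the launch timed out" error instead of silently leaving garbage in the output buffers.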

A likely failure scenario is that an allocation is failing, or that one of the kernels hits a timeout and is terminated by the watchdog timer. Many other software failure scenarios are possible, such as race conditions in device code or errors in the host code (have you tried valgrind?).

It is also possible that there is something simple like an out-of-bounds access, which can give different results depending on the compiler version or CUDA toolkit, so I would also verify that the results on the Tesla card are correct.
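To illustrate the out-of-bounds point (kernel names invented for this sketch): the grid is usually rounded up to a multiple of the block size, so the last block contains threads past the end of the array. Without a guard, those threads read and write memory the allocation does not own, and whether that yields NaNs, garbage, or apparently correct output depends on whatever happens to sit there.

```cuda
// Suppose n = 1000 and the launch is <<<4, 256>>> = 1024 threads.
__global__ void bad_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];      // threads 1000..1023 access past the end of the arrays
}

__global__ void good_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard: the extra threads simply do nothing
        out[i] = in[i];
}
```

cuda-memcheck flags the first version with the exact thread and address of the bad access, which is why running cleanly through it is worth confirming on both machines.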

The results from the Tesla card are correct. They match what is expected when the program is run.

I guess I should also note that the behaviour of the program on the Tesla machine is quite stable, while the results from the GTX 580 machine are quite erratic. Almost every run seems to change the output: sometimes it produces nothing but NaNs, while other times I get what seems to be a correct result but turns out to be very wrong.

This is a reasonable indication that “random data” (due to an out-of-bounds access, uninitialized data, or a race condition) is being picked up.

Have you had a chance to follow up on the two check items I listed above? The only way to get to the bottom of such issues is to systematically eliminate likely causes, starting with the most likely ones. cuda-memcheck also supports checking for certain race conditions in its latest version, although I think this functionality may not be supported on all platforms due to hardware limitations. valgrind, or some equivalent tool on Windows, can tell you about uninitialized data and out-of-bounds accesses in the host portion of the code.

Make sure to use the latest CUDA software and a recent driver. To give an idea where drivers are at the moment, my recently updated 64-bit Win7 system here reports running driver version 311.35.