GeForce GTX 580 giving NaNs while Tesla C2050 gives correct output.

I’m working on a project using CUDA 3.2, developed on Windows 7 with Visual Studio. I am having a weird problem involving a GeForce GTX 580 and a Tesla C2050 (each in a different machine): the Tesla GPU returns correct floating-point numbers to the host code, while the GTX 580 returns a series of NaNs (Not a Number).

Note:

  • The GPUs are being used for parallel computing rather than for graphics.
  • The code on each machine is identical, since the project folder was simply copied over.
  • The code did not produce correct floating-point results until it was moved from the GTX 580 machine to the Tesla machine.

Does anyone know how this problem could be caused by the difference in graphics cards?

Hello,

Some GTX cards are factory-overclocked, which can increase the rate of random memory bit flips; it depends on the board manufacturer. The Tesla card has ECC memory, which detects and corrects such flips. I recommend running the debugger; maybe there is an error in the code.

(1) Does the code check the status of every CUDA API call, and every kernel launch?
(2) Does the code run cleanly through cuda-memcheck?
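As a sketch of item (1), here is one common pattern for checking every API call and kernel launch. The macro name `CUDA_CHECK` and the kernel `scale` are invented for this example; note that on CUDA 3.2 the synchronization call is `cudaThreadSynchronize()` rather than the later `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if any CUDA API call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // bounds guard for the partially filled last block
        x[i] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_x, n * sizeof(float))); // catches failed allocations
    CUDA_CHECK(cudaMemset(d_x, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    CUDA_CHECK(cudaGetLastError());          // launch errors (bad configuration)
    CUDA_CHECK(cudaThreadSynchronize());     // execution errors (e.g. watchdog kill)

    CUDA_CHECK(cudaFree(d_x));
    return 0;
}
```

With checks like these, a kernel terminated by the watchdog timer shows up as an explicit "the launch timed out" error instead of silently leaving garbage in the output buffers.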

A likely failure scenario is that an allocation is failing, or that one of the kernels hits a timeout and is terminated by the watchdog timer. Many other software failure scenarios are possible, such as race conditions in device code or errors in the host code (have you tried valgrind?).

It is also possible that there is something simple like an out-of-bounds access, which can give different results depending on the compiler version or CUDA toolkit, so I would also verify that the results on the Tesla card are correct.
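To illustrate the out-of-bounds point (kernel names invented for this sketch): the grid is usually rounded up to a multiple of the block size, so the last block contains threads past the end of the array. Without a guard, those threads read and write memory the allocation does not own, and whether that yields NaNs, garbage, or apparently correct output depends on whatever happens to sit there.

```cuda
// Suppose n = 1000 and the launch is <<<4, 256>>> = 1024 threads.
__global__ void bad_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];      // threads 1000..1023 access past the end of the arrays
}

__global__ void good_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard: the extra threads simply do nothing
        out[i] = in[i];
}
```

cuda-memcheck flags the first version with the exact thread and address of the bad access, which is why running cleanly through it is worth confirming on both machines.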

The results from the Tesla card are correct. They match what is expected when the program is run.

I guess I should also note that the behaviour of the program on the Tesla machine is quite stable, while the results from the GTX 580 machine are quite erratic. Almost every run seems to change the output: sometimes it produces nothing but NaNs, while other times I get what seems to be a correct result but turns out to be very wrong.

This is a reasonable indication that “random data” (due to an out-of-bounds access, uninitialized data, or a race condition) is being picked up.

Have you had a chance to follow up on the two check items I listed above? The only way to get to the bottom of such issues is to systematically eliminate likely causes, starting with the most likely ones. cuda-memcheck also supports checking for certain race conditions in its latest version, although I think this functionality may not be supported on all platforms due to hardware limitations. valgrind, or some equivalent tool on Windows, can tell you about uninitialized data and out-of-bounds accesses in the host portion of the code.

Make sure to use the latest CUDA software and a recent driver. To give an idea where drivers are at the moment, my recently updated 64-bit Win7 system here reports running driver version 311.35.