CUDA precision of desktop GPU

Hi all,
I have a question concerning the precision/stability of desktop GPGPU.
I have a GTX 560 Ti and use CUDA to train multi-layer perceptrons.
Typically single precision is fine for me. I know that GPU float computation
can differ slightly from the CPU one, but this has never been a problem,
as long as the precision stays the same over time.

I tried to test the stability of the computation by running the
matrix multiplication example “matrixMul” from the 4.1.28 SDK in a loop.
This tool does a cross-check against a CPU implementation, so I can detect if
the GPU starts to compute something strange.

So I ran the test, and after 8 days of full load, the precision started to decrease:
Listing first 100 Differences > 0.000010…

Row 0:
Loc(0,0) CPU=163.09207 GPU=163.09212 Diff=0.000046
Loc(2,0) CPU=168.34337 GPU=168.34328 Diff=0.000092
Loc(3,0) CPU=156.45810 GPU=156.45802 Diff=0.000076
Loc(4,0) CPU=162.84628 GPU=162.84631 Diff=0.000031
Loc(5,0) CPU=161.11246 GPU=161.11253 Diff=0.000076
Loc(6,0) CPU=164.38638 GPU=164.38628 Diff=0.000107

My question is: is this normal? Or is it a flaw that can be used
as a reason to replace the card with another? Also, is it certain that this
kind of problem does not occur if I use Tesla GPUs?

Is there some optional automatic mechanism that will ensure that the
output of a GPGPU computation is accurate? (It is a bit scary not to be
able to trust and reproduce the results due to time-varying
precision of the GPU.)

Thank you in advance for any comments/suggestions,
Karel

Tesla products are designed for reliable, continuous, high-performance computation. Please refer to the following webpage: http://www.nvidia.com/object/why-choose-tesla.html

Hello,

At this point GPUs have native support for both single and double precision. On compute capability 2.0 double precision is 8 times slower than single precision, on 2.1 it is 12 times slower, and I think on 3.0 it is 24 times slower, while on Tesla it is only 2 times slower! The “accurate result” depends on what you need and what you use. If double precision is enough, then the results from the GPU will be “accurate” if you use double precision. By comparison, the CPU does everything internally in 80-bit precision, which one would call long double.

Hello,
thank you for the message; the differences are interesting. In my case it is sufficient to use 32-bit float precision. But what I have observed is that the precision varies over time if the GPU is used intensively for a long period. Moreover, the GPU does not detect this problem itself and reports a correct run, which is unexpected and a bit disappointing…

nvcc gives:
Error limit reached.
100 errors detected in the compilation of /tmp/tmpxft_000039d5_00000000-6_test_nvcc_many_kernels_together.cpp1.ii
I want to get all the errors listed.
What is the command line option to increase the error_limit above 100?
Thank you
Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

I forgot to mention that memory error correction (ECC) is missing on the GeForce cards. If your code runs for so long, it is possible that 1 bit is flipped (changes value) randomly.

Have you looked at the temperature of the card? The rate of bit errors can go up suddenly if the card gets too hot. I’ve seen reports of problems showing up at 90C and higher.

You might also want to run a GPU memory tester (similar to memtest86). Here is one such program:

This will test whether there are permanently stuck bits at some memory locations. This could produce an apparent time variation if the program allocates and deallocates GPU memory as it runs, and it takes the memory allocator some amount of time to reach the bad section of memory.

It appears that adding --compiler-options -fmax-errors=0 to the nvcc command line will
make it report all the errors. However, under Unix this needs a gcc compiler version
later than 4.6, and since I have not upgraded gcc yet, I have yet to test it.
Bill

The data shown in the original post is completely inconclusive. There are differences between the host and the GPU results, but one cannot tell which ones are more correct. One can also not tell whether any of the differences are due to something “bad” happening on the GPU. Further data would be required to support that hypothesis. Alternative hypotheses are equally likely, in my experience. For example, there could be faulty system memory in the host.

There are a few things to check for in a consumer-grade PC (independent of GPUs).

(1) Is there an adequate power supply? Local “brown-outs” can lead to faulty computations, as transistor speed decreases with voltage.

(2) Is there adequate air flow / cooling? This is not just a question of fans but also of unobstructed air flow inside the enclosure. Electronic components become less reliable as temperature increases.

(3) Do all connectors make good contact (CPU socket, DIMM sockets, PCIe plug-in cards)? Factors such as vibration can negatively affect the reliability of connectors.

(4) Are any components (in particular, CPUs or GPUs) overclocked? All electronic circuits have a maximum frequency at which they operate reliably. In some cases components are “factory overclocked”, i.e. the frequencies of the shipping product exceed the reference frequencies established by the processor vendor. The reference clock settings for GPUs can be found on the NVIDIA website; for the GTX 560 Ti see http://www.nvidia.com/object/product-geforce-gtx-560ti-gtx-550ti-us.html

(5) Does the memory (system memory, GPU memory) have ECC protection? All memory is subject to exposure to “cosmic rays” that can cause random bit flips. The larger the memory and the longer the period of observation, the more likely is one to encounter such an event. Only ECC-protected memory can detect (and sometimes, correct) such transient errors. Any one-bit difference can spread to other data via subsequent computation.

Yes, your test is completely inconclusive, as we don’t have a perfect reference. Or do we? Either one of (A) the CPU result and (B) the GPU result is wrong but not the other, or both (A) and (B) are wrong at the same time.

I’ve been involved with projects where people were looking for errors on the GPU when the actual cause was reason number 5 that njuffa listed above. Given a large enough memory and enough time, it becomes statistically likely for non-ECC systems to be affected.

Conclusion: stick to professional hardware in deployments; if verifying on consumer CPU/GPU motherboards, you need to take potential electromechanical issues into account.