Same code gives different results on two NVIDIA 2080 Ti GPUs

I have two identical 2080 Ti cards installed in one desktop for general-purpose computing, and my code is written using the cuBLAS library. The machine is only a couple of weeks old. Everything looked fine until three days ago, when I started getting errors running the code on one of the cards. I then switched to the other card using cudaSetDevice() and ran the same code, and no error occurred.

What could be the reason for this problem? Did I break one of the cards?
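For reference, below is a minimal sketch of the kind of cross-device comparison I am describing (not my actual code; the SGEMM call, matrix size, and input values are just placeholders): it runs the same cuBLAS computation on each device via cudaSetDevice() and compares the results on the host.

```cpp
// Minimal sketch, not the actual application: run the same SGEMM on each
// device with cudaSetDevice() and compare the results on the host.
// Matrix size and input values are arbitrary placeholders.
#include <cstdio>
#include <vector>
#include <cmath>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static std::vector<float> gemm_on_device(int dev, const std::vector<float>& A,
                                         const std::vector<float>& B, int n) {
    cudaSetDevice(dev);
    float *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
    cudaMemset(dC, 0, bytes);                       // start from a known state
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(h);

    std::vector<float> C(n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}

int main() {
    const int n = 256;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f);
    std::vector<float> c0 = gemm_on_device(0, A, B, n);
    std::vector<float> c1 = gemm_on_device(1, A, B, n);
    float maxdiff = 0.0f;
    for (size_t i = 0; i < c0.size(); ++i)
        maxdiff = std::fmax(maxdiff, std::fabs(c0[i] - c1[i]));
    printf("max difference between device 0 and device 1: %g\n", maxdiff);
    return 0;
}
```

With two identical, healthy cards and the same library version, I would expect the reported difference to be zero.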

Three things come to mind here:

  1. There is the possibility that your code does not initialize allocated memory to a known state.

Depending on the state the card’s memory was in previously, this can produce different computation results. However, this kind of problem would typically manifest itself randomly on any GPU, not just on a single device as you observed. (A short sketch after this list shows how to initialize allocations to a known state and check for errors.)

Try running the sample programs from the CUDA Toolkit on the affected GPU to see whether the built-in self-tests pass.

You may be able to set the environment variable CUDA_VISIBLE_DEVICES before running the samples, so you do not have to modify code to force it to use a specific device ID.

  2. If one of the devices is driving a display, the so-called watchdog timer may interrupt long-running computations in order to prevent a permanent freeze of the display. If the application does not explicitly check for kernel launch errors, such occurrences may look like miscomputations (as the kernel execution did not complete). In such cases it may help to install another (cheap, non-CUDA) GPU just to drive the display.

  3. If the installed cards are factory overclocked, try configuring them to default clocks. Cards can sometimes miscompute when running above stock clocks. This may not noticeably affect gaming workloads, but in CUDA the results may get corrupted.
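To make points 1 and 2 concrete, here is a minimal sketch (my own generic boilerplate, not your code): it initializes the allocation to a known state with cudaMemset and checks every API call as well as the kernel launch, so that a watchdog abort shows up as an explicit error rather than as silently wrong results.

```cpp
// Minimal sketch of points 1 and 2: known initial state plus error checking
// on every API call and kernel launch. Kernel and sizes are placeholders.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err__ = (call);                                        \
        if (err__ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err__), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* d_x = nullptr;
    CHECK_CUDA(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK_CUDA(cudaMemset(d_x, 0, n * sizeof(float)));   // known initial state

    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
    CHECK_CUDA(cudaGetLastError());       // catches launch failures
    CHECK_CUDA(cudaDeviceSynchronize());  // catches watchdog/runtime aborts

    CHECK_CUDA(cudaFree(d_x));
    return 0;
}
```

To run such a test on just the suspect card without modifying the code, the binary can be launched with CUDA_VISIBLE_DEVICES set to that card’s index.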

Christian

Another possibility is to clear static electricity from the device. One thing that helped me was to fully shut down the machine and switch off the PSU for at least 15 seconds.

If that does not help, you can open the computer, take out the misbehaving GPU, and gently touch it with a cable connected to ground (or to flowing water), in order to completely remove any remaining charge.

Flowing water? I am actually surprised you did not recommend putting the card into a microwave oven for a proper exorcism.

Haha! Sorry for the misunderstanding; water does absorb charges and can stand in for an earth connection. Of course you should not put the card under the water, only the cable, with the other end of the cable touching various parts of the card.

Here’s JayZ2Cents spraying a poor 1650 graphics card with water while it is in operation.

https://youtu.be/iJUl_IqDbNA

Spoiler: it dies

grabs GPU firmly with both hands

“I cast thee out, foul demon!”

As for the water spraying / rinsing method, I think it is very likely that both GPUs would behave identically after this treatment, because they both would be bricks (i.e. non-functional, dead, pushing up the daisies, …, EX-GPUs).

Seriously now, in case of mysterious differences:

(1) Add proper error checking to all API calls and all kernel invocations
(2) Check code for: Accesses out of bounds, uninitialized data, race conditions
(3) Check code for: Non-deterministic code, such as atomic floating-point adds (see the sketch below)
(4) Run code under control of cuda-memcheck and fix all issues reported

The reason for these recommendations is that most instances of mysterious differences in code behavior are due to software.
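To illustrate point (3): a reduction built on atomicAdd can return slightly different sums from run to run (and between GPUs), because floating-point addition is not associative and the order in which threads perform the atomic additions is not fixed. The toy kernel below is only an illustration; the input values are arbitrary and chosen to make rounding effects visible.

```cpp
// Toy illustration of non-determinism from atomic floating-point adds:
// the same input can produce sums that differ in the last bits between runs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void atomic_sum(const float* x, int n, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, x[i]);
}

int main() {
    const int n = 1 << 20;
    float* h_x = new float[n];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f + 1e-7f * (i % 97);

    float *d_x, *d_sum;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    for (int run = 0; run < 3; ++run) {
        cudaMemset(d_sum, 0, sizeof(float));
        atomic_sum<<<(n + 255) / 256, 256>>>(d_x, n, d_sum);
        float sum;
        cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
        printf("run %d: sum = %.8f\n", run, sum);  // may differ in the last bits
    }

    delete[] h_x;
    cudaFree(d_x);
    cudaFree(d_sum);
    return 0;
}
```

Such differences are expected behavior, not a hardware defect, which is why deterministic reductions (or higher-precision accumulation) are preferable whenever bitwise reproducibility matters.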

There are other treatments I believe you can apply to your cards if they are still misbehaving, such as:
https://youtu.be/zQtY0S06AWg