Same code gives different results on two NVIDIA 2080 Ti GPUs

I have two identical 2080 Ti cards installed in one desktop for general-purpose computing, and my code is written using the cuBLAS library. The machine is only a couple of weeks old. Everything looked fine until three days ago, when I started getting errors running the code on one of the cards. I then switched to the other card using cudaSetDevice() and ran the same code, and no error occurred.

What could be the reason for this problem? Did I break one of the cards?
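For reference, below is a minimal sketch of the kind of cross-device comparison I am describing (not my actual code; the SGEMM call, matrix size, and input values are just placeholders): it runs the same cuBLAS computation on each device via cudaSetDevice() and compares the results on the host.

```cpp
// Minimal sketch, not the actual application: run the same SGEMM on each
// device with cudaSetDevice() and compare the results on the host.
// Matrix size and input values are arbitrary placeholders.
#include <cstdio>
#include <vector>
#include <cmath>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static std::vector<float> gemm_on_device(int dev, const std::vector<float>& A,
                                         const std::vector<float>& B, int n) {
    cudaSetDevice(dev);
    float *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
    cudaMemset(dC, 0, bytes);                       // start from a known state
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(h);

    std::vector<float> C(n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}

int main() {
    const int n = 256;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f);
    std::vector<float> c0 = gemm_on_device(0, A, B, n);
    std::vector<float> c1 = gemm_on_device(1, A, B, n);
    float maxdiff = 0.0f;
    for (size_t i = 0; i < c0.size(); ++i)
        maxdiff = std::fmax(maxdiff, std::fabs(c0[i] - c1[i]));
    printf("max difference between device 0 and device 1: %g\n", maxdiff);
    return 0;
}
```

With two identical, healthy cards and the same library version, I would expect the reported difference to be zero.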

Three things come to mind here:

  1. There is the possibility that your code does not initialize allocated memory to a known state.

Depending on the state the card’s memory was in previously, this can produce different computation results. However, this kind of problem would typically manifest itself randomly on any GPU, not just on a single device as you observed. (A short sketch after this list shows how to initialize allocations to a known state and check for errors.)

Try running the sample programs from the CUDA Toolkit on the affected GPU to see whether the built-in self-tests pass.

You may be able to set the environment variable CUDA_VISIBLE_DEVICES before running the samples, so you do not have to modify code to force it to use a specific device ID.

  2. If one of the devices is driving a display, the so-called watchdog timer may interrupt long-running computations in order to prevent a permanent freeze of the display. If the application does not explicitly check for kernel launch errors, such occurrences may look like miscomputations (as the kernel execution did not complete). In such cases it may help to install another (cheap, non-CUDA) GPU just to drive the display.

  3. If the installed cards are factory overclocked, try configuring them to default clocks. Cards can sometimes miscompute when running above stock clocks. This may not noticeably affect gaming workloads, but in CUDA the results may get corrupted.
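To make points 1 and 2 concrete, here is a minimal sketch (my own generic boilerplate, not your code): it initializes the allocation to a known state with cudaMemset and checks every API call as well as the kernel launch, so that a watchdog abort shows up as an explicit error rather than as silently wrong results.

```cpp
// Minimal sketch of points 1 and 2: known initial state plus error checking
// on every API call and kernel launch. Kernel and sizes are placeholders.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err__ = (call);                                        \
        if (err__ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err__), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* d_x = nullptr;
    CHECK_CUDA(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK_CUDA(cudaMemset(d_x, 0, n * sizeof(float)));   // known initial state

    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
    CHECK_CUDA(cudaGetLastError());       // catches launch failures
    CHECK_CUDA(cudaDeviceSynchronize());  // catches watchdog/runtime aborts

    CHECK_CUDA(cudaFree(d_x));
    return 0;
}
```

To run such a test on just the suspect card without modifying the code, the binary can be launched with CUDA_VISIBLE_DEVICES set to that card’s index.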

Christian

Another possibility is to clear static electricity from the device. One thing that helped me was to fully shut down the machine and switch off the PSU for at least 15 seconds.

If that does not help, you can open the computer, take out the misbehaving GPU, and gently touch it with a cable connected to ground (or to flowing water), in order to completely remove any remaining charge.

Flowing water? I am actually surprised you did not recommend putting the card into a microwave oven for a proper exorcism.

Haha! Sorry for the misunderstanding; water does absorb charges and can stand in for an earth connection. Of course you should not put the card under the water, only the cable, with the other end of the cable touching various parts of the card.

Here’s JayZ2Cents spraying a poor 1650 graphics card with water while it is in operation.

https://youtu.be/iJUl_IqDbNA

Spoiler: it dies

grabs GPU firmly with both hands

“I cast thee out, foul demon!”

As for the water spraying / rinsing method, I think it is very likely that both GPUs would behave identically after this treatment, because they both would be bricks (i.e. non-functional, dead, pushing up the daisies, …, EX-GPUs).

Seriously now, in case of mysterious differences:

(1) Add proper error checking to all API calls and all kernel invocations
(2) Check code for: Accesses out of bounds, uninitialized data, race conditions
(3) Check code for: Non-deterministic code, such as atomic floating-point adds (see the sketch below)
(4) Run code under control of cuda-memcheck and fix all issues reported

The reason for these recommendations is that most instances of mysterious differences in code behavior are due to software.
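To illustrate point (3): a reduction built on atomicAdd can return slightly different sums from run to run (and between GPUs), because floating-point addition is not associative and the order in which threads perform the atomic additions is not fixed. The toy kernel below is only an illustration; the input values are arbitrary and chosen to make rounding effects visible.

```cpp
// Toy illustration of non-determinism from atomic floating-point adds:
// the same input can produce sums that differ in the last bits between runs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void atomic_sum(const float* x, int n, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, x[i]);
}

int main() {
    const int n = 1 << 20;
    float* h_x = new float[n];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f + 1e-7f * (i % 97);

    float *d_x, *d_sum;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    for (int run = 0; run < 3; ++run) {
        cudaMemset(d_sum, 0, sizeof(float));
        atomic_sum<<<(n + 255) / 256, 256>>>(d_x, n, d_sum);
        float sum;
        cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
        printf("run %d: sum = %.8f\n", run, sum);  // may differ in the last bits
    }

    delete[] h_x;
    cudaFree(d_x);
    cudaFree(d_sum);
    return 0;
}
```

Such differences are expected behavior, not a hardware defect, which is why deterministic reductions (or higher-precision accumulation) are preferable whenever bitwise reproducibility matters.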

There are other treatments I believe you can apply to your cards if they are still misbehaving, such as:
https://youtu.be/zQtY0S06AWg