Second GPU is lost (nvidia-smi) after seconds/minutes in Ubuntu

Hello all,

I have a problem with my second GPU, and Googling has brought me here. In short: my second, water-cooled GTX 1080 is reported as lost by nvidia-smi on Ubuntu (or at least marked as lost) after anywhere from seconds to minutes on the Ubuntu desktop.

What I have already done:

  • reseated both cards
  • replugged the PCIe power cables (I found that the bottom one was not plugged in all the way)

The card is not overheating, as nvidia-smi reports around 20 degrees Celsius. I was wondering if something else (like the memory or the power delivery phases) could be overheating.
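To see what the card reports right before it drops, the temperature and power draw can be logged over time. Below is a minimal sketch, assuming nvidia-smi is on the PATH and that the `--query-gpu` fields `index`, `name`, `temperature.gpu` and `power.draw` are supported by the installed driver. The helper names (`parse_gpu_rows`, `query_gpus`) are made up for illustration.

```python
import csv
import io
import subprocess

# Fields passed to nvidia-smi --query-gpu (assumed to be supported by the driver).
QUERY_FIELDS = ["index", "name", "temperature.gpu", "power.draw"]


def _to_float(text):
    """Return a float, or None for values such as "[N/A]"."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_rows(csv_text):
    """Parse output of nvidia-smi --query-gpu=... --format=csv,noheader,nounits."""
    rows = []
    for raw in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(raw) != len(QUERY_FIELDS):
            continue  # blank line or an error message, not a data row
        index, name, temp, power = (field.strip() for field in raw)
        if not index.isdigit():
            continue  # not a GPU row
        rows.append({
            "index": int(index),
            "name": name,
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_rows(result.stdout)
```

Calling `query_gpus()` every few seconds and writing the rows to a file with a timestamp should show roughly when the second card stops appearing or stops returning values. Note that this only covers the core temperature sensor, so it cannot rule out hot memory or power delivery components.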

I believe this is not a novel problem, but I have no idea what other information would be useful, so please let me know what you need. I went through the nvidia-bug-report.log, but I could not find anything interesting on my own. Any help would be appreciated.
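For anyone else reading such a log by hand: the NVIDIA kernel driver reports GPU faults as "Xid" messages, so searching the log for those lines is a reasonable first step. Below is a minimal sketch, assuming the messages follow the usual `NVRM: Xid (PCI:<address>): <code>, <detail>` layout. The exact format can differ between driver versions, and the function names are made up for illustration.

```python
import gzip
import re

# Assumed layout of an Xid line in the kernel log section of the report.
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def read_log(path):
    """Read a plain or gzip-compressed log file as text."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        return handle.read()


def find_xid_events(log_text):
    """Return (pci_address, xid_code, detail) for every Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((
                match.group("pci"),
                int(match.group("code")),
                match.group("detail").strip(),
            ))
    return events
```

Usage would look like `find_xid_events(read_log("nvidia-bug-report.log.gz"))`. An empty result does not prove the hardware is healthy; it only means no line matched the assumed pattern.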

bug-report.gz (252.9 KB)

Edit: my machine ran for several days or weeks with the power cable of the second card not plugged in all the way. I did not notice when the second GPU failed, since I wasn’t using it during this period.

Did you try swapping power cables between both gpus?

I did not, thanks for your suggestion. However, when I tried to load the GPU again today, it was totally fine. My best guess is that the unstable power delivery (due to the loose cable) made the entire device unstable. I think that plugging the power cable in properly and letting the card sit idle for a couple of days restabilized it.

Thanks for your help anyway!