NVIDIA RTX A5000 randomly becomes unusable and has high idle power draw

I just received a replacement RTX A5000 (the previous one was RMA’d due to ROM corruption), and the replacement also seems likely to be defective…

The replacement unit idles at a very high power draw (80 W) and randomly becomes unusable after 0–15 minutes. Rebooting makes it reappear, only for it to vanish again later; occasionally the card is unusable right from boot.

More specifically, when in this unusable state, the card is still listed by lspci, but nvidia-smi does not report it, and PyTorch does not list it as usable either. In one occurrence, after sitting in this broken idle state for a few minutes, the fans started running at full throttle, much faster than I had ever heard the previous GPU spin under heavy workloads; it sounded and felt exactly like a very angry hairdryer.
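For anyone wanting to watch for this state programmatically, the "visible to lspci but gone from nvidia-smi" condition can be checked by diffing the two tools' outputs. This is only a sketch; the helper names are mine, and it assumes `lspci -D` output (with the PCI domain) and `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`:

```python
def _norm(addr: str) -> str:
    # lspci prints a 4-hex-digit PCI domain ("0000:3e:00.0"), while
    # nvidia-smi's pci.bus_id uses 8 digits and uppercase hex
    # ("00000000:3E:00.0"); normalize both to the lspci form.
    domain, bus, devfn = addr.strip().lower().split(":")
    return f"{domain[-4:]}:{bus}:{devfn}"

def nvidia_addrs_from_lspci(lspci_output: str) -> set[str]:
    """PCI addresses of NVIDIA devices in `lspci -D` output."""
    return {_norm(line.split()[0])
            for line in lspci_output.splitlines() if "NVIDIA" in line}

def addrs_from_smi(smi_output: str) -> set[str]:
    """Addresses from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`."""
    return {_norm(line) for line in smi_output.splitlines() if line.strip()}

def missing_gpus(lspci_output: str, smi_output: str) -> set[str]:
    """GPUs visible on the PCI bus but not initialized by the driver."""
    return nvidia_addrs_from_lspci(lspci_output) - addrs_from_smi(smi_output)
```

Feeding the captured output of both commands into `missing_gpus` returns the addresses of cards that the bus sees but the driver failed to bring up.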

I have tried reseating the card twice to no avail. The rest of the setup is identical to one which proved stable over multiple months before the ROM corruption issue. I happen to have an RTX A4000 on hand at the moment which behaves perfectly normally in this same setup. I have also tried reinstalling drivers; this had no effect.

I have attached the output of the debug script here. Hopefully it is possible to tell whether the card is indeed defective, or whether this can be fixed in software. This is getting frustrating…

nvidia-bug-report.log.gz (149.2 KB)

Adding more info: the following messages show up in dmesg. They seem likely related to the issue, as other people have reported unrecognized GPUs whenever "RmInitAdapter failed!" shows up in the log.

[Wed Jun  8 20:31:30 2022] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000a0000-0x000dffff window]
[Wed Jun  8 20:31:30 2022] caller os_map_kernel_space.part.0+0x82/0xb0 [nvidia] mapping multiple BARs
[Wed Jun  8 20:31:30 2022] NVRM: GPU 0000:3e:00.0: RmInitAdapter failed! (0x24:0x72:1417)
[Wed Jun  8 20:31:30 2022] NVRM: GPU 0000:3e:00.0: rm_init_adapter failed, device minor number 0
[Wed Jun  8 20:31:30 2022] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000a0000-0x000dffff window]
[Wed Jun  8 20:31:30 2022] caller os_map_kernel_space.part.0+0x82/0xb0 [nvidia] mapping multiple BARs
[Wed Jun  8 20:31:31 2022] NVRM: GPU 0000:3e:00.0: RmInitAdapter failed! (0x24:0x72:1417)
[Wed Jun  8 20:31:31 2022] NVRM: GPU 0000:3e:00.0: rm_init_adapter failed, device minor number 0
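For anyone searching on these messages: the interesting part seems to be the error triple after "RmInitAdapter failed!" (here 0x24:0x72:1417; the field meanings are NVIDIA-internal as far as I can tell). A small sketch to pull the triples out of a dmesg dump for counting or correlating across reboots; the regex is just my reading of the format above, nothing official:

```python
import re

# Matches the triple in e.g. "RmInitAdapter failed! (0x24:0x72:1417)".
_RM_INIT_RE = re.compile(
    r"RmInitAdapter failed! \((0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(\d+)\)"
)

def rm_init_failures(dmesg_text: str) -> list[tuple[str, str, str]]:
    """All RmInitAdapter error triples found in a dmesg dump."""
    return _RM_INIT_RE.findall(dmesg_text)
```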

This thread shows a very similar set of error messages. I am using Ubuntu 20.04.3 LTS (GNU/Linux 5.13.0-35-generic x86_64), however, and the machine is headless, accessed over SSH.

The card is also unusable on Windows 10; it falls back to basic VGA graphics with a Code 43 error in Device Manager. Additionally, Device Manager reports the following state information:

  • PCI error reporting - 00000007
  • PCI uncorrectable error severity - 00462030
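Out of curiosity I tried decoding the severity value against the bit layout of the PCIe AER uncorrectable-error registers. Caveat: this is my own reading of the spec's bit assignments, and the severity register only controls which error types would be treated as fatal; it does not record which errors actually occurred.

```python
# Bit positions shared by the PCIe AER uncorrectable-error registers
# (status/mask/severity), per the AER capability in the PCIe spec.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}

def decode_uncor(value: int) -> list[str]:
    """Names of the AER uncorrectable-error bits set in `value`."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value >> bit & 1]
```

Running `decode_uncor(0x00462030)` names the error types the card would classify as fatal, but as noted above, that by itself does not prove any of them fired.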

PNY is willing to go ahead with an RMA, but if this is indeed a software issue, I would rather spend the time debugging it than lose access to compute hardware for another two months.

After discussing this with some knowledgeable colleagues and not getting any further here, I have gone ahead with the RMA. This is just disappointing.