My RTX 3070 was running fine until it stopped being detected on Monday.
I am running Ubuntu 20.04 on a Lenovo ThinkPad T480. The GPU is connected to the laptop over
Thunderbolt 3 through a Razer Core X external GPU enclosure. The enclosure works fine
with other GPUs, and the problem with my RTX 3070 persists when I use a different laptop running Manjaro 20.
Here are the outputs from dmesg and lspci:
Output from dmesg | grep NVRM:
christoph@t480 ~ dmesg | grep NVRM Wed 31 Mar 2021 13:26:27 BST
[240534.560633] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020
[240536.050245] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[240536.050377] NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0
[240536.115191] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
[240536.115316] NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0
[240540.221966] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
[240540.222009] NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0
[240540.256522] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
[240540.256633] NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0
[240576.531056] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
[240576.531183] NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0
[240576.566537] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
[240576.566583] NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0
[240581.934534] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
Output from lspci -vvv | grep -i nvidia:
christoph@t480 ~ lspci -vvv | grep -i nvidia Wed 31 Mar 2021 13:26:28 BST
09:00.0 VGA compatible controller: NVIDIA Corporation Device 2484 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Device 146b
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
09:00.1 Audio device: NVIDIA Corporation Device 228b (rev a1)
    Subsystem: NVIDIA Corporation Device 146b
The system still detects an NVIDIA device, but lspci no longer shows a model name for the card, only the generic "Device 2484".
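In case it helps anyone comparing notes, below is a minimal Python sketch for tallying the RmInitAdapter error codes in the dmesg output above. The sample lines are copied from my log; the function name count_init_failures and the regular expression are my own and are not part of any NVIDIA tool, so treat this as a rough aid rather than a proper diagnostic.

```python
import re
from collections import Counter

# Sample lines copied from the dmesg output in this post.
SAMPLE = """\
[240536.050245] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[240536.050377] NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0
[240536.115191] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
[240540.221966] NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x24:0xffff:1248)
"""

# Assumed line format, inferred only from the log lines above:
# "NVRM: GPU <pci address>: RmInitAdapter failed! (<code>)"
PATTERN = re.compile(r"NVRM: GPU (\S+): RmInitAdapter failed! \(([^)]+)\)")


def count_init_failures(log_text):
    """Count RmInitAdapter failures per (PCI address, error code) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (address, code), n in sorted(count_init_failures(SAMPLE).items()):
        print(f"{address} {code}: {n}")
```

Replacing SAMPLE with the full text of dmesg | grep NVRM would show whether the first code (0x26:0xffff:1290) ever recurs or whether every later attempt fails with 0x24:0xffff:1248.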
The card has been used very actively for about two months running PyTorch.
Please let me know if there are any further troubleshooting steps I can try, or otherwise if I can send in the card for repair.
Kindest regards,
Christoph