Hello all,
This issue has me completely stumped. I have scoured the internet and tried everything that seemed to work for people having a similar issue but to no avail.
In short, 2 of the 6 graphics cards installed in my ETH mining server are not detected by nvidia-smi. One of them was working fine on the same PCIe port until I installed a new card on a different port. Now the new card works, but the old one does not. The 6th and final card and its port have never been confirmed functional.
lspci shows all cards:
01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
07:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
But dmesg shows a failure of RmInitAdapter from NVRM:
[ 75.298825] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 75.298873] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 4
[ 75.415957] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 75.416004] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 5
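For anyone sifting through a longer log, here is a minimal sketch of how the failing bus IDs and error codes could be pulled out of dmesg text. The function name `failed_adapters` is made up for illustration, and the regular expression assumes the NVRM lines look exactly like the ones quoted above; other driver versions may word them differently.

```python
import re

# Matches lines shaped like the ones quoted above, e.g.
#   NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x23:0xffff:624)
# Assumption: the message format is the same as in this post.
NVRM_FAIL = re.compile(
    r"NVRM: GPU (?P<bus>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def failed_adapters(dmesg_text):
    """Return {bus_id: error_code} for every RmInitAdapter failure found."""
    return {m.group("bus"): m.group("code") for m in NVRM_FAIL.finditer(dmesg_text)}


# The four dmesg lines from this post, used as sample input.
sample = """\
[ 75.298825] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 75.298873] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 4
[ 75.415957] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 75.416004] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 5
"""

print(failed_adapters(sample))
```

On a live system the text would come from the kernel log instead of the `sample` string; both failing cards here report the same code, 0x23:0xffff:624.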
Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 106… On | 00000000:01:00.0 Off | N/A |
| 59% 74C P2 101W / 120W | 4352MiB / 6076MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 106… On | 00000000:02:00.0 Off | N/A |
| 50% 74C P2 94W / 120W | 4345MiB / 6078MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 106… On | 00000000:04:00.0 Off | N/A |
| 39% 72C P2 92W / 120W | 4345MiB / 6078MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 106… On | 00000000:05:00.0 Off | N/A |
| 43% 72C P2 94W / 120W | 4345MiB / 6078MiB | 97% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1037 G /usr/lib/xorg/Xorg 8MiB |
| 0 N/A N/A 1116 G /usr/bin/gnome-shell 1MiB |
| 0 N/A N/A 1446 C ethminer 4337MiB |
| 1 N/A N/A 1037 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1446 C ethminer 4337MiB |
| 2 N/A N/A 1037 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1446 C ethminer 4337MiB |
| 3 N/A N/A 1037 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1446 C ethminer 4337MiB |
+-----------------------------------------------------------------------------+
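To make the mismatch explicit, here is a rough sketch that compares the bus IDs listed by lspci with the ones nvidia-smi reports. The helper names `bus_ids` and `missing_from_driver` are hypothetical, and the sample strings are cut down from the output in this post. The sketch assumes lspci prints IDs without a PCI domain (`01:00.0`) while nvidia-smi prints them with one (`00000000:01:00.0`), so it drops the domain before comparing.

```python
import re

# Captures the bus:device.function part, with or without a leading
# PCI domain such as "0000:" (dmesg) or "00000000:" (nvidia-smi).
BUS_RE = re.compile(
    r"\b(?:[0-9a-fA-F]{4,8}:)?([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\b"
)


def bus_ids(text):
    """Return the set of bus:device.function IDs found in text, lowercased."""
    return {m.group(1).lower() for m in BUS_RE.finditer(text)}


def missing_from_driver(lspci_text, smi_text):
    """Return bus IDs that lspci lists but nvidia-smi does not, sorted."""
    return sorted(bus_ids(lspci_text) - bus_ids(smi_text))


# Sample input cut down from this post.
lspci_sample = """\
01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
07:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
"""
smi_sample = """\
00000000:01:00.0
00000000:02:00.0
00000000:04:00.0
00000000:05:00.0
"""

print(missing_from_driver(lspci_sample, smi_sample))
```

With the data from this post, the two IDs left over are 07:00.0 and 08:00.0, the same two devices that fail RmInitAdapter in dmesg.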
I have already tried running nvidia-persistenced on boot, without success. I am almost certain it is not a hardware issue, at least for the card that was previously working on the same PCIe port. I doubt it is a kernel configuration issue (I haven’t touched my kernel). Does anyone have any ideas?
I am running an almost fresh install of Ubuntu Server 20.04.2 LTS. Here is my bug report: nvidia-bug-report.log.gz (881.0 KB)
Thanks in advance!