nvidia-smi recognizes only two GPUs instead of three.

root@dl380-01:~# lspci |grep -i vga
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eH3 (rev 02)
37:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
d8:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

OS: Ubuntu Server 18.04

Fri Oct 25 14:22:33 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     Off  | 00000000:37:00.0 Off |                  Off |
| 33%   36C    P0    73W / 260W |      0MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     Off  | 00000000:86:00.0 Off |                  Off |
| 34%   34C    P0    58W / 260W |      0MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Can someone help me, please?
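
The mismatch above (three NVIDIA controllers in lspci, two GPUs in nvidia-smi) can also be checked mechanically. The following is only a sketch: the function names are hypothetical, and it assumes the nvidia-smi bus IDs were saved one per line, for example with `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`.

```shell
#!/bin/sh
# Sketch: list NVIDIA display devices that lspci sees but nvidia-smi does not.
# All function names here are illustrative, not part of any NVIDIA tool.

# Read PCI bus IDs on stdin and print them as lowercase "dddd:bb:dd.f".
normalize_bus_ids() {
  tr 'A-F' 'a-f' |
    sed -E -e 's/^[[:space:]]+//' -e 's/[[:space:]]+$//' -e '/^$/d' \
           -e 's/^0{4}([0-9a-f]{4}:)/\1/' \
           -e 's/^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])$/0000:\1/'
}

# Read `lspci` output on stdin and print the bus IDs of NVIDIA VGA/3D controllers.
lspci_nvidia_gpus() {
  grep -i 'nvidia' | grep -Ei 'VGA compatible controller|3D controller' |
    awk '{print $1}' | normalize_bus_ids
}

# missing_gpus LSPCI_FILE SMI_FILE
# Print every bus ID found in LSPCI_FILE that is absent from SMI_FILE.
missing_gpus() {
  smi_ids=$(normalize_bus_ids < "$2")
  lspci_nvidia_gpus < "$1" | while read -r id; do
    printf '%s\n' "$smi_ids" | grep -qxF "$id" || printf '%s\n' "$id"
  done
}

# Example usage (run as root on the affected machine):
#   lspci > /tmp/lspci.txt
#   nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader > /tmp/smi.txt
#   missing_gpus /tmp/lspci.txt /tmp/smi.txt
```

With the lspci and nvidia-smi output shown in this post, the only bus ID reported as missing would be 0000:d8:00.0.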

nvidia-bug-report.log.gz (1.65 MB)

nvidia-bug-report.log.gz (1.69 MB)

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Hello,

My apologies for not responding right away. Could you please tell me where I can find the script? I am currently running Ubuntu 18.04 Server. Please advise.

Thanks,
Ishac

root@dl380-01:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

It’s installed alongside the driver. Just run:
sudo nvidia-bug-report.sh

Got it, I just did.

Oct 29 11:49:35 dl380-01 kernel: NVRM: RmInitAdapter failed! (0x26:0xffff:1106)
Oct 29 11:49:35 dl380-01 kernel: NVRM: rm_init_adapter failed for device bearing minor number 2

Looks like faulty hardware. Try reseating the card and checking it in another system. RMA it if the issue persists.

Hello,

I have just reseated the card but am still getting the same issue. The card is not detected by the OS.
Please see the output of nvidia-smi:

root@dl380-01:~# nvidia-smi
Tue Nov 5 13:20:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:37:00.0 Off |                  Off |
| 34%   27C    P8    18W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:86:00.0 Off |                  Off |
| 35%   26C    P8     5W / 260W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Hi Team,

Is there any update on this? Also, just to confirm, we tried different driver versions as well, and we still could not detect the third GPU.

We are still seeing the following messages:

[ 6.195925] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.26 Tue Jun 4 17:40:52 CDT 2019
[ 12.054815] NVRM: GPU 0000:d8:00.0: RmInitAdapter failed! (0x26:0xffff:1155)
[ 12.054849] NVRM: GPU 0000:d8:00.0: rm_init_adapter failed, device minor number 2
[ 130.263699] NVRM: GPU 0000:d8:00.0: RmInitAdapter failed! (0x26:0xffff:1155)
[ 130.263737] NVRM: GPU 0000:d8:00.0: rm_init_adapter failed, device minor number 2
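
To pull the failing device out of a longer kernel log, a filter along these lines may help. It is a rough sketch: the function name is hypothetical, and it assumes the message format shown above, where the line carries a "GPU <bus id>:" prefix. Older drivers that print the failure without a bus ID will not match.

```shell
#!/bin/sh
# Sketch: summarize NVRM RmInitAdapter failures from kernel log text on stdin.
# Prints one "<bus id> <error code>" line per distinct failure.
# nvrm_init_failures is an illustrative name, not an NVIDIA tool.
nvrm_init_failures() {
  sed -nE 's/.*NVRM: GPU ([0-9a-fA-F:.]+): RmInitAdapter failed! \(([^)]*)\).*/\1 \2/p' |
    sort -u
}

# Example usage:
#   dmesg | nvrm_init_failures
```

For the messages quoted above, this reduces the repeated failures to a single line naming 0000:d8:00.0 and the code 0x26:0xffff:1155.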

cat /proc/driver/nvidia/gpus/0000:d8:00.0/information
Model: Unknown
IRQ: 361
GPU UUID: GPU-???-???-???-???-???
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:d8:00.0
Device Minor: 2
Blacklisted: No
nvidia-bug-report.log (3.27 MB)
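
The same check can be run across every GPU the driver has registered. This is a minimal sketch with a hypothetical function name. It assumes that "Model: Unknown" in the information file marks a GPU the driver could not initialize, as in the output above, and that the files live under /proc/driver/nvidia/gpus.

```shell
#!/bin/sh
# Sketch: print the bus location of every GPU whose information file
# reports "Model: Unknown". The directory can be overridden for testing.
# uninitialized_gpus is an illustrative name, not an NVIDIA tool.
uninitialized_gpus() {
  gpu_dir=${1:-/proc/driver/nvidia/gpus}
  for info in "$gpu_dir"/*/information; do
    # Skip the unexpanded glob when no GPUs are registered.
    [ -r "$info" ] || continue
    if grep -Eq '^Model:[[:space:]]*Unknown' "$info"; then
      sed -n 's/^Bus Location:[[:space:]]*//p' "$info"
    fi
  done
}

# Example usage:
#   uninitialized_gpus
```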

Please check if the card works in another system, if not, RMA.

Just to confirm: if the card does not work in another system, should we consider the card defective?

Yes, RMA it then.

Thank you. We will try swapping the card and update.

Hey, can you give an update on this? Has your problem been solved?
Were you able to find the original cause of the problem?