nvidia-smi is not detecting one card out of three

maheshcn · October 9, 2015, 5:59am

Hi,

I have three Tesla M2090 cards in my machine, but nvidia-smi detects only two of them.
The OS is Debian Jessie amd64, with CUDA 7.5.

Output of nvidia-smi

Output of lspci | grep Tesla

04:00.0 3D controller: NVIDIA Corporation GF110GL [Tesla M2090] (rev a1)
83:00.0 3D controller: NVIDIA Corporation GF110GL [Tesla M2090] (rev a1)
84:00.0 3D controller: NVIDIA Corporation GF110GL [Tesla M2090] (rev a1)

Can someone help me fix this issue?

Robert_Crovella · October 9, 2015, 2:11pm

What kind of system are the M2090 cards installed in? (manufacturer, and model number)
Do all 3 M2090 cards have the correct auxiliary power connections supplied?

What is the output of the following command, run as root?

dmesg | grep NVRM

maheshcn · March 7, 2016, 9:52am

Yes, auxiliary power is connected on all cards. The cards are from NVIDIA.

Output of dmesg | grep NVIDIA

[ 40.403805] nvidia: module license 'NVIDIA' taints kernel.
[ 40.433790] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 40.435838] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 40.435840] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 352.39 Fri Aug 14 18:09:10 PDT 2015
[11564.376205] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

Output of lspci | grep NVIDIA

04:00.0 3D controller: NVIDIA Corporation GF110GL [Tesla M2090] (rev a1)
83:00.0 3D controller: NVIDIA Corporation GF110GL [Tesla M2090] (rev a1)
84:00.0 3D controller: NVIDIA Corporation GF110GL [Tesla M2090] (rev a1)

Kernel is 3.16.0-4-amd64

njuffa · March 7, 2016, 3:05pm

I am not an expert on PCIe configuration, so I would definitely suggest waiting for txbob to respond. This seems to be the problem:

NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

As far as I understand, this likely indicates a problem with the system BIOS of your machine, which is unable to correctly map some required PCIe resource for one of the GPUs, possibly the BAR1 aperture.
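
One way to check this, as a rough sketch: list the memory regions (BARs) that lspci reports for each NVIDIA device and look for a GPU whose regions are missing or shown as unassigned. The function names `filter_regions` and `show_gpu_regions` are hypothetical helpers, `10de` is NVIDIA's PCI vendor ID, and the exact text lspci prints for an unmapped region may differ between pciutils versions.

```shell
#!/bin/sh
# Keep only the device header lines and the "Region N:" (BAR) lines
# from verbose lspci output. Reads text on stdin, so it also works
# on a saved log file.
filter_regions() {
    grep -E '^([0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f] |Region [0-9]+:'
}

# Hypothetical helper: show the BARs of every NVIDIA device.
# -vv prints the "Region N:" lines; -d 10de: selects vendor ID 10de.
show_gpu_regions() {
    lspci -vv -d 10de: | filter_regions
}
```

Calling `show_gpu_regions` on the affected machine and comparing the three cards side by side should show whether one of them was left without a usable region.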

Depending on what kind of system this is, a system BIOS update may be available from the vendor. I know from personal experience that Dell and HP offer system BIOS updates for quite a number of years; I recall installing such updates about three years after system purchase.

Robert_Crovella · March 7, 2016, 7:04pm

What njuffa said. If a system BIOS update doesn’t fix the issue, the only remaining option is to move the cards to another system.

maheshcn · March 8, 2016, 3:02am

This happened only after the OS was upgraded to a new version (Debian 7 to 8). Debian 7 detects all the cards properly.

njuffa · March 8, 2016, 4:07am

If you look at the Linux kernel bug logs, you will find that there are some still-unresolved issues regarding the allocation of PCIe resources, going back to at least 2012. I am not a kernel expert, but my understanding is that in some cases Linux, instead of falling back on the system BIOS for the allocation of PCIe resources, tries to be smarter and then fails. One such bug report is entitled “PCI resource assignments fail due to poor allocation strategy”.
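
To see whether the kernel itself reported trouble assigning PCIe resources, the kernel log can be searched for BAR allocation failures. This is only a sketch: `filter_bar_failures` and `check_bar_failures` are hypothetical helper names, and the message patterns are assumptions, since the wording of these messages varies between kernel versions.

```shell
#!/bin/sh
# Keep only kernel log lines that look like PCI BAR allocation failures.
# The patterns below are assumptions; adjust them for the kernel in use.
filter_bar_failures() {
    grep -Ei "BAR [0-9]+: (can't assign|failed to assign|no space for)"
}

# Hypothetical helper: scan the current kernel log (may require root).
check_bar_failures() {
    dmesg | filter_bar_failures
}
```

If `check_bar_failures` prints lines for the PCI address of the missing card, that would be consistent with a resource allocation problem rather than a hardware fault.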

Since you already know that your issue was caused by a change in the OS, a reasonable conclusion would be that the new Debian release introduced a bug. You could revert to Debian 7 and possibly file a bug report with the Debian maintainers. Please note that Debian is not a supported Linux platform for CUDA 7.5 according to the Linux Installation Guide (Installation Guide Linux :: CUDA Toolkit Documentation).