Dual Quadro K6000 on RHEL 7.1 - only ONE detected by the Nvidia driver, but lspci sees TWO

I have two Quadro K6000 cards installed in a Dell 7910 running RHEL 7.1. I have installed CUDA toolkit 7.0 and the latest Linux drivers.

lspci gives me:

[root@KWFT7910 puremd_rc_1001]# lspci | grep -i NVIDIA
03:00.0 VGA compatible controller: NVIDIA Corporation GK110GL [Quadro K6000] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GK110 HDMI Audio (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GK110GL [Quadro K6000] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GK110 HDMI Audio (rev a1)
[root@KWFT7910 puremd_rc_1001]#

So the system sees both K6000s.

nvidia-smi gives me:

[root@KWFT7910 puremd_rc_1001]# nvidia-smi
Sun Jul 26 20:51:59 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.21     Driver Version: 352.21         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K6000        Off  | 0000:03:00.0      On |                  Off |
| 27%   44C    P8    21W / 225W |    401MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2183    G   /usr/bin/Xorg                                  176MiB |
|    0      3964    G   /usr/bin/gnome-shell                           199MiB |
+-----------------------------------------------------------------------------+
[root@KWFT7910 puremd_rc_1001]#

The Nvidia driver sees only one K6000.
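The mismatch can be made explicit by counting the GPUs each tool reports. The following is only a sketch: the helper names `count_pci_gpus` and `count_driver_gpus` are made up for illustration, and it assumes that `nvidia-smi -L` prints one `GPU N: ...` line per device the driver managed to initialize. Both helpers read the tool output from stdin, so they can also be run against saved output.

```shell
# Count NVIDIA display-class PCI functions in `lspci` output read from stdin.
# The HDMI audio functions (xx:00.1) are deliberately not counted.
count_pci_gpus() {
    grep -i 'nvidia' | grep -Ec 'VGA compatible controller|3D controller'
}

# Count GPUs in `nvidia-smi -L` output read from stdin.
# Assumption: each initialized GPU appears as one line starting "GPU N:".
count_driver_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Live usage (needs the real tools, run as root):
#   pci=$(lspci | count_pci_gpus)
#   drv=$(nvidia-smi -L | count_driver_gpus)
#   [ "$pci" -eq "$drv" ] || echo "driver sees $drv of $pci GPUs"
```

If the two counts differ, the card is visible on the PCI bus but the driver did not bring it up, which points at the kernel log rather than at xorg.conf.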

My xorg.conf looks like this:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 352.21 (buildmeister@swio-display-x64-rhel04-13) Tue Jun 9 22:44:03 PDT 2015

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
    FontPath        "/usr/share/fonts/default/Type1"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/input/mice"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "MultiGPU" "on"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

I have looked all over the Nvidia site and searched Google for a way to set up these two cards so that both are visible to the Nvidia driver (and usable for CUDA). I have yet to find a solution.

Does anyone know what needs to be done to configure this system so that both cards are detected and work properly?

Thanks in advance for any help.

Jim Kress

Did you follow all the instructions in the Linux getting started guide?

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract

For example, which installation method did you use: package manager or runfile installer?

For the method you chose, did you have any previous drivers or toolkits installed via the other method?

Did you remove the nouveau driver from the system?

Finally, what is the output of the following, run as root:

lspci -vvv |grep -i -A 20 nvidia
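The nouveau question above can be double-checked by looking for the module in `lsmod` output. This is a minimal sketch; `nouveau_loaded` is a hypothetical helper name, and it only tells you whether the module is loaded right now, not whether it is blacklisted for the next boot.

```shell
# Exit 0 if a module named exactly "nouveau" is listed in `lsmod` output
# read from stdin, exit 1 otherwise.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Live usage:
#   if lsmod | nouveau_loaded; then
#       echo "nouveau is still loaded"
#   else
#       echo "nouveau is not loaded"
#   fi
```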

Yes.

Runfile.

Only nouveau, which I disabled per the directions provided by Nvidia, including in GRUB.

[root@KWFT7910 release]# lspci -vvv | grep -i -A 20 nvidia
03:00.0 VGA compatible controller: NVIDIA Corporation GK110GL [Quadro K6000] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1036
Physical Slot: 2
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 115
Region 0: Memory at ea000000 (32-bit, non-prefetchable)
Region 1: Memory at d0000000 (64-bit, prefetchable)
Region 3: Memory at e0000000 (64-bit, prefetchable)
Region 5: I/O ports at 5000
[virtual] Expansion ROM at eb000000 [disabled]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00718 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-

Kernel driver in use: nvidia

03:00.1 Audio device: NVIDIA Corporation GK110 HDMI Audio (rev a1)
Subsystem: NVIDIA Corporation Device 1036
Physical Slot: 2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 36
Region 0: Memory at eb080000 (32-bit, non-prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-

04:00.0 VGA compatible controller: NVIDIA Corporation GK110GL [Quadro K6000] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1036
Physical Slot: 4
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 40
Region 0: Memory at e8000000 (32-bit, non-prefetchable)
Region 1: Memory at b0000000 (64-bit, prefetchable)
Region 3: Memory at c0000000 (64-bit, prefetchable)
Region 5: I/O ports at 4000
Expansion ROM at e9000000 [disabled]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 00000000fee00738 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+

Kernel driver in use: nvidia

04:00.1 Audio device: NVIDIA Corporation GK110 HDMI Audio (rev a1)
Subsystem: NVIDIA Corporation Device 1036
Physical Slot: 4
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 44
Region 0: Memory at e9080000 (32-bit, non-prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
[root@KWFT7910 release]#
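In the output above, both 03:00.0 and 04:00.0 report "Kernel driver in use: nvidia", so the second card is bound to the driver even though nvidia-smi does not list it. For long `lspci -vvv` or `lspci -k` dumps, a small filter can pull out just that mapping. This is a sketch only; `pci_drivers` is a made-up helper name, and it assumes the default lspci address format without a domain prefix.

```shell
# Print "<bus-address> <driver>" for every PCI function in `lspci -k` or
# `lspci -vvv` output (read from stdin) that reports a bound kernel driver.
pci_drivers() {
    awk '
        # Device header line, e.g. "03:00.0 VGA compatible controller: ..."
        /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / { dev = $1 }
        # Driver line belonging to the most recent device header
        /Kernel driver in use:/ { print dev, $NF }
    '
}

# Live usage (run as root):
#   lspci -k | pci_drivers | grep -i nvidia
```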

What is the output of:

dmesg |grep NVRM

What is the output of (as root):

yum list nvidia*

[root@KWFT7910 ~]# dmesg | grep NVRM
[ 14.860572] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 352.21 Tue Jun 9 21:53:31 PDT 2015
[ 20.896395] NVRM: failed to copy vbios to system memory.
[ 20.897422] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[ 20.897440] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 20.897483] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[ 23.251314] NVRM: failed to copy vbios to system memory.
[ 23.252339] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[ 23.252354] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 23.252389] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[ 70.206385] NVRM: failed to copy vbios to system memory.
[ 70.206800] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[ 70.206807] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 70.206820] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[ 2884.047262] NVRM: failed to copy vbios to system memory.
[ 2884.047741] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[ 2884.047746] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 2884.047759] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[ 9738.560040] NVRM: failed to copy vbios to system memory.
[ 9738.560445] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[ 9738.560450] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 9738.560462] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[11800.812614] NVRM: failed to copy vbios to system memory.
[11800.813084] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[11800.813103] NVRM: rm_init_adapter failed for device bearing minor number 1
[11800.813116] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[13158.660848] NVRM: failed to copy vbios to system memory.
[13158.661278] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[13158.661285] NVRM: rm_init_adapter failed for device bearing minor number 1
[13158.661298] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[13235.788078] NVRM: failed to copy vbios to system memory.
[13235.788587] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[13235.788593] NVRM: rm_init_adapter failed for device bearing minor number 1
[13235.788617] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[15565.218700] NVRM: failed to copy vbios to system memory.
[15565.219142] NVRM: RmInitAdapter failed! (0x30:0xffff:844)
[15565.219148] NVRM: rm_init_adapter failed for device bearing minor number 1
[15565.219162] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[root@KWFT7910 ~]#
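Every failure in the log above is for minor number 1, meaning it is always the same device that fails to initialize. With a longer log, the repeated blocks can be collapsed into one line per device. This is a sketch, not an official tool; `nvrm_failed_minors` is a hypothetical name, and the pattern is taken from the "rm_init_adapter failed" message exactly as it appears above.

```shell
# Summarize NVRM adapter-init failures per device minor number.
# Reads `dmesg` text from stdin and prints one line per minor number,
# e.g. "minor 1: 3 failed init attempts".
nvrm_failed_minors() {
    sed -n 's/.*NVRM: rm_init_adapter failed for device bearing minor number \([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "minor %s: %s failed init attempts\n", $2, $1 }'
}

# Live usage (run as root):
#   dmesg | nvrm_failed_minors
```

A single minor number failing repeatedly, while the other never appears, suggests a problem specific to one card or slot rather than a driver installation problem affecting both.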

[root@KWFT7910 ~]# yum list nvidia*
Loaded plugins: langpacks, product-id, subscription-manager
Error: No matching Packages to list
[root@KWFT7910 ~]#

Both K6000 devices require aux power connections. Do you have the necessary aux power cables/connections delivered to each card?

Do you have the latest Dell System BIOS installed on your 7910? What version is it?

Yes, both K6000 devices have the correct and complete power connections. I’ll have to check on the BIOS.

Yes, the latest BIOS is A01 and that is what I have on the workstation.

Is this a T7910 or an R7910? Try running these commands as root:

dmidecode -s system-product-name
dmidecode -s bios-version

and report back what is output.

For T7910, the latest public BIOS is A07, not A01:

http://www.dell.com/support/home/us/en/04/product-support/product/precision-t7910-workstation/drivers

For the R7910, the BIOS doesn’t use A0x numbering:

http://www.dell.com/support/home/us/en/04/product-support/product/precision-r7910-workstation/drivers

If you have a T7910, please try updating to the A07 BIOS first. If you have an R7910, please report your actual BIOS version.

Also, my colleague suggests the following:

"Also, he should probably try enabling 4GB MMIO in SBIOS setup if he hasn’t done that already."

Actually, I have a T7910 XL. It is limited to its A01 BIOS. The A07 BIOS is for the vanilla T7910.

I’ll try enabling the 4GB MMIO.

The 4GB MMIO made no difference.

I have resolved the problem. One of the K6000 cards was defective.