Ubuntu Box with multiple NVIDIA GPU Cards

I recently bought a box from System76 that has multiple GPUs: one Quadro M6000 and two Tesla K40s.

When I do lspci | grep -i nvidia it says

05:00.0 VGA compatible controller: NVIDIA Corporation Device 17f0 (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
06:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)

So, they’re there… But, when I do nvidia-smi -L it only shows

GPU 0: Quadro M6000 (UUID: GPU-09446504-6a9e-866a-a65d-0f1d55b7657b)

and, ls -l /dev/nvidia* shows

crw-rw-rw- 1 root root 195,   0 Aug  9 03:29 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Aug  9 03:29 /dev/nvidiactl
crw-rw-rw- 1 root root 248,   0 Aug 12 16:19 /dev/nvidia-uvm

I can’t be sure, but I’m guessing /dev/nvidia0 is the Quadro M6000, and perhaps the absence of /dev/nvidia1 and /dev/nvidia2 is another symptom (or perhaps the cause) of the box not seeing the Tesla K40s. Also, my test programs that call cudaGetDeviceCount report only 1 GPU.
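For anyone wanting to confirm the same mismatch quickly, here is a minimal sketch that compares the number of NVIDIA GPU functions on the PCI bus with the number of numbered /dev/nvidiaN nodes. The function names count_pci_gpus and count_dev_nodes are made up for illustration, and the sketch assumes lspci prints GPUs as "VGA compatible controller" or "3D controller" lines, as in the output above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any NVIDIA tool.

# Count NVIDIA GPU functions in `lspci` output read from stdin.
# Only "VGA compatible controller" and "3D controller" lines are counted,
# so the audio function (05:00.1 above) is ignored.
count_pci_gpus() {
    grep -i 'nvidia' | grep -c -E 'VGA compatible controller|3D controller'
}

# Count numbered device nodes (nvidia0, nvidia1, ...) in a directory.
# nvidiactl and nvidia-uvm are not per-GPU nodes and are not matched.
count_dev_nodes() {
    dir=${1:-/dev}
    n=0
    for f in "$dir"/nvidia[0-9]*; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Usage on a live system:
#   lspci | count_pci_gpus
#   count_dev_nodes /dev
```

If the two numbers differ, some cards are visible on the bus but have not been claimed by the NVIDIA driver, which is the situation described here.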

I’m running Ubuntu 14.04.3, and I’ve installed CUDA from cuda_7.0.28_linux.run (including the NVIDIA driver bundled in that run file).

Why are the other cards inaccessible? How do I make them accessible?

Thanks!

So you bought the box configured this way, and it’s not working? Perhaps you should discuss it with System76.

Anyway, try running the following commands as root:

dmesg | grep NVRM

lspci -vvv | grep -i -A 20 nvidia

and report back the output.

Hm. Interesting.

dmesg | grep NVRM yields:

[   13.932709] NVRM: The NVIDIA probe routine was not called for 2 device(s).
[   13.932712] NVRM: This can occur when a driver such as: 
[   13.932712] NVRM: nouveau, rivafb, nvidiafb or rivatv 
[   13.932712] NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[   13.932715] NVRM: Try unloading the conflicting kernel module (and/or
[   13.932715] NVRM: reconfigure your kernel without the conflicting
[   13.932715] NVRM: driver(s)), then try loading the NVIDIA kernel module
[   13.932715] NVRM: again.
[   13.932718] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  346.46  Tue Feb 17 17:56:08 PST 2015
[   15.706270] NVRM: Your system is not currently configured to drive a VGA console
[   15.706273] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
[   15.706274] NVRM: requires the use of a text-mode VGA console. Use of other console
[   15.706275] NVRM: drivers including, but not limited to, vesafb, may result in
[   15.706275] NVRM: corruption and stability problems, and is not supported.

That seems illuminating.

lspci -vvv | grep -i -A 20 nvidia gave me

pcilib: sysfs_read_vpd: read failed: Connection timed out
    05:00.0 VGA compatible controller: NVIDIA Corporation Device 17f0 (rev a1) (prog-if 00 [VGA controller])
    	Subsystem: NVIDIA Corporation Device 1129
    	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    	Latency: 0
    	Interrupt: pin A routed to IRQ 68
    	Region 0: Memory at f6000000 (32-bit, non-prefetchable) 
    	Region 1: Memory at a0000000 (64-bit, prefetchable) 
    	Region 3: Memory at b0000000 (64-bit, prefetchable) 
    	Region 5: I/O ports at e000 
    	[virtual] Expansion ROM at f7000000 [disabled] 
    	Capabilities: [60] Power Management version 3
    		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    		Address: 00000000fee00000  Data: 40c5
    	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
    		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
    			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
    		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
    			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
    			MaxPayload 256 bytes, MaxReadReq 512 bytes
    --
    	Kernel driver in use: nvidia
    
    05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
    	Subsystem: NVIDIA Corporation Device 1129
    	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    	Latency: 0, Cache Line Size: 32 bytes
    	Interrupt: pin B routed to IRQ 67
    	Region 0: Memory at f7080000 (32-bit, non-prefetchable) 
    	Capabilities: [60] Power Management version 3
    		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
    		Address: 0000000000000000  Data: 0000
    	Capabilities: [78] Express (v2) Endpoint, MSI 00
    		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
    			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
    		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
    			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
    			MaxPayload 256 bytes, MaxReadReq 512 bytes
    		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
    		LnkCap:	Port #16, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <512ns, L1 <4us
    			ClockPM+ Surprise- LLActRep- BwNot-
    		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk-
    --
    06:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)
    	Subsystem: NVIDIA Corporation Device 0983
    	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    	Latency: 0, Cache Line Size: 32 bytes
    	Interrupt: pin A routed to IRQ 64
    	Region 0: Memory at f8000000 (32-bit, non-prefetchable) 
    	Region 1: Memory at 80000000 (64-bit, prefetchable) 
    	Region 3: Memory at 90000000 (64-bit, prefetchable) 
    	Capabilities: [60] Power Management version 3
    		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    		Address: 00000000fee00000  Data: 4055
    	Capabilities: [78] Express (v2) Endpoint, MSI 00
    		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
    			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
    		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
    			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
    			MaxPayload 256 bytes, MaxReadReq 512 bytes
    		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
    		LnkCap:	Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <512ns, L1 <4us
    --
    09:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)
    	Subsystem: NVIDIA Corporation Device 0983
    	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    	Latency: 0, Cache Line Size: 32 bytes
    	Interrupt: pin A routed to IRQ 63
    	Region 0: Memory at fa000000 (32-bit, non-prefetchable) 
    	Region 1: Memory at c0000000 (64-bit, prefetchable) 
    	Region 3: Memory at d0000000 (64-bit, prefetchable) 
    	Capabilities: [60] Power Management version 3
    		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    		Address: 00000000fee00000  Data: 4045
    	Capabilities: [78] Express (v2) Endpoint, MSI 00
    		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
    			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
    		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
    			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
    			MaxPayload 256 bytes, MaxReadReq 512 bytes
    		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
    		LnkCap:	Port #16, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <512ns, L1 <4us

I guess I should try to disable that nouveau driver. Any other insight would still be appreciated. Thanks!
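One thing the output above does not show is which driver owns the two Teslas: the -A 20 cutoff drops the "Kernel driver in use" line for them. As a sketch, the bound driver can be read straight from sysfs. The function names below are hypothetical; the sketch assumes the usual sysfs layout, where each PCI device directory has a vendor file and, if claimed, a driver symlink. 0x10de is NVIDIA's PCI vendor ID.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting driver binding via sysfs.

# Print the kernel driver bound to one PCI device, or "(none)".
# $1 is a sysfs device directory such as /sys/bus/pci/devices/0000:06:00.0
bound_driver() {
    if [ -L "$1/driver" ]; then
        basename "$(readlink "$1/driver")"
    else
        echo "(none)"
    fi
}

# List every NVIDIA (vendor 0x10de) PCI function with its bound driver.
# $1 is the sysfs PCI device root; it defaults to the real one.
list_nvidia_drivers() {
    root=${1:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            echo "$(basename "$dev") $(bound_driver "$dev")"
        fi
    done
}

# Usage on a live system:
#   list_nvidia_drivers
```

Given the NVRM message about the probe routine not being called for 2 devices, one would expect this to show nouveau rather than nvidia next to 06:00.0 and 09:00.0. Running lspci -k should report the same information.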

Actually… Looking in /etc/modprobe.d, I found this file: nvidia-installer-disable-nouveau.conf

with this as contents:

# generated by nvidia-installer
blacklist nouveau
options nouveau modeset=0

I would have thought that would have disabled nouveau, yet lsmod shows

nouveau              1368064  0 
mxm_wmi                16384  1 nouveau
video                  20480  2 nouveau,asus_wmi
ttm                    94208  1 nouveau
drm_kms_helper        126976  1 nouveau
drm                   344064  6 ttm,drm_kms_helper,nvidia,nouveau
i2c_algo_bit           16384  2 igb,nouveau
wmi                    20480  3 mxm_wmi,nouveau,asus_wmi

Hm.

Read the Linux getting started guide:

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#runfile-nouveau

It explains how to properly disable nouveau. The blacklist file is not sufficient if nouveau is in the initrd image.
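To see whether nouveau is in fact packed into the initrd, the image's file listing can be searched for the module. A minimal sketch, assuming the lsinitramfs tool from Ubuntu's initramfs-tools is available and that the image is at /boot/initrd.img-<kernel version>; listing_has_module is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: success if a file listing on stdin contains the
# kernel module $1, with or without a compression suffix (.ko, .ko.xz, ...).
listing_has_module() {
    grep -q -E "/$1\.ko(\.[a-z0-9]+)?\$"
}

# Usage on a live system (may need root to read /boot):
#   if lsinitramfs "/boot/initrd.img-$(uname -r)" | listing_has_module nouveau; then
#       echo "nouveau is still in the initrd; regenerate it"
#   fi
```

If the check succeeds, regenerating the initrd after the blacklist file is in place (the guide uses update-initramfs -u on Ubuntu) should make it fail on the next run.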

Excellent… Thanks! For anyone else, what I seem to have needed to do was

rm -f /boot/initrd*
update-initramfs -c -k all
update-grub2

Then, when I did nvidia-smi -L I got

GPU 0: Quadro M6000 (UUID: GPU-09446504-6a9e-866a-a65d-0f1d55b7657b)
GPU 1: Tesla K40c (UUID: GPU-e992022a-724f-8f47-e08f-a954053020e6)
GPU 2: Tesla K40c (UUID: GPU-4d14695e-3e43-bf43-a3e3-91190f696d39)

and, /dev now has nvidia0, nvidia1 and nvidia2…

So, all good now. Thanks again!