CUDA missing GPU

I installed the CUDA toolkit and it worked properly for a month without any problem. Yesterday one of my 4 GPUs stopped showing up in nvidia-smi. I’m on Ubuntu 16.04.

$ lspci -k | grep -EA2 'VGA|3D' 
    02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 1425
        Kernel driver in use: nvidia
    --
    04:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 1425
        Kernel driver in use: nvidia
    --
    09:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. G200eR2 (rev 01)
        DeviceName: Embedded Video
        Subsystem: Dell G200eR2
    --
    83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 1425
        Kernel driver in use: nvidia
    --
    84:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 1425
        Kernel driver in use: nvidia

$ nvidia-smi
    Tue May 30 11:04:07 2017       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 381.09                 Driver Version: 381.09                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
    | 27%   32C    P8     6W / 180W |      0MiB /  8114MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 1080    On   | 0000:83:00.0     Off |                  N/A |
    | 27%   31C    P8     7W / 180W |      0MiB /  8114MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  GeForce GTX 1080    On   | 0000:84:00.0     Off |                  N/A |
    | 27%   29C    P8     7W / 180W |      0MiB /  8114MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

As you can see, the missing one is the GPU with Bus-ID 02:00.0. I also looked through the syslog for anything relevant but didn’t find anything I could use. Does anyone have a solution?
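The comparison made by eye above (which NVIDIA controller shows up in lspci but not in nvidia-smi) can be sketched in Python. This is only a minimal sketch, not something from the thread: the function names `lspci_nvidia_gpus`, `smi_gpus` and `missing_gpus` are hypothetical, and it assumes the text formats shown in this post (nvidia-smi 381.09, plain `lspci` or `lspci -k` output).

```python
import re

# Matches lines such as "02:00.0 VGA compatible controller: NVIDIA Corporation ..."
LSPCI_GPU_RE = re.compile(
    r"\s*([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+"
    r"(?:VGA compatible controller|3D controller): NVIDIA"
)

# Matches bus IDs with a PCI domain prefix, such as "0000:04:00.0"
SMI_BUS_RE = re.compile(
    r"\b[0-9a-fA-F]{4,8}:([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\b"
)


def lspci_nvidia_gpus(lspci_text):
    """Bus IDs of NVIDIA display controllers found in lspci output."""
    ids = set()
    for line in lspci_text.splitlines():
        m = LSPCI_GPU_RE.match(line)
        if m:
            ids.add(m.group(1).lower())
    return ids


def smi_gpus(smi_text):
    """Bus IDs listed in the nvidia-smi table, with the PCI domain stripped."""
    return {m.group(1).lower() for m in SMI_BUS_RE.finditer(smi_text)}


def missing_gpus(lspci_text, smi_text):
    """GPUs the PCI bus reports but nvidia-smi does not list."""
    return sorted(lspci_nvidia_gpus(lspci_text) - smi_gpus(smi_text))
```

Given the two listings above as strings, `missing_gpus(lspci_text, smi_text)` returns `['02:00.0']`. The Matrox controller is ignored because only NVIDIA devices are matched.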

What is the output of:

dmesg | grep NVRM

?

$ dmesg | grep NVRM
[ 8083.018213] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 8089.605163] NVRM: RmInitAdapter failed! (0x26:0xffff:1093)
[ 8089.605205] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 8096.251299] NVRM: RmInitAdapter failed! (0x26:0xffff:1093)
[ 8096.251374] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 8102.930379] NVRM: RmInitAdapter failed! (0x26:0xffff:1093)
......................;
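When the same pair of NVRM messages repeats many times, it can help to condense them. The following is a rough sketch, assuming the two message formats seen in the output above; `summarize_nvrm` is a hypothetical helper name, and the script only counts what it sees without interpreting the error code.

```python
import re
from collections import Counter

MINOR_RE = re.compile(
    r"rm_init_adapter failed for device bearing minor number (\d+)"
)
CODE_RE = re.compile(r"RmInitAdapter failed! \(([^)]*)\)")


def summarize_nvrm(dmesg_text):
    """Count NVRM init failures per device minor number and per error code."""
    minors = Counter()
    codes = Counter()
    for line in dmesg_text.splitlines():
        if "NVRM:" not in line:
            continue  # only look at messages from the NVIDIA kernel module
        m = MINOR_RE.search(line)
        if m:
            minors[int(m.group(1))] += 1
        c = CODE_RE.search(line)
        if c:
            codes[c.group(1)] += 1
    return minors, codes
```

A single minor number and a single error code across all repetitions, as in the log above, suggests one device failing the same way on every initialization attempt rather than several unrelated failures.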

Thanks a lot. I googled it and found this topic: “https://devtalk.nvidia.com/default/topic/776693/nvrm-rminitadapter-failed-with-gigabyte-gtx-750-on-kubuntu-and-arch/”. He updated his BIOS …

It might be that your system cannot assign resources for 4 VGA devices like this. What is the output of

lspci -vvv |grep -A 20 NVIDIA

?

$ lspci -vvv |grep -A 20 NVIDIA
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 83
	Region 0: Memory at 91000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 3bfe0000000 (64-bit, prefetchable) 
	Region 3: Memory at 3bff0000000 (64-bit, prefetchable) 
	Region 5: I/O ports at 2000 
	[virtual] Expansion ROM at 92080000 [disabled] 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
--
02:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 97
	Region 0: Memory at 92000000 (32-bit, non-prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
--
04:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 101
	Region 0: Memory at 93000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 3bfc0000000 (64-bit, prefetchable) 
	Region 3: Memory at 3bfd0000000 (64-bit, prefetchable) 
	Region 5: I/O ports at 3000 
	[virtual] Expansion ROM at 94080000 [disabled] 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00ab8  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
--
04:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 98
	Region 0: Memory at 94000000 (32-bit, non-prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
--
83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 102
	Region 0: Memory at ca000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 3ffc0000000 (64-bit, prefetchable) 
	Region 3: Memory at 3ffd0000000 (64-bit, prefetchable) 
	Region 5: I/O ports at 9000 
	[virtual] Expansion ROM at cb080000 [disabled] 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00138  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
--
83:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 99
	Region 0: Memory at cb000000 (32-bit, non-prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
--
84:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 103
	Region 0: Memory at c8000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 3ffe0000000 (64-bit, prefetchable) 
	Region 3: Memory at 3fff0000000 (64-bit, prefetchable) 
	Region 5: I/O ports at 8000 
	[virtual] Expansion ROM at c9080000 [disabled] 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00158  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
--
84:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. Device 1425
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 100
	Region 0: Memory at c9000000 (32-bit, non-prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+

All 4 GPUs worked together until this problem appeared. Would reinstalling or upgrading the driver help? Should I move the GPU to a different PCIe slot?

I don’t see any obvious resource assignment problems on the GPU at 02:00.0.
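One way to make that check less manual is to pull the `Region` lines out of the `lspci -vvv` output and flag any whose address is not a plain hex number. This is a sketch under assumptions: the helper names are hypothetical, and it relies on lspci printing a placeholder such as `<unassigned>` or `<ignored>` in place of the address when a BAR has no resources, which is worth confirming against your pciutils version.

```python
import re

DEVICE_RE = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) (.+)$")
REGION_RE = re.compile(r"^\s+Region (\d+): (Memory|I/O ports) at (\S+)")
HEX_RE = re.compile(r"[0-9a-fA-F]+")


def regions_by_device(lspci_vvv_text):
    """Map each bus ID to a list of (region number, kind, address) tuples."""
    devices = {}
    current = None
    for line in lspci_vvv_text.splitlines():
        d = DEVICE_RE.match(line)
        if d:
            # An unindented "bus:dev.fn description" line starts a new device
            current = d.group(1)
            devices[current] = []
            continue
        r = REGION_RE.match(line)
        if r and current is not None:
            devices[current].append((int(r.group(1)), r.group(2), r.group(3)))
    return devices


def suspicious_regions(devices):
    """Keep only regions whose address field is not a hex number."""
    bad = {}
    for dev, regions in devices.items():
        flagged = [r for r in regions if not HEX_RE.fullmatch(r[2])]
        if flagged:
            bad[dev] = flagged
    return bad
```

For the listing above every region has a hex address, so `suspicious_regions` would return an empty dict for 02:00.0, consistent with no obvious assignment problem being visible there.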

I assume you have tried to reboot the system.

I doubt re-installing the driver would make a difference, but you can try it.

There might be a hardware problem with the GPU or with the motherboard. Yes, you could move GPUs around to see if the problem follows the GPU or the slot.
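The swap test boils down to a small decision rule, sketched here in Python. The function name `diagnose_swap` and its return labels are hypothetical, and the rule assumes exactly two cards were exchanged between two slots while everything else stayed put.

```python
def diagnose_swap(missing_before, swapped_with, missing_after):
    """Interpret a two-slot GPU swap.

    missing_before: bus ID that was absent from nvidia-smi before the swap
    swapped_with:   bus ID of the slot the suspect card was moved into
    missing_after:  bus ID absent after the swap, or None if all GPUs appear
    """
    if missing_after is None:
        # Everything is back; reseating alone may have fixed it
        return "resolved"
    if missing_after == missing_before:
        # A different card now fails in the same slot
        return "slot"
    if missing_after == swapped_with:
        # The same card fails in its new slot
        return "gpu"
    return "inconclusive"
```

For example, if the card from 02:00.0 is exchanged with the one at 04:00.0 and 02:00.0 is still missing afterwards, the result is `"slot"`, which points at the motherboard, BIOS, or slot wiring rather than the card.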

You could also study the output of

dmesg

to see if there is any additional indication around these lines:

[ 8083.018213] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 8089.605163] NVRM: RmInitAdapter failed! (0x26:0xffff:1093)

I updated the BIOS and the GPU reappeared.

# nvidia-smi 
Tue Jun  6 14:19:02 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.09                 Driver Version: 381.09                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
| 27%   37C    P0    40W / 180W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:04:00.0     Off |                  N/A |
| 27%   38C    P0    39W / 180W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:83:00.0     Off |                  N/A |
| 27%   37C    P0    40W / 180W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:84:00.0     Off |                  N/A |
|  0%   36C    P0    36W / 180W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks a lot, txbob.