NVRM: RmInitAdapter failed! and one GPU missing (Ubuntu 16.04 with 2 x 1080ti)

Hi, I have a system with two GPUs. Both were working fine, but recently this error appeared in the system log, and now only one GPU is visible.

NVRM: RmInitAdapter failed! (0x26:0xffff:1123)
NVRM: rm_init_adapter failed for device bearing minor number 1

Here are some diagnostics. Are there others that I could run? Any information is appreciated. Thanks.

17:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. Device 1470
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 33
	Region 0: Memory at c4000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 387fe0000000 (64-bit, prefetchable) 
	Region 3: Memory at 387ff0000000 (64-bit, prefetchable) 
	Region 5: I/O ports at 7000 
	[virtual] Expansion ROM at c5000000 [disabled] 
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_396, nvidia_396_drm
--
65:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. Device 1470
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 34
	Region 0: Memory at df000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 38bfe0000000 (64-bit, prefetchable) 
	Region 3: Memory at 38bff0000000 (64-bit, prefetchable) 
	Region 5: I/O ports at b000 
	[virtual] Expansion ROM at e0000000 [disabled] 
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_396, nvidia_396_drm

~ nvidia-smi
Fri Aug  3 15:00:16 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.45                 Driver Version: 396.45                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 23%   29C    P0    56W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvidia-bug-report.log.gz (2.24 MB)

I found elsewhere that these boot parameters were helpful, and indeed they were, but that only lasted one reboot. It would be nice to know what they do, to see whether they could compromise the system.

pcie_aspm=off rcutree.rcu_idle_gp_delay=1

pcie_aspm=off does nothing in your case, as your system’s BIOS doesn’t support ASPM anyway.
rcutree.rcu_idle_gp_delay=1 lets the RCU mechanism wake up the CPU earlier. It has minimal impact on energy efficiency and shouldn’t be necessary on a modern kernel.
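
If it is unclear whether the two parameters were still being passed on the later boots, a minimal sketch like the one below can check the running kernel's command line. The helper name missing_params is made up for illustration; /proc/cmdline is the standard place where Linux exposes the boot command line.

```python
from pathlib import Path

# The two parameters quoted in this thread.
WANTED = ("pcie_aspm=off", "rcutree.rcu_idle_gp_delay=1")


def missing_params(cmdline, wanted=WANTED):
    """Return the entries of `wanted` that are absent from a kernel command line.

    Hypothetical helper for illustration, not part of any NVIDIA or kernel tool.
    """
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # command line of the running kernel
    if proc_cmdline.exists():
        missing = missing_params(proc_cmdline.read_text())
        if missing:
            print("not active on this boot:", " ".join(missing))
        else:
            print("both parameters are active on this boot")
```

If a parameter turns out to be missing, the usual way to make it persistent on Ubuntu is to add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then run sudo update-grub.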
You should check whether the slot is working properly: reseat the card or move it to a different slot.
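
One way to get a hint about the slot before opening the case is to compare the PCIe link each card negotiated (LnkSta) with what it is capable of (LnkCap). The sketch below is only an illustration: link_info and query are made-up helper names, the regular expression assumes the usual "Speed ..., Width ..." layout of lspci -vv, and it has to be run as root, because without root the capabilities show up as <access denied>, as in the output posted above.

```python
import re
import shutil
import subprocess

# Matches "LnkSta: Speed 8GT/s, Width x16" as well as the newer
# "LnkSta: Speed 8GT/s (ok), Width x16 (ok)" form (assumed lspci layout).
LINK_RE = re.compile(
    r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+GT/s)[^,]*,\s*Width\s+(x\d+)"
)


def link_info(lspci_text):
    """Map 'LnkCap' / 'LnkSta' to (speed, width) as found in lspci -vv text."""
    return {kind: (speed, width)
            for kind, speed, width in LINK_RE.findall(lspci_text)}


def query(bus_id):
    """Run lspci -vv for one device and extract its link information."""
    result = subprocess.run(["lspci", "-vv", "-s", bus_id],
                            stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return link_info(result.stdout)


if __name__ == "__main__" and shutil.which("lspci"):
    # Bus IDs taken from the lspci output in the first post.
    for bus_id in ("17:00.0", "65:00.0"):
        print(bus_id, query(bus_id))
```

An idle GPU may report a lower link speed than its maximum, so the width is probably the more telling value. If the failing card shows a narrower link than the working one, or a different width after reseating, that would point towards the slot or the seating rather than the driver.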