Hi everybody,
When we try to run two instances of a TensorFlow example on a K80 GPU under Linux, the GPU raises a critical error and we have to shut down and restart the server to recover it.
We ran several tests:
- Just one instance of the program without the CUDA_VISIBLE_DEVICES variable set: it allocates both GPUs, uses only one, and works well.
- Just one instance of the program with CUDA_VISIBLE_DEVICES set to 0 or 1 (indicating which GPU to use): it allocates and uses only the indicated GPU, and works without problems.
- Two instances of the sample program, with CUDA_VISIBLE_DEVICES set to 0 for one and 1 for the other: communication between the host and the GPU card is lost. We have to shut down and restart the server to recover the GPU and the server.
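For anyone who wants to reproduce the third test, here is a minimal sketch of a launcher. It assumes the sample program is a Python script; the helper names `gpu_env` and `launch_instances` are hypothetical and not part of the original setup. Each child process gets its own `CUDA_VISIBLE_DEVICES` value, so each instance should see only one of the two GPUs.

```python
import os
import subprocess
import sys


def gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable at start-up to decide which devices are visible.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_instances(script, gpu_indices=(0, 1)):
    """Start one process per GPU index at the same time; return exit codes."""
    procs = [
        subprocess.Popen([sys.executable, script], env=gpu_env(i))
        for i in gpu_indices
    ]
    # Wait for every instance so that a crash in either one is noticed.
    return [p.wait() for p in procs]
```

Calling `launch_instances("tf_example.py")` (the script name is a placeholder) would correspond to the failing two-instance case, while `launch_instances("tf_example.py", gpu_indices=(0,))` would correspond to the single-instance case that works.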
If anybody has any idea what could be happening, I would appreciate it very much.
Thanks in advance
H.
P.S.: I include the error messages below.
The operating system is Linux kernel 4.9.34, Gentoo distribution, and the NVIDIA driver is 384.59.
The error shown is:
2017-07-31 16:33:04.815885: E tensorflow/stream_executor/cuda/cuda_driver.cc:1098] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN :: No stack trace available
2017-07-31 16:33:04.815980: F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
The message in the console is:
[Jul31 16:33] NVRM: GPU at PCI:0000:83:00: GPU-0694dbcb-6cb3-c7bb-05f8-4185fe20d67c
[ +0.000018] NVRM: GPU Board Serial Number: 0323615069723
[ +0.000004] NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
[ +0.000001] NVRM: GPU at 0000:83:00.0 has fallen off the bus.
[ +0.000015] NVRM: GPU is on Board 0323615069723.
[ +0.000011] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ +0.000011] NVRM: GPU at PCI:0000:84:00: GPU-e405f17d-0c15-c2b1-eaf6-2f50a0c524ec
[ +0.000019] NVRM: GPU Board Serial Number: 0323615069723
[ +0.000002] NVRM: Xid (PCI:0000:84:00): 79, GPU has fallen off the bus.
[ +0.000001] NVRM: GPU at 0000:84:00.0 has fallen off the bus.
[ +0.000019] NVRM: GPU is on Board 0323615069723.
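To check quickly which GPUs reported an Xid event in a log like the one above, a small parser can help. This is only a sketch: the regular expression is an assumption based on the line format shown in this post, and `parse_xid_events` is a hypothetical helper name, not an NVIDIA tool.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<msg>.*)"
)


def parse_xid_events(log_text):
    """Return a list of (pci_address, xid_code, message) tuples."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")),
                 match.group("msg").strip())
            )
    return events
```

Applied to the console output above, this would list one Xid 79 event for each of the two PCI addresses (0000:83:00 and 0000:84:00), that is, both GPUs of the K80 board.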
We set up the BIOS to recognise the GPU memory:
83:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 53
NUMA node: 1
Region 0: Memory at fa000000 (32-bit, non-prefetchable)
Region 1: Memory at 39f800000000 (64-bit, prefetchable)
Region 3: Memory at 39fc00000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee20000 Data: 4034
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
--
Kernel driver in use: nvidia
Kernel modules: nvidia_drm, nvidia
84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 54
NUMA node: 1
Region 0: Memory at f9000000 (32-bit, non-prefetchable)
Region 1: Memory at 39f000000000 (64-bit, prefetchable)
Region 3: Memory at 39f400000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee20000 Data: 4044
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
--
Kernel driver in use: nvidia
Kernel modules: nvidia_drm, nvidia
ff:08.0 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0 (rev 01)
Subsystem: Super Micro Computer Inc Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
ff:08.2 Performance counters: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0 (rev 01)
Subsystem: Super Micro Computer Inc Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Kernel driver in use: bdx_uncore
ff:08.3 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0 (rev 01)
Subsystem: Super Micro Computer Inc Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Capabilities: [40] Express (v1) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag- RBE-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-