simpleP2P test reboots the system on a Lenovo nx360 M5 with 2 M40 GPUs

idle · February 23, 2017, 5:52am

My system is a Lenovo nx360 M5 with two M40 GPUs, running CentOS 7 and CUDA 8.0 (driver 375.39). When I train on both GPUs with TensorFlow, the system reboots, but training on a single GPU runs fine. I suspected a peer-to-peer memory access issue, so I ran simpleP2P, and it also reboots the system. Can anyone help?

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = " Tesla M40" IS capable of Peer-to-Peer (P2P)
GPU1 = " Tesla M40" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer access from Tesla M40 (GPU0) → Tesla M40 (GPU1) : Yes
Peer access from Tesla M40 (GPU1) → Tesla M40 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
Tesla M40 (GPU0) supports UVA: Yes
Tesla M40 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…

Robert_Crovella · February 23, 2017, 3:23pm

Does your lenovo nx360m5 have the latest system firmware in it?

If not, can you update to the latest firmware and try the test again?

rer · June 13, 2017, 12:49pm

I’m having the same problem on a Lenovo nx360m5 with four K80s. It has the latest firmware, CUDA Toolkit 8.0, driver version 375.66, and RHEL 7.2. When running simpleP2P, the system reboots whenever a transfer has to go through PXB.
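
For anyone trying to work out which GPU pairs are affected, here is a minimal Python sketch that lists the pairs whose link type is PXB in the matrix printed by `nvidia-smi topo -m`. The function names (`read_topology`, `parse_topo_matrix`, `pairs_via`) are made up for this example, and the `SAMPLE` matrix is a hypothetical layout for illustration only, not output from either machine in this thread. It assumes the GPU columns come first in the header row and strips the terminal escape codes that some driver versions put around the header.

```python
import re
import subprocess

# Hypothetical 4-GPU matrix, for illustration only (not from this thread).
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PIX     PXB     PXB     0-13
GPU1    PIX      X      PXB     PXB     0-13
GPU2    PXB     PXB      X      PIX     0-13
GPU3    PXB     PXB     PIX      X      0-13
"""


def read_topology():
    """Return the text printed by `nvidia-smi topo -m` on this machine."""
    return subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout


def parse_topo_matrix(text):
    """Parse the topology matrix into {(row_gpu, col_gpu): link_type}."""
    # Drop ANSI colour/underline escapes that may wrap the header row.
    text = re.sub(r"\x1b\[[0-9;]*m", "", text)
    links = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("GPU"):
            continue  # blank lines, legend, non-GPU rows
        if columns is None:
            # First GPU line is the header: keep only the GPU columns.
            columns = [f for f in fields if f.startswith("GPU")]
            continue
        row = fields[0]
        # Assumes GPU columns come first, so they align with fields[1:].
        for col, link in zip(columns, fields[1:]):
            links[(row, col)] = link
    return links


def pairs_via(links, link_type="PXB"):
    """Return the sorted, de-duplicated GPU pairs connected by link_type."""
    return sorted(
        {tuple(sorted((a, b))) for (a, b), link in links.items()
         if link == link_type and a != b}
    )
```

On a real system, `pairs_via(parse_topo_matrix(read_topology()))` gives the pairs to compare against the ones that trigger the reboot in simpleP2P; `pairs_via(parse_topo_matrix(SAMPLE))` runs the same logic on the illustrative matrix without needing a GPU.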

Robert_Crovella · June 13, 2017, 2:39pm

You may want to escalate with Lenovo.