NVIDIA Developer Forums

Frozen device, nvidia-smi no response (blocked for more than 120 seconds)

labohemekim May 13, 2020, 8:29am 1

After I ran my training code on more than one GPU, nvidia-smi stopped responding and no longer shows any output.
When I checked the syslog, I found the following messages:

nvidia-nvlink: IBMNPU: An interrupt has occurred on NPU device,
…
NVRM: GPU at PCI:xxxxx
Xid xxxx GPU has fallen off the bus
EEH: Frozen PE#0 on PHB#2 detected
…
Notify device drivers to shutdown
EEH: Unable to recover from failure from PHB#2-PE#0

Everything works when I run the training code on a single GPU with TensorFlow.
The problem occurs only when I use more than one GPU with PyTorch.

Can anyone suggest a solution?
NVIDIA driver: 440.64.00
CUDA 10.2
Tesla P100

generix May 13, 2020, 8:34am 2

Xid 79 is most often caused by either overheating or an insufficient power supply. Please monitor temperatures, check and reseat the power cables, and check or replace the PSU.
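As a rough way to follow the "monitor temperatures" advice, here is a minimal Python sketch that polls nvidia-smi for temperature and power draw. It assumes nvidia-smi is on the PATH and still responds; the helper names `parse_gpu_csv` and `read_gpus` are made up for illustration. Values are kept as strings because some boards report `[N/A]` for certain fields.

```python
import csv
import io
import shutil
import subprocess

# Standard nvidia-smi --query-gpu field names.
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "power.limit"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(QUERY_FIELDS, (f.strip() for f in rec))))
    return rows


def read_gpus(timeout_s=10):
    """Run nvidia-smi once; return [] if it is missing, fails, or times out.

    Caveat: if the driver is already wedged, as in this thread, the child
    process may be unkillable and this call can still block.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        ).stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)


if __name__ == "__main__":
    for gpu in read_gpus():
        print(
            f"GPU {gpu['index']}: {gpu['temperature.gpu']} C, "
            f"{gpu['power.draw']} W of {gpu['power.limit']} W"
        )
```

Running this in a loop (for example once per second) while the multi-GPU job starts up may show whether temperature or power draw climbs just before the GPU drops off the bus.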

labohemekim May 13, 2020, 8:40am 3

Wow, even though I didn’t post the full log, you already suggested a solution.
Some of the log details follow:
NVRM: Xid (PCI:…) 79, pid, GPU has fallen off the bus…
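For anyone who wants to pull the Xid codes out of a long syslog, a small Python sketch follows. It assumes the usual kernel log shape `NVRM: Xid (PCI:<address>): <code>, ...`; the function name `find_xid_events` and the sample line in the comment are made up for illustration, not taken from the log above.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0004:04:00): 79, pid=1234, ..."
# (the address and pid in this comment are invented sample values).
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[^)]*)\):?\s*(?P<xid>\d+)")


def find_xid_events(lines):
    """Return (pci_address, xid_code) tuples for every Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("xid"))))
    return events
```

It accepts any iterable of lines, so an open log file can be passed in directly, for example `find_xid_events(open("/var/log/syslog"))`; that path is an assumption and varies by distribution.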