After I ran my training code on more than one GPU, nvidia-smi no longer shows any output.
When I checked the syslog, I found the following messages:
nvidia-nvlink: IBMNPU: An interrupt has occurred on NPU device
NVRM: GPU at PCI:xxxxx
Xid xxxx GPU has fallen off the bus
EEH: Frozen PE#0 on PHB#2 detected
Notify device drivers to shutdown
EEH: Unable to recover from failure from PHB#2-PE#0
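
In case anyone wants to check their own logs for the same messages, here is a minimal Python sketch that filters them out of a log file. The function names (filter_gpu_errors, scan_file) and the pattern list are my own and only cover the messages quoted above; the log path /var/log/syslog is an assumption and may differ on your distribution.

```python
import re

# Patterns based only on the messages quoted above; illustrative, not exhaustive.
PATTERNS = [
    re.compile(r"nvidia-nvlink: IBMNPU"),
    re.compile(r"NVRM:"),
    re.compile(r"\bXid\b"),
    re.compile(r"\bEEH:"),
]


def filter_gpu_errors(lines):
    """Return the lines that match any pattern, with trailing newlines removed."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(p.search(line) for p in PATTERNS)
    ]


def scan_file(path="/var/log/syslog"):
    """Read a log file (assumed path) and return the matching lines."""
    with open(path, errors="replace") as f:
        return filter_gpu_errors(f)
```

Calling scan_file() right after the failure, usually as root, should list the same NVLink, Xid and EEH lines if the fault is the same one.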
Training on a single GPU with TensorFlow works fine.
The problem occurs when I use more than one GPU with PyTorch.
Can anyone suggest a solution?
NVIDIA driver version: 440.64.00