SyncBatchNorm and DDP cause crash

Running DDP with SyncBatchNorm.
The training runs for a couple of batches and then all GPUs fall off the bus.
The training runs fine without SyncBatchNorm.

So far I have reproduced this issue with two models: DeepLabV3 and one other model.

Code

      import os
      import torch
      import torch.distributed as dist
      from torch.nn.parallel import DistributedDataParallel as DDP

      # rank, world_size, device, models, ncls, epochs, trn_dtldr, loss and
      # cleanup() are defined elsewhere in the script.
      os.environ['MASTER_ADDR'] = 'localhost'
      os.environ['MASTER_PORT'] = '123553'
      dist.init_process_group("gloo", rank=rank, world_size=world_size)

      # Convert all BatchNorm layers to SyncBatchNorm before wrapping in DDP;
      # removing this conversion makes the training run fine.
      model = models[model](num_classes=ncls)
      model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
      model = model.cuda(device)
      model = DDP(model, device_ids=[rank], find_unused_parameters=True)
      optimizer = torch.optim.Adam(model.parameters())

      E, B = epochs, len(trn_dtldr)
      for e in range(E):
          for b, batch in enumerate(trn_dtldr):
              x, y = batch['x'].cuda(device), batch['y'].cuda(device)
              py = model(x)

              optimizer.zero_grad()
              l = loss(py, y)
              l.backward()
              optimizer.step()
      cleanup()
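
For what it's worth, the main thing SyncBatchNorm adds over plain BatchNorm is a per-step synchronization of the batch statistics across ranks. A bare collective loop along these lines (a minimal sketch: the gloo backend mirrors the snippet above, while the port, tensor size, and step count are arbitrary placeholders) exercises that communication pattern without any model and could show whether inter-GPU collectives alone are enough to knock the GPUs off the bus:

      import os
      import torch
      import torch.distributed as dist
      import torch.multiprocessing as mp

      def allreduce_stress(rank, world_size, steps=5000):
          # Same process-group setup as the training script above;
          # the port here is just a placeholder.
          os.environ['MASTER_ADDR'] = 'localhost'
          os.environ['MASTER_PORT'] = '29500'
          dist.init_process_group("gloo", rank=rank, world_size=world_size)

          # Repeatedly all-reduce a small CUDA tensor, roughly mimicking the
          # per-step statistics sync that SyncBatchNorm adds.
          for _ in range(steps):
              t = torch.randn(1024, device=rank)
              dist.all_reduce(t)
              torch.cuda.synchronize(rank)

          dist.destroy_process_group()

      if __name__ == "__main__":
          world_size = torch.cuda.device_count()
          mp.spawn(allreduce_stress, args=(world_size,), nprocs=world_size, join=True)

If a loop like this already reproduces the bus drop, that would point at the communication/hardware path rather than at SyncBatchNorm itself.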

In the error log, the crash appears at Aug 26 15:28:16.
nvidia-bug-report.log.gz (2.7 MB)
Error Log Excerpt

Aug 26 15:27:32 ailcm-ai1 kernel: [87430.975714] nvidia-uvm: Loaded the UVM driver, major device number 511.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832707] NVRM: GPU at PCI:0000:17:00: GPU-f114b259-1fd6-a53b-8882-7d59735e0271
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832733] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832736] NVRM: Xid (PCI:0000:17:00): 79, pid=24953, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850885] NVRM: GPU 0000:17:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850886] NVRM: GPU 0000:17:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: A GPU crash dump has been created. If possible, please run
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: nvidia-bug-report.sh as root to collect this data before
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: the NVIDIA kernel module is unloaded.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851013] NVRM: GPU at PCI:0000:65:00: GPU-ec0ca56f-dbab-4d7f-8341-0bfd7d5f8a24
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851016] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851019] NVRM: Xid (PCI:0000:65:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851117] NVRM: GPU 0000:65:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851119] NVRM: GPU 0000:65:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852082] NVRM: GPU at PCI:0000:18:00: GPU-499e9a48-65fb-b44f-21ef-1475bafeba1e
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852083] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852085] NVRM: Xid (PCI:0000:18:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852179] NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852181] NVRM: GPU 0000:18:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852187] NVRM: GPU at PCI:0000:b4:00: GPU-de26af30-1122-e55a-1bb5-ed7732106434
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852189] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852191] NVRM: Xid (PCI:0000:b4:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852286] NVRM: GPU 0000:b4:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852287] NVRM: GPU 0000:b4:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.937436] nvidia-gpu 0000:b4:00.3: Refused to change power state, currently in D3
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.997587] nvidia-gpu 0000:18:00.3: Refused to change power state, currently in D3
Aug 26 15:28:16 ailcm-ai1 kernel: [87475.057545] nvidia-gpu 0000:65:00.3: Refused to change power state, currently in D3
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298279] nvidia-gpu 0000:18:00.3: i2c timeout error ffffffff
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298427] nvidia-gpu 0000:b4:00.3: i2c timeout error ffffffff
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298570] nvidia-gpu 0000:65:00.3: i2c timeout error ffffffff

nvidia-smi

Unable to determine the device handle for GPU 0000:17:00.0: GPU is lost. Reboot the system to recover this GPU

Environment


Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] Could not collect

This is a cross-post from here.

I suggested collecting the output of nvidia-bug-report.sh and creating a topic on this board so that we could track the issue.
@generix, do you have any idea what the potential root cause might be, based on the attached log?
I don’t think it’s related to PyTorch; it looks more like a hardware/driver/temperature/PSU issue, but I don’t know how to further isolate it.
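
One way to narrow it down further might be to log temperature, power draw, and utilization once per second while the training runs, so the values right before the Xid 79 at 15:28:16 can be checked afterwards. For example, a small polling wrapper around nvidia-smi (a sketch; the field list, interval, and log path are arbitrary choices):

      import subprocess

      # Poll GPU temperature, power draw and utilization once per second and
      # write the readings to a CSV, so the values just before the crash can
      # be inspected afterwards. Run this alongside the training job and stop
      # it with Ctrl-C (or kill) after the GPUs drop off the bus.
      cmd = [
          "nvidia-smi",
          "--query-gpu=timestamp,index,temperature.gpu,power.draw,utilization.gpu",
          "--format=csv,noheader",
          "-l", "1",
      ]
      with open("gpu_monitor.csv", "w") as f:  # hypothetical log path
          subprocess.run(cmd, stdout=f)

If the temperatures look fine but several GPUs hit peak power draw at the same moment right before the crash, that would support the PSU theory.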