SyncBatchNorm and DDP cause crash

Running DDP with SyncBatchNorm.
The training runs for a couple of batches and then all GPUs fall off the bus.
The training runs fine without SyncBatchNorm.

So far I have reproduced this issue with two models: DeepLabV3 and one other model.

Code

      import os
      import torch
      import torch.distributed as dist
      from torch.nn.parallel import DistributedDataParallel as DDP

      # rank, world_size, device, models, ncls, epochs, trn_dtldr, loss and
      # cleanup() are defined elsewhere in the script.
      os.environ['MASTER_ADDR'] = 'localhost'
      os.environ['MASTER_PORT'] = '123553'
      dist.init_process_group("gloo", rank=rank, world_size=world_size)

      # Convert all BatchNorm layers to SyncBatchNorm before wrapping in DDP;
      # removing this conversion makes the training run fine.
      model = models[model](num_classes=ncls)
      model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
      model = model.cuda(device)
      model = DDP(model, device_ids=[rank], find_unused_parameters=True)
      optimizer = torch.optim.Adam(model.parameters())

      E, B = epochs, len(trn_dtldr)
      for e in range(E):
          for b, batch in enumerate(trn_dtldr):
              x, y = batch['x'].cuda(device), batch['y'].cuda(device)
              py = model(x)

              optimizer.zero_grad()
              l = loss(py, y)
              l.backward()
              optimizer.step()
      cleanup()
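
For what it's worth, the main thing SyncBatchNorm adds over plain BatchNorm is a per-step synchronization of the batch statistics across ranks. A bare collective loop along these lines (a minimal sketch: the gloo backend mirrors the snippet above, while the port, tensor size, and step count are arbitrary placeholders) exercises that communication pattern without any model and could show whether inter-GPU collectives alone are enough to knock the GPUs off the bus:

      import os
      import torch
      import torch.distributed as dist
      import torch.multiprocessing as mp

      def allreduce_stress(rank, world_size, steps=5000):
          # Same process-group setup as the training script above;
          # the port here is just a placeholder.
          os.environ['MASTER_ADDR'] = 'localhost'
          os.environ['MASTER_PORT'] = '29500'
          dist.init_process_group("gloo", rank=rank, world_size=world_size)

          # Repeatedly all-reduce a small CUDA tensor, roughly mimicking the
          # per-step statistics sync that SyncBatchNorm adds.
          for _ in range(steps):
              t = torch.randn(1024, device=rank)
              dist.all_reduce(t)
              torch.cuda.synchronize(rank)

          dist.destroy_process_group()

      if __name__ == "__main__":
          world_size = torch.cuda.device_count()
          mp.spawn(allreduce_stress, args=(world_size,), nprocs=world_size, join=True)

If a loop like this already reproduces the bus drop, that would point at the communication/hardware path rather than at SyncBatchNorm itself.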

In the error log, the crash appears at Aug 26 15:28:16.
nvidia-bug-report.log.gz (2.7 MB)
Error Log Excerpt

Aug 26 15:27:32 ailcm-ai1 kernel: [87430.975714] nvidia-uvm: Loaded the UVM driver, major device number 511.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832707] NVRM: GPU at PCI:0000:17:00: GPU-f114b259-1fd6-a53b-8882-7d59735e0271
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832733] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832736] NVRM: Xid (PCI:0000:17:00): 79, pid=24953, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850885] NVRM: GPU 0000:17:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850886] NVRM: GPU 0000:17:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: A GPU crash dump has been created. If possible, please run
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: nvidia-bug-report.sh as root to collect this data before
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: the NVIDIA kernel module is unloaded.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851013] NVRM: GPU at PCI:0000:65:00: GPU-ec0ca56f-dbab-4d7f-8341-0bfd7d5f8a24
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851016] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851019] NVRM: Xid (PCI:0000:65:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851117] NVRM: GPU 0000:65:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851119] NVRM: GPU 0000:65:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852082] NVRM: GPU at PCI:0000:18:00: GPU-499e9a48-65fb-b44f-21ef-1475bafeba1e
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852083] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852085] NVRM: Xid (PCI:0000:18:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852179] NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852181] NVRM: GPU 0000:18:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852187] NVRM: GPU at PCI:0000:b4:00: GPU-de26af30-1122-e55a-1bb5-ed7732106434
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852189] NVRM: GPU Board Serial Number: 
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852191] NVRM: Xid (PCI:0000:b4:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852286] NVRM: GPU 0000:b4:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852287] NVRM: GPU 0000:b4:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.937436] nvidia-gpu 0000:b4:00.3: Refused to change power state, currently in D3
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.997587] nvidia-gpu 0000:18:00.3: Refused to change power state, currently in D3
Aug 26 15:28:16 ailcm-ai1 kernel: [87475.057545] nvidia-gpu 0000:65:00.3: Refused to change power state, currently in D3
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298279] nvidia-gpu 0000:18:00.3: i2c timeout error ffffffff
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298427] nvidia-gpu 0000:b4:00.3: i2c timeout error ffffffff
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298570] nvidia-gpu 0000:65:00.3: i2c timeout error ffffffff

nvidia-smi

Unable to determine the device handle for GPU 0000:17:00.0: GPU is lost. Reboot the system to recover this GPU

Environment


Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] Could not collect

This is a cross-post from here.

I suggested collecting the output of nvidia-bug-report.sh and creating a topic on this board so that we could track the issue.
@generix, do you have any idea what the potential root cause might be, based on the attached log?
I don’t think it’s related to PyTorch; it looks more like a hardware/driver/temperature/PSU issue, but I don’t know how to further isolate it.
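
One way to narrow it down further might be to log temperature, power draw, and utilization once per second while the training runs, so the values right before the Xid 79 at 15:28:16 can be checked afterwards. For example, a small polling wrapper around nvidia-smi (a sketch; the field list, interval, and log path are arbitrary choices):

      import subprocess

      # Poll GPU temperature, power draw and utilization once per second and
      # write the readings to a CSV, so the values just before the crash can
      # be inspected afterwards. Run this alongside the training job and stop
      # it with Ctrl-C (or kill) after the GPUs drop off the bus.
      cmd = [
          "nvidia-smi",
          "--query-gpu=timestamp,index,temperature.gpu,power.draw,utilization.gpu",
          "--format=csv,noheader",
          "-l", "1",
      ]
      with open("gpu_monitor.csv", "w") as f:  # hypothetical log path
          subprocess.run(cmd, stdout=f)

If the temperatures look fine but several GPUs hit peak power draw at the same moment right before the crash, that would support the PSU theory.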