Running DDP with SyncBatchNorm.
Training runs for a couple of batches and then all GPUs fall off the bus.
Training runs fine without SyncBatchNorm.
So far the issue occurs in both models I have tested, deeplabv3 and one other model.
Code
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# `models`, `ncls`, `epochs`, `trn_dtldr`, `loss`, `device`, and `cleanup` are defined elsewhere.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '123553'  # note: valid TCP ports are <= 65535
dist.init_process_group("gloo", rank=rank, world_size=world_size)

model = models[model](num_classes=ncls)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = model.cuda(device)
model = DDP(model, device_ids=[rank], find_unused_parameters=True)
optimizer = torch.optim.Adam(model.parameters())

E, B = epochs, len(trn_dtldr)
for e in range(E):
    for b, batch in enumerate(trn_dtldr):
        x, y = batch['x'].cuda(device), batch['y'].cuda(device)
        py = model(x)
        optimizer.zero_grad()
        l = loss(py, y)
        l.backward()
        optimizer.step()
cleanup()
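For context, the excerpt above is the per-rank training routine; the launch code and the cleanup() helper are not shown in the post. A minimal sketch of how such a routine is typically spawned, assuming the excerpt is wrapped in a function train(rank, world_size) and that cleanup() simply tears down the process group:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def cleanup():
    # Assumed implementation: destroy the default process group.
    dist.destroy_process_group()

if __name__ == "__main__":
    # One process per visible GPU; spawn passes the rank as the first argument.
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)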
In the error log, the failure appears at Aug 26 15:28:16.
nvidia-bug-report.log.gz (2.7 MB)
Error Log Excerpt
Aug 26 15:27:32 ailcm-ai1 kernel: [87430.975714] nvidia-uvm: Loaded the UVM driver, major device number 511.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832707] NVRM: GPU at PCI:0000:17:00: GPU-f114b259-1fd6-a53b-8882-7d59735e0271
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832733] NVRM: GPU Board Serial Number:
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.832736] NVRM: Xid (PCI:0000:17:00): 79, pid=24953, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850885] NVRM: GPU 0000:17:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850886] NVRM: GPU 0000:17:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: A GPU crash dump has been created. If possible, please run
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: nvidia-bug-report.sh as root to collect this data before
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.850896] NVRM: the NVIDIA kernel module is unloaded.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851013] NVRM: GPU at PCI:0000:65:00: GPU-ec0ca56f-dbab-4d7f-8341-0bfd7d5f8a24
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851016] NVRM: GPU Board Serial Number:
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851019] NVRM: Xid (PCI:0000:65:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851117] NVRM: GPU 0000:65:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.851119] NVRM: GPU 0000:65:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852082] NVRM: GPU at PCI:0000:18:00: GPU-499e9a48-65fb-b44f-21ef-1475bafeba1e
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852083] NVRM: GPU Board Serial Number:
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852085] NVRM: Xid (PCI:0000:18:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852179] NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852181] NVRM: GPU 0000:18:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852187] NVRM: GPU at PCI:0000:b4:00: GPU-de26af30-1122-e55a-1bb5-ed7732106434
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852189] NVRM: GPU Board Serial Number:
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852191] NVRM: Xid (PCI:0000:b4:00): 79, pid=24955, GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852286] NVRM: GPU 0000:b4:00.0: GPU has fallen off the bus.
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.852287] NVRM: GPU 0000:b4:00.0: GPU is on Board .
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.937436] nvidia-gpu 0000:b4:00.3: Refused to change power state, currently in D3
Aug 26 15:28:16 ailcm-ai1 kernel: [87474.997587] nvidia-gpu 0000:18:00.3: Refused to change power state, currently in D3
Aug 26 15:28:16 ailcm-ai1 kernel: [87475.057545] nvidia-gpu 0000:65:00.3: Refused to change power state, currently in D3
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298279] nvidia-gpu 0000:18:00.3: i2c timeout error ffffffff
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298427] nvidia-gpu 0000:b4:00.3: i2c timeout error ffffffff
Aug 26 15:28:17 ailcm-ai1 kernel: [87476.298570] nvidia-gpu 0000:65:00.3: i2c timeout error ffffffff
nvidia-smi
Unable to determine the device handle for GPU 0000:17:00.0: GPU is lost. Reboot the system to recover this GPU
Environment
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] Could not collect