MoCo v2 training works on a single GeForce RTX 3090 machine but fails with NaN values on a multi-GPU machine

I can successfully run MoCo v2, as described in its repository, on a machine with a single GeForce RTX 3090 GPU.
However, when I run it on another machine with multiple GeForce RTX 3090 cards, the training loop produces NaN values after some iterations in the first epoch (sometimes near the beginning, sometimes later).
I run exactly the same code (main_moco.py from the repository above) with the same PyTorch version and the same virtual environment.
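
For context, on the multi-GPU machine training goes through the --multiprocessing-distributed / mp.spawn path of main_moco.py, launched roughly as in the repository's MoCo v2 instructions (shown here only as a sketch; the dataset path is a placeholder):

python main_moco.py \
  -a resnet50 \
  --lr 0.03 --batch-size 256 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  --mlp --moco-t 0.2 --aug-plus --cos \
  /path/to/imagenet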

I added the line:

torch.autograd.set_detect_anomaly(mode=True, check_nan=True)

before the training loop, so that a NaN in the loss is raised as an error. The output shows that the output of the network itself is NaN, i.e. the loss did not become NaN because of exploding gradients during back-propagation.
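
To confirm that the NaN already appears in the forward pass, an explicit check can be added right before the loss computation. A minimal sketch (the names output, target, images, model and criterion are assumed to match the train() function of main_moco.py):

# Sketch of an explicit forward-pass NaN check inside train() in main_moco.py,
# placed right before the existing loss = criterion(output, target) line.
output, target = model(im_q=images[0], im_k=images[1])
if torch.isnan(output).any():
    raise RuntimeError("NaN detected in the model output (forward pass)")
loss = criterion(output, target)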

Error

.
.
.
Epoch: [0][ 1240/30451]	Time 15.713 ( 4.406)	Data 11.483 ( 3.431)	Loss 9.3461e+00 (1.0118e+01)	Acc@1  14.06 ( 11.95)	Acc@5  34.38 ( 25.26)
Epoch: [0][ 1250/30451]	Time  0.443 ( 4.397)	Data  0.000 ( 3.423)	Loss 9.2996e+00 (1.0112e+01)	Acc@1  21.09 ( 12.00)	Acc@5  40.62 ( 25.34)
Epoch: [0][ 1260/30451]	Time 12.634 ( 4.396)	Data 11.158 ( 3.422)	Loss 9.2990e+00 (1.0105e+01)	Acc@1  19.53 ( 12.06)	Acc@5  42.97 ( 25.45)
/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in LogSoftmaxBackward0. Traceback of forward call that caused the error:
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/my-user/Projects/moco-v2/main_moco.py", line 290, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/my-user/Projects/moco-v2/main_moco.py", line 328, in train
    loss = criterion(output, target)
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/fx/traceback.py", line 57, in format_stack
    return traceback.format_stack()
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "main_moco.py", line 432, in <module>
    main()
  File "main_moco.py", line 139, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/my-user/Projects/moco-v2/main_moco.py", line 290, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/my-user/Projects/moco-v2/main_moco.py", line 341, in train
    loss.backward()
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.

Environments

Single-GPU machine:

(venv) user@computer:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@computer:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing-extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@computer:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>> 

Multi-GPU machine:

(venv) user@computer:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@computer:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
pkg_resources==0.0.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@computer:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>> torch.cuda.get_device_properties(1)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24258MB, multi_processor_count=82)
>>>