I successfully ran MoCo v2 as described in its repository on a machine with a single GeForce RTX 3090 GPU.
However, when I run it on another machine with multiple GeForce RTX 3090 cards, the training loop produces NaN values at some point during the first epoch (sometimes near the beginning, sometimes later).
I run exactly the same code (main_moco.py from the repo above) with the same PyTorch version and the same virtual environment.
I added the line:
torch.autograd.set_detect_anomaly(mode=True, check_nan=True)
before the training loop in order to treat a NaN loss as an error. From the output below it appears that the network's forward output is already NaN; it did not become NaN because of exploding gradients during back-propagation.
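For reference, the kind of check I mean can be sketched as follows. This is a minimal, self-contained example, not code from main_moco.py; `assert_finite` and the random `logits` tensor are hypothetical stand-ins for the network output inside the training loop:

```python
import torch
import torch.nn.functional as F

# Treat anomalies in the backward pass as hard errors
# (the check_nan flag exists only in newer PyTorch; 1.13 takes just `mode`).
torch.autograd.set_detect_anomaly(True)

def assert_finite(name, tensor):
    """Fail fast if a forward output already contains NaN/Inf."""
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN or Inf values")

# Hypothetical stand-in for the model's output inside the training loop:
logits = torch.randn(4, 10, requires_grad=True)
assert_finite("logits", logits)  # raises before the loss if the forward pass broke
loss = F.cross_entropy(logits, torch.tensor([1, 2, 3, 4]))
loss.backward()  # anomaly mode would flag NaNs produced here instead
```

Checking the forward output separately like this is what distinguishes "the network already emits NaN" from "gradients explode in backward".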
Error
...
Epoch: [0][ 1240/30451] Time 15.713 ( 4.406) Data 11.483 ( 3.431) Loss 9.3461e+00 (1.0118e+01) Acc@1 14.06 ( 11.95) Acc@5 34.38 ( 25.26)
Epoch: [0][ 1250/30451] Time 0.443 ( 4.397) Data 0.000 ( 3.423) Loss 9.2996e+00 (1.0112e+01) Acc@1 21.09 ( 12.00) Acc@5 40.62 ( 25.34)
Epoch: [0][ 1260/30451] Time 12.634 ( 4.396) Data 11.158 ( 3.422) Loss 9.2990e+00 (1.0105e+01) Acc@1 19.53 ( 12.06) Acc@5 42.97 ( 25.45)
/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in LogSoftmaxBackward0. Traceback of forward call that caused the error:
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
return self._bootstrap(parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/my-user/Projects/moco-v2/main_moco.py", line 290, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/my-user/Projects/moco-v2/main_moco.py", line 328, in train
loss = criterion(output, target)
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "main_moco.py", line 432, in <module>
main()
File "main_moco.py", line 139, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/my-user/Projects/moco-v2/main_moco.py", line 290, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/my-user/Projects/moco-v2/main_moco.py", line 341, in train
loss.backward()
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/my-user/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
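To localize where the NaN first appears in the forward pass, I could also register forward hooks on every module. This is a debugging sketch under my own assumptions, not code from the repo, and the tiny `nn.Sequential` is only an illustrative stand-in for the MoCo encoder:

```python
import torch
import torch.nn as nn

def install_nan_hooks(model):
    """Register forward hooks that raise on the first module whose
    output contains NaN/Inf, to pinpoint where the forward pass breaks."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"NaN/Inf first observed in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Tiny illustrative model (not the actual MoCo encoder):
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
install_nan_hooks(model)
out = model(torch.randn(2, 8))  # finite input, so no hook fires
```

Child-module hooks fire before the container's hook, so the raised module name identifies the earliest layer producing non-finite values.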
Environments
Single GPU machine:
(venv) user@computer:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@computer:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing-extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@computer:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_properties()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: get_device_properties() missing 1 required positional argument: 'device'
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>>
Multi GPU machine:
(venv) user@computer:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@computer:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
pkg_resources==0.0.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@computer:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count
<functools._lru_cache_wrapper object at 0x7ff2afb44040>
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>> torch.cuda.get_device_properties(1)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24258MB, multi_processor_count=82)
>>>