Hi All,
I'm following the steps from the below link.
I'm training an SSD-Mobilenet model on the Bosch Small Traffic Lights Dataset.
While training, my Avg Loss decreases slowly, but then it suddenly becomes NaN. I have already tried the following, but the issue still persists:
- Followed the suggestions from the related thread "Error training with jetson-inference".
- Verified the images' XML annotation files, and they look fine. Sometimes I don't get any NaN value at all during epoch 0.
- Tuned the learning rate, i.e. 0.01, 0.001, 0.0001, etc.
- Used the Adam optimizer (roughly as in the sketch below).
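For the optimizer change, I just swapped the SGD line in train_ssd.py, roughly like this (a sketch from memory, not my exact code; params, args.lr, and args.weight_decay are the variables the script already passes to its SGD optimizer):

```python
import torch

# Original optimizer built by train_ssd.py (roughly):
# optimizer = torch.optim.SGD(params, lr=args.lr, momentum=args.momentum,
#                             weight_decay=args.weight_decay)

# What I swapped in to test Adam, keeping the same parameter groups and learning rate:
optimizer = torch.optim.Adam(params, lr=args.lr, weight_decay=args.weight_decay)
```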
After enabling PyTorch's anomaly detection, i.e. torch.autograd.set_detect_anomaly(True), I was able to find the instance and source of the NaN. With further debugging, I observed that one of the box locations in gt_locations contains NaN values (please refer to the following log):
image_id: 481834
predicted_locations: tensor([[ 1.4837, 1.2564, -6.5235, -2.5821],
[ 0.6447, 0.8457, -16.9513, -11.4073],
[ 2.0294, 0.9745, -15.5438, -14.0698],
[ 1.8593, 1.0754, -15.8804, -14.4709],
[ 2.0474, 1.3663, -15.7238, -14.4092]],
grad_fn=)
gt_locations: tensor([[ 25.0286, 15.6667, nan, nan],
[ 4.0797, 2.3779, -13.1398, -8.8714],
[ 4.1841, 2.5611, -14.6530, -13.4025],
[ 2.0534, 0.6725, -13.3843, -12.9900],
[ 3.5518, 0.3255, -14.6399, -13.4983]])
regression_loss: nan | classification_loss: 3.4250411987304688 | loss: nan
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in SmoothL1LossBackward0. Traceback of forward call that caused the error:
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 148, in train
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 45, in forward
smooth_l1_loss = F.smooth_l1_loss(predicted_locations, gt_locations, size_average=False)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3188, in smooth_l1_loss
return torch._C._nn.smooth_l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction), beta)
(Triggered internally at …/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 153, in train
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'SmoothL1LossBackward0' returned nan values in its 0th output.
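For context, this is roughly the debug check I added (a sketch, not my exact code; the only real API call is torch.autograd.set_detect_anomaly, the helper below is my own addition, and boxes is the ground-truth tensor already used in train() of train_ssd.py):

```python
import torch

# Enable anomaly detection so loss.backward() reports the forward op that produced the NaN
torch.autograd.set_detect_anomaly(True)

def gt_has_nan(boxes: torch.Tensor, batch_idx: int) -> bool:
    """Return True if the ground-truth boxes of this batch contain NaN or Inf values."""
    bad = bool(torch.isnan(boxes).any() or torch.isinf(boxes).any())
    if bad:
        print(f"batch {batch_idx}: NaN/Inf found in ground-truth boxes:")
        print(boxes)
    return bad
```

I call this on boxes right before the criterion(confidence, locations, labels, boxes) line in train() (line 148 in the traceback above).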
I think TrainAugmentation is causing this issue, but I'm not sure. To verify that, I want to disable image augmentation.
Can anyone please suggest how to do that?
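For example, I was thinking of replacing TrainAugmentation with the non-augmenting TestTransform when building the training dataset in train_ssd.py, something like the sketch below, but I'm not sure whether this is the correct/supported way to turn augmentation off (if I read data_preprocessing.py correctly, TestTransform also accepts (image, boxes, labels), so it looks like a drop-in, but please correct me if that's wrong):

```python
from vision.ssd.data_preprocessing import TestTransform

# Original transform in train_ssd.py (config is the SSD config object the script already uses):
# train_transform = TrainAugmentation(config.image_size, config.image_mean, config.image_std)

# Idea: use the test-time transform (resize + normalize only, no random photometric or
# geometric augmentation) for training as well, to rule TrainAugmentation out:
train_transform = TestTransform(config.image_size, config.image_mean, config.image_std)
```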
Thank you in advance!