Re-training SSD-Mobilenet: gt_locations contain NaN values, causing the regression loss to become NaN

Hi All,
I'm following the steps from the link below,

I'm training an SSD-Mobilenet model on the Bosch Small Traffic Lights Dataset.

While training, my average loss decreases slowly, but then it suddenly becomes NaN. I tried the following approaches, but the issue still persists:

  1. Error training with jetson-inference
    I have verified the images' XML files and they look fine. Sometimes I don't get any NaN values at all for epoch 0.
  2. Tuning the learning rate, i.e. 0.01, 0.001, 0.0001, etc.
  3. Using the Adam optimizer

However, after enabling PyTorch's anomaly detection, i.e. torch.autograd.set_detect_anomaly(True), I was able to find the instance and source of the NaN. By further debugging, I observed that one of the box locations in gt_locations has NaN values (please refer to the following log).
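As a minimal sketch of that debugging approach: enable anomaly detection, then check gt_locations for non-finite rows just before the loss is computed, so the offending batch can be logged or skipped instead of crashing inside loss.backward(). The helper name check_finite is illustrative, not part of the training script.

```python
import torch

# Enable PyTorch's anomaly detection so the backward pass reports
# the forward op that produced the NaN (as in the log below).
torch.autograd.set_detect_anomaly(True)

def check_finite(gt_locations: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask of rows that contain NaN/Inf values.

    Call this on gt_locations just before the loss to catch bad
    targets early and identify the responsible image_id.
    """
    return ~torch.isfinite(gt_locations).all(dim=-1)

# Example: the first row mimics the NaN entry from the log below.
gt = torch.tensor([[25.0286, 15.6667, float('nan'), float('nan')],
                   [ 4.0797,  2.3779, -13.1398,    -8.8714]])
bad_rows = check_finite(gt)
# bad_rows → tensor([True, False]); skip or debug the batch when any are True
```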


image_id: 481834
predicted_locations: tensor([[ 1.4837, 1.2564, -6.5235, -2.5821],
[ 0.6447, 0.8457, -16.9513, -11.4073],
[ 2.0294, 0.9745, -15.5438, -14.0698],
[ 1.8593, 1.0754, -15.8804, -14.4709],
[ 2.0474, 1.3663, -15.7238, -14.4092]],
grad_fn=)
gt_locations: tensor([[ 25.0286, 15.6667, nan, nan],
[ 4.0797, 2.3779, -13.1398, -8.8714],
[ 4.1841, 2.5611, -14.6530, -13.4025],
[ 2.0534, 0.6725, -13.3843, -12.9900],
[ 3.5518, 0.3255, -14.6399, -13.4983]])
regression_loss: nan | classification_loss: 3.4250411987304688 | loss: nan
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in SmoothL1LossBackward0. Traceback of forward call that caused the error:
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 148, in train
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 45, in forward
smooth_l1_loss = F.smooth_l1_loss(predicted_locations, gt_locations, size_average=False)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3188, in smooth_l1_loss
return torch._C._nn.smooth_l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction), beta)
(Triggered internally at …/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 153, in train
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'SmoothL1LossBackward0' returned nan values in its 0th output.
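One common source of such NaNs (an assumption worth checking, not a confirmed diagnosis): SSD's box encoding takes log(width / prior_width), so a ground-truth box with zero width or height turns into -inf/NaN in gt_locations. The annotations may look fine on inspection yet still contain (or be cropped by augmentation into) degenerate boxes. A quick scan over VOC-style XML annotations, using only standard bndbox fields, can be sketched as:

```python
import xml.etree.ElementTree as ET

def degenerate_boxes(xml_text: str):
    """Return (name, width, height) for boxes with zero/negative size.

    A zero-sized box produces log(0) = -inf during SSD target encoding,
    which would explain the NaN gt_locations above. Parses a VOC-style
    XML string; adapt to iterate over your annotation files.
    """
    bad = []
    root = ET.fromstring(xml_text)
    for obj in root.iter('object'):
        box = obj.find('bndbox')
        xmin = float(box.find('xmin').text)
        ymin = float(box.find('ymin').text)
        xmax = float(box.find('xmax').text)
        ymax = float(box.find('ymax').text)
        w, h = xmax - xmin, ymax - ymin
        if w <= 0 or h <= 0:
            bad.append((obj.find('name').text, w, h))
    return bad

# Example annotation with one degenerate (zero-width) box:
sample = """<annotation>
  <object><name>RedLight</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>10</xmax><ymax>40</ymax></bndbox>
  </object>
  <object><name>GreenLight</name>
    <bndbox><xmin>5</xmin><ymin>5</ymin><xmax>15</xmax><ymax>30</ymax></bndbox>
  </object>
</annotation>"""
print(degenerate_boxes(sample))  # → [('RedLight', 0.0, 20.0)]
```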


I suspect TrainAugmentation is causing this issue, but I'm not sure. To verify, I want to disable image augmentation.
Can anyone please suggest how to do that?
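For reference, a "no augmentation" training transform is essentially just resize plus normalization, which is what the library's test-time transform does (no random crop, flip, or photometric distortion). The sketch below is illustrative and standalone, not the library's exact API; nearest-neighbor resizing is used only for brevity:

```python
import numpy as np

def plain_transform(image: np.ndarray, size: int = 300,
                    mean: float = 127.0, std: float = 128.0) -> np.ndarray:
    """Resize an HWC image (nearest-neighbor, for brevity) and normalize.

    Substituting a resize+normalize transform like this for the random
    augmentation pipeline disables augmentation while keeping the input
    shape and value range the network expects.
    """
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size   # source row index per output row
    cols = np.arange(size) * w // size   # source col index per output col
    resized = image[rows][:, cols].astype(np.float32)
    return (resized - mean) / std

img = np.full((600, 400, 3), 127, dtype=np.uint8)
out = plain_transform(img)
# out.shape == (300, 300, 3); all values normalized to 0.0
```

If the NaN disappears with augmentation disabled, that points at boxes being cropped to zero size during augmentation rather than at the raw annotations.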

Thank you in advance!

Hi @KhemSon, please see my reply to your GitHub post about this issue here: