Re-training SSD-Mobilenet: gt_locations contain NaN values, causing the regression loss to become NaN

Hi All,
I'm following the steps from the link below,

I'm training an SSD-Mobilenet model on the Bosch Small Traffic Lights Dataset.

While training, my average loss decreases slowly, but then it suddenly becomes NaN. I tried the following approaches, but the issue still persists:

  1. Error training with jetson-inference
    I have verified the images' XML files and they look fine. Sometimes I don't get any NaN values at all for epoch 0.
  2. Tuning the learning rate, i.e. 0.01, 0.001, 0.0001, etc.
  3. Using the Adam optimizer

However, after enabling PyTorch's anomaly detection, i.e. torch.autograd.set_detect_anomaly(True), I was able to find the instance and source of the NaN. By further debugging, I observed that one of the box locations in gt_locations has NaN values (please refer to the following log).
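As a minimal sketch of that debugging approach: enable anomaly detection, then check gt_locations for non-finite rows just before the loss is computed, so the offending batch can be logged or skipped instead of crashing inside loss.backward(). The helper name check_finite is illustrative, not part of the training script.

```python
import torch

# Enable PyTorch's anomaly detection so the backward pass reports
# the forward op that produced the NaN (as in the log below).
torch.autograd.set_detect_anomaly(True)

def check_finite(gt_locations: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask of rows that contain NaN/Inf values.

    Call this on gt_locations just before the loss to catch bad
    targets early and identify the responsible image_id.
    """
    return ~torch.isfinite(gt_locations).all(dim=-1)

# Example: the first row mimics the NaN entry from the log below.
gt = torch.tensor([[25.0286, 15.6667, float('nan'), float('nan')],
                   [ 4.0797,  2.3779, -13.1398,    -8.8714]])
bad_rows = check_finite(gt)
# bad_rows → tensor([True, False]); skip or debug the batch when any are True
```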


image_id: 481834
predicted_locations: tensor([[ 1.4837, 1.2564, -6.5235, -2.5821],
[ 0.6447, 0.8457, -16.9513, -11.4073],
[ 2.0294, 0.9745, -15.5438, -14.0698],
[ 1.8593, 1.0754, -15.8804, -14.4709],
[ 2.0474, 1.3663, -15.7238, -14.4092]],
grad_fn=)
gt_locations: tensor([[ 25.0286, 15.6667, nan, nan],
[ 4.0797, 2.3779, -13.1398, -8.8714],
[ 4.1841, 2.5611, -14.6530, -13.4025],
[ 2.0534, 0.6725, -13.3843, -12.9900],
[ 3.5518, 0.3255, -14.6399, -13.4983]])
regression_loss: nan | classification_loss: 3.4250411987304688 | loss: nan
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in SmoothL1LossBackward0. Traceback of forward call that caused the error:
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 148, in train
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 45, in forward
smooth_l1_loss = F.smooth_l1_loss(predicted_locations, gt_locations, size_average=False)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3188, in smooth_l1_loss
return torch._C._nn.smooth_l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction), beta)
(Triggered internally at …/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 153, in train
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'SmoothL1LossBackward0' returned nan values in its 0th output.
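One common source of such NaNs (an assumption worth checking, not a confirmed diagnosis): SSD's box encoding takes log(width / prior_width), so a ground-truth box with zero width or height turns into -inf/NaN in gt_locations. The annotations may look fine on inspection yet still contain (or be cropped by augmentation into) degenerate boxes. A quick scan over VOC-style XML annotations, using only standard bndbox fields, can be sketched as:

```python
import xml.etree.ElementTree as ET

def degenerate_boxes(xml_text: str):
    """Return (name, width, height) for boxes with zero/negative size.

    A zero-sized box produces log(0) = -inf during SSD target encoding,
    which would explain the NaN gt_locations above. Parses a VOC-style
    XML string; adapt to iterate over your annotation files.
    """
    bad = []
    root = ET.fromstring(xml_text)
    for obj in root.iter('object'):
        box = obj.find('bndbox')
        xmin = float(box.find('xmin').text)
        ymin = float(box.find('ymin').text)
        xmax = float(box.find('xmax').text)
        ymax = float(box.find('ymax').text)
        w, h = xmax - xmin, ymax - ymin
        if w <= 0 or h <= 0:
            bad.append((obj.find('name').text, w, h))
    return bad

# Example annotation with one degenerate (zero-width) box:
sample = """<annotation>
  <object><name>RedLight</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>10</xmax><ymax>40</ymax></bndbox>
  </object>
  <object><name>GreenLight</name>
    <bndbox><xmin>5</xmin><ymin>5</ymin><xmax>15</xmax><ymax>30</ymax></bndbox>
  </object>
</annotation>"""
print(degenerate_boxes(sample))  # → [('RedLight', 0.0, 20.0)]
```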


I suspect TrainAugmentation is causing this issue, but I'm not sure. To verify, I want to disable image augmentation.
Can anyone please suggest how to do that?
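For reference, a "no augmentation" training transform is essentially just resize plus normalization, which is what the library's test-time transform does (no random crop, flip, or photometric distortion). The sketch below is illustrative and standalone, not the library's exact API; nearest-neighbor resizing is used only for brevity:

```python
import numpy as np

def plain_transform(image: np.ndarray, size: int = 300,
                    mean: float = 127.0, std: float = 128.0) -> np.ndarray:
    """Resize an HWC image (nearest-neighbor, for brevity) and normalize.

    Substituting a resize+normalize transform like this for the random
    augmentation pipeline disables augmentation while keeping the input
    shape and value range the network expects.
    """
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size   # source row index per output row
    cols = np.arange(size) * w // size   # source col index per output col
    resized = image[rows][:, cols].astype(np.float32)
    return (resized - mean) / std

img = np.full((600, 400, 3), 127, dtype=np.uint8)
out = plain_transform(img)
# out.shape == (300, 300, 3); all values normalized to 0.0
```

If the NaN disappears with augmentation disabled, that points at boxes being cropped to zero size during augmentation rather than at the raw annotations.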

Thank you in advance!

Hi @KhemSon, please see my reply to your GitHub post about this issue here: