Training with "train_ssd.py" - error at the end of the dataset

Hi there,

I re-trained the SSD-Mobilenet network following the description here, with a set of images from the Open Images database:

That worked out without any issues.

Now I am trying to do the same with this dataset:

I already solved a few issues to get the training started. But at the end of the first epoch, when all images have been processed, I get an error:

python3 train_ssd.py --dataset-type=voc --data=data/shwd/VOC2028/ --model-dir=models/shwd --batch-size=4 --epochs=30
2021-09-29 10:39:22 - Using CUDA...
2021-09-29 10:39:22 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/shwd', dataset_type='voc', datasets=['data/shwd/VOC2028/'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2021-09-29 10:39:22 - Prepare training datasets.
2021-09-29 10:39:24 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-29 10:39:24 - Stored labels into file models/shwd/labels.txt.
2021-09-29 10:39:24 - Train dataset size: 6064
2021-09-29 10:39:24 - Prepare Validation datasets.
2021-09-29 10:39:25 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-29 10:39:25 - Validation dataset size: 1517
2021-09-29 10:39:25 - Build network.
2021-09-29 10:39:25 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2021-09-29 10:39:25 - Took 0.10 seconds to load the model.
2021-09-29 10:39:29 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-09-29 10:39:29 - Uses CosineAnnealingLR scheduler.
2021-09-29 10:39:29 - Start training from epoch 0.
/home/emsys/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/home/emsys/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2021-09-29 10:39:40 - Epoch: 0, Step: 10/1516, Avg Loss: 15.1348, Avg Regression Loss 9.6635, Avg Classification Loss: 5.4713
2021-09-29 10:39:42 - Epoch: 0, Step: 20/1516, Avg Loss: 9.5630, Avg Regression Loss 5.6885, Avg Classification Loss: 3.8745
2021-09-29 10:39:44 - Epoch: 0, Step: 30/1516, Avg Loss: 9.4334, Avg Regression Loss 5.8865, Avg Classification Loss: 3.5469
2021-09-29 10:39:47 - Epoch: 0, Step: 40/1516, Avg Loss: 7.9035, Avg Regression Loss 4.2629, Avg Classification Loss: 3.6406

...

2021-09-29 10:45:50 - Epoch: 0, Step: 1500/1516, Avg Loss: 4.1119, Avg Regression Loss 1.7915, Avg Classification Loss: 2.3204
2021-09-29 10:45:52 - Epoch: 0, Step: 1510/1516, Avg Loss: 4.3096, Avg Regression Loss 2.0656, Avg Classification Loss: 2.2440
Traceback (most recent call last):
  File "train_ssd.py", line 346, in <module>
    val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
  File "train_ssd.py", line 150, in test
    for _, data in enumerate(loader):
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 83, in __getitem__
    boxes, labels = self.target_transform(boxes, labels)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 155, in __call__
    self.corner_form_priors, self.iou_threshold)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/utils/box_utils.py", line 167, in assign_priors
    best_target_per_prior, best_target_per_prior_index = ious.max(1)
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity

I couldn’t find any solution to this error. Could somebody help me, please?

Thanks
Florian

Hi @Florian_Faltin, I recommend uncommenting this line of code, which will print out the ID/metadata of each image as it is loaded:

Also run it with --batch-size=1 --workers=0 --debug-steps=1.
Then look at the last image that was loaded before the exception occurred, and remove that image from the dataset's ImageSets lists.
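
If you want to find every problem image up front instead of one at a time, a quick scan over the annotations can list the IDs whose box list comes out empty. This is just a sketch: it assumes the standard VOC layout (Annotations/*.xml and ImageSets/Main/<split>.txt), find_empty_annotations is a made-up helper, and the difficult-object filtering mirrors what I believe the loader does by default; adjust the split file name to whatever your dataset uses:

import os
import xml.etree.ElementTree as ET

def find_empty_annotations(voc_root, image_set='trainval.txt', keep_difficult=False):
    # Collect the image IDs of the chosen split.
    ids_path = os.path.join(voc_root, 'ImageSets', 'Main', image_set)
    with open(ids_path) as f:
        image_ids = [line.split()[0] for line in f if line.strip()]

    empty = []
    for image_id in image_ids:
        xml_path = os.path.join(voc_root, 'Annotations', image_id + '.xml')
        objects = ET.parse(xml_path).findall('object')
        if not keep_difficult:
            # Drop objects tagged <difficult>1</difficult>, like the loader does.
            objects = [obj for obj in objects
                       if obj.findtext('difficult', default='0') != '1']
        if not objects:
            empty.append(image_id)
    return empty

print(find_empty_annotations('data/shwd/VOC2028'))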


Thank you @dusty_nv for the prompt reply.
This is the output with the modified script and command:

$ python3 train_ssd.py --dataset-type=voc --data=data/shwd/VOC2028/ --model-dir=models/shwd/ --batch-size=1 --workers=0 --debug-steps=1
2021-09-30 08:14:58 - Using CUDA...
2021-09-30 08:14:58 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/shwd/', dataset_type='voc', datasets=['data/shwd/VOC2028/'], debug_steps=1, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=0, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2021-09-30 08:14:58 - Prepare training datasets.
2021-09-30 08:15:00 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-30 08:15:00 - Stored labels into file models/shwd/labels.txt.
2021-09-30 08:15:00 - Train dataset size: 6064
2021-09-30 08:15:00 - Prepare Validation datasets.
2021-09-30 08:15:01 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-30 08:15:01 - Validation dataset size: 1517
2021-09-30 08:15:01 - Build network.
2021-09-30 08:15:01 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2021-09-30 08:15:01 - Took 0.09 seconds to load the model.
2021-09-30 08:15:05 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-09-30 08:15:05 - Uses CosineAnnealingLR scheduler.
2021-09-30 08:15:05 - Start training from epoch 0.
/home/emsys/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
__getitem__  image_id=PartB_02053 
boxes=[[ 903. 1115.  991. 1223.]
 [2267. 1820. 2447. 2062.]
 [2127. 1967. 2415. 2293.]
 [1074. 1928. 1342. 2301.]
 [1420. 1728. 1669. 2030.]
 [1791. 1532. 1908. 1676.]
 [1720. 1425. 1830. 1550.]
 [1505. 1481. 1627. 1618.]
 [1452. 1369. 1562. 1496.]
 [1144. 1547. 1313. 1750.]
 [1208. 1440. 1327. 1569.]
 [1264. 1347. 1374. 1454.]
 [ 825. 1401.  939. 1525.]
 [ 844. 1513.  988. 1657.]
 [ 622. 1628.  818. 1837.]
 [ 269. 1503.  457. 1701.]
 [   3. 1715.  208. 2103.]
 [ 544. 1381.  686. 1542.]
 [ 581. 1320.  693. 1420.]
 [ 205. 1301.  313. 1411.]
 [  49. 1276.  135. 1376.]
 [ 203. 1252.  298. 1347.]
 [ 342. 1235.  430. 1340.]] 
labels=[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
/home/emsys/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
__getitem__  image_id=part2_001843 
boxes=[[713.  44. 945. 245.]] 
labels=[1]
2021-09-30 08:15:15 - Epoch: 0, Step: 1/6064, Avg Loss: 51.1179, Avg Regression Loss 32.6321, Avg Classification Loss: 18.4859
__getitem__  image_id=001290 
boxes=[[156.  74. 335. 283.]] 
labels=[2]
2021-09-30 08:15:15 - Epoch: 0, Step: 2/6064, Avg Loss: 14.1361, Avg Regression Loss 3.5483, Avg Classification Loss: 10.5878
__getitem__  image_id=001120 
boxes=[[254. 120. 343. 206.]] 
labels=[1]
2021-09-30 08:15:16 - Epoch: 0, Step: 3/6064, Avg Loss: 15.7471, Avg Regression Loss 8.0907, Avg Classification Loss: 7.6564
__getitem__  image_id=001501 
boxes=[[397.  71. 917. 780.]] 
labels=[1]

...

2021-09-30 08:09:57 - Epoch: 0, Step: 6063/6064, Avg Loss: 5.5368, Avg Regression Loss 2.7356, Avg Classification Loss: 2.8012
__getitem__  image_id=000002 
boxes=[[ 36.  31.  75.  83.]
 [164. 102. 207. 157.]
 [177.  70. 212. 112.]
 [220.  43. 250.  87.]
 [248.  60. 282. 111.]
 [334.  59. 375. 111.]
 [343. 106. 384. 162.]
 [371.  58. 401. 109.]
 [408.  76. 453. 135.]
 [  8.  74.  45. 123.]] 
labels=[1 1 1 1 1 1 1 1 1 2]
__getitem__  image_id=000005 
boxes=[[377. 108. 457. 178.]] 
labels=[1]
__getitem__  image_id=000019 
boxes=[[305. 133. 352. 188.]
 [420. 123. 469. 183.]
 [640.  44. 710. 118.]
 [549.  79. 564.  95.]
 [539.  64. 549.  79.]
 [564.  71. 578.  90.]
 [334.  72. 347.  90.]
 [403.  71. 418.  92.]
 [380.  68. 392.  85.]
 [283.  69. 295.  86.]
 [292.  71. 303.  87.]
 [306.  74. 316.  88.]
 [317.  74. 329.  90.]
 [260.  73. 273.  91.]
 [244.  73. 255.  88.]
 [231.  66. 242.  82.]
 [224.  70. 235.  86.]
 [214.  70. 225.  86.]
 [206.  67. 214.  77.]
 [613.  82. 624.  98.]
 [595.  75. 606.  88.]
 [606.  75. 617.  85.]
 [550.  65. 560.  79.]
 [560.  70. 569.  82.]
 [543.  76. 552.  91.]
 [493.  68. 502.  80.]
 [520.  68. 528.  82.]
 [367.  71. 381.  90.]
 [392.  70. 403.  85.]
 [256.  67. 267.  83.]
 [ 28.  52.  46.  73.]
 [  5.  66.  28.  90.]] 
labels=[1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 2 2 1]
__getitem__  image_id=000022 
boxes=[] 
labels=[]
Traceback (most recent call last):
  File "train_ssd.py", line 346, in <module>
    val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
  File "train_ssd.py", line 150, in test
    for _, data in enumerate(loader):
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 403, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 83, in __getitem__
    boxes, labels = self.target_transform(boxes, labels)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 155, in __call__
    self.corner_form_priors, self.iou_threshold)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/utils/box_utils.py", line 167, in assign_priors
    best_target_per_prior, best_target_per_prior_index = ious.max(1)
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity
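
So the debug output makes the cause visible: image 000022 yields boxes=[] and labels=[], and assign_priors then builds an IoU matrix with a zero-sized dimension, where max() has no identity to reduce with. The error can be reproduced in isolation (the prior count below is just illustrative):

import torch

# assign_priors computes an IoU matrix of shape (num_priors, num_targets)
# and reduces it with ious.max(1). With no ground-truth boxes, num_targets == 0:
ious = torch.rand(3000, 0)  # 3000 priors (illustrative), zero target boxes
ious.max(1)
# RuntimeError: cannot perform reduction function max on tensor with no elements
# because the operation does not have an identity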

I removed image 000022 from the dataset, and it appears to work now. Training is still running, but the first epoch has already completed.

I wonder why this image causes the issue. The annotation file looks like this:

<annotation>
	<folder>hat01</folder>
	<filename>000022.jpg</filename>
	<path>D:\dataset\hat01\000022.jpg</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>690</width>
		<height>518</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>hat</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>1</difficult>
		<bndbox>
			<xmin>640</xmin>
			<ymin>105</ymin>
			<xmax>660</xmax>
			<ymax>128</ymax>
		</bndbox>
	</object>
</annotation>

Another file, which evidently works, looks like this:

<annotation>
	<folder>hat01</folder>
	<filename>000257.jpg</filename>
	<path>D:\dataset\hat01\000257.jpg</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>600</width>
		<height>441</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>hat</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>1</difficult>
		<bndbox>
			<xmin>363</xmin>
			<ymin>43</ymin>
			<xmax>397</xmax>
			<ymax>77</ymax>
		</bndbox>
	</object>
</annotation>

The two files appear identical in structure to me. They differ only in filename/path, image size, and bounding-box coordinates.

Regards
Florian

Hmm, I’m not sure why that would be; as you pointed out, both XML annotations look fine. One thing that might be worth checking (just a guess): both files mark their only object as <difficult>1</difficult>, and if I remember correctly the VOC loader drops difficult objects unless keep_difficult is enabled, which would leave an empty box list. That wouldn't explain why 000257 works, though, unless it simply isn't in the split that gets loaded. Glad you got it running!
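
For illustration, this is roughly the filtering I mean, as a standalone sketch (the variable names mimic the VOC loader, but it is not the repo's code verbatim):

import numpy as np

# The single object from 000022.xml, marked difficult.
boxes = np.array([[640., 105., 660., 128.]])
labels = np.array([1])          # 'hat'
is_difficult = np.array([1])    # <difficult>1</difficult>

keep_difficult = False          # the loader's default, as far as I know
if not keep_difficult:
    boxes = boxes[is_difficult == 0]
    labels = labels[is_difficult == 0]

print(boxes.shape)  # (0, 4): an empty box list, which later crashes assign_priors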

BTW you can comment that line of code back out again now, so your console isn’t flooded with messages when you are training.

I did that already and restarted the training (also with a higher batch size). Training finished, and inference works just fine.
Thanks for your support!
Thanks for your support!

No problem @Florian_Faltin, happy you got it working!