Training with "train_ssd.py" - error at the end of the dataset

Hi there,

I re-trained the SSD-Mobilenet network following the description here, with a set of images from the Open Images database:

That worked out without any issues.

Now I am trying to do the same with this dataset:

I already solved a few issues to get the training started. But at the end of the first epoch, when all images have been processed, I get an error:

python3 train_ssd.py --dataset-type=voc --data=data/shwd/VOC2028/ --model-dir=models/shwd --batch-size=4 --epochs=30
2021-09-29 10:39:22 - Using CUDA...
2021-09-29 10:39:22 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/shwd', dataset_type='voc', datasets=['data/shwd/VOC2028/'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2021-09-29 10:39:22 - Prepare training datasets.
2021-09-29 10:39:24 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-29 10:39:24 - Stored labels into file models/shwd/labels.txt.
2021-09-29 10:39:24 - Train dataset size: 6064
2021-09-29 10:39:24 - Prepare Validation datasets.
2021-09-29 10:39:25 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-29 10:39:25 - Validation dataset size: 1517
2021-09-29 10:39:25 - Build network.
2021-09-29 10:39:25 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2021-09-29 10:39:25 - Took 0.10 seconds to load the model.
2021-09-29 10:39:29 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-09-29 10:39:29 - Uses CosineAnnealingLR scheduler.
2021-09-29 10:39:29 - Start training from epoch 0.
/home/emsys/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/home/emsys/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2021-09-29 10:39:40 - Epoch: 0, Step: 10/1516, Avg Loss: 15.1348, Avg Regression Loss 9.6635, Avg Classification Loss: 5.4713
2021-09-29 10:39:42 - Epoch: 0, Step: 20/1516, Avg Loss: 9.5630, Avg Regression Loss 5.6885, Avg Classification Loss: 3.8745
2021-09-29 10:39:44 - Epoch: 0, Step: 30/1516, Avg Loss: 9.4334, Avg Regression Loss 5.8865, Avg Classification Loss: 3.5469
2021-09-29 10:39:47 - Epoch: 0, Step: 40/1516, Avg Loss: 7.9035, Avg Regression Loss 4.2629, Avg Classification Loss: 3.6406

...

2021-09-29 10:45:50 - Epoch: 0, Step: 1500/1516, Avg Loss: 4.1119, Avg Regression Loss 1.7915, Avg Classification Loss: 2.3204
2021-09-29 10:45:52 - Epoch: 0, Step: 1510/1516, Avg Loss: 4.3096, Avg Regression Loss 2.0656, Avg Classification Loss: 2.2440
Traceback (most recent call last):
  File "train_ssd.py", line 346, in <module>
    val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
  File "train_ssd.py", line 150, in test
    for _, data in enumerate(loader):
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 83, in __getitem__
    boxes, labels = self.target_transform(boxes, labels)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 155, in __call__
    self.corner_form_priors, self.iou_threshold)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/utils/box_utils.py", line 167, in assign_priors
    best_target_per_prior, best_target_per_prior_index = ious.max(1)
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity

I couldn’t find any solution to this error. Could somebody help me, please?

Thanks
Florian

Hi @Florian_Faltin, I recommend uncommenting this line of code, which will print out the ID/metadata of each image as it is loaded:

Also run it with --batch-size=1 --workers=0 --debug-steps=1.
Then look at the last image that was loaded before the exception occurred, and remove that image from the dataset's ImageSets lists.
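
If you want to find every problem image up front instead of one at a time, a quick scan over the annotations can list the IDs whose box list comes out empty. This is just a sketch: it assumes the standard VOC layout (Annotations/*.xml and ImageSets/Main/<split>.txt), find_empty_annotations is a made-up helper, and the difficult-object filtering mirrors what I believe the loader does by default; adjust the split file name to whatever your dataset uses:

import os
import xml.etree.ElementTree as ET

def find_empty_annotations(voc_root, image_set='trainval.txt', keep_difficult=False):
    # Collect the image IDs of the chosen split.
    ids_path = os.path.join(voc_root, 'ImageSets', 'Main', image_set)
    with open(ids_path) as f:
        image_ids = [line.split()[0] for line in f if line.strip()]

    empty = []
    for image_id in image_ids:
        xml_path = os.path.join(voc_root, 'Annotations', image_id + '.xml')
        objects = ET.parse(xml_path).findall('object')
        if not keep_difficult:
            # Drop objects tagged <difficult>1</difficult>, like the loader does.
            objects = [obj for obj in objects
                       if obj.findtext('difficult', default='0') != '1']
        if not objects:
            empty.append(image_id)
    return empty

print(find_empty_annotations('data/shwd/VOC2028'))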


Thank you @dusty_nv for the prompt reply.
This is the output with the modified script and command:

$ python3 train_ssd.py --dataset-type=voc --data=data/shwd/VOC2028/ --model-dir=models/shwd/ --batch-size=1 --workers=0 --debug-steps=1
2021-09-30 08:14:58 - Using CUDA...
2021-09-30 08:14:58 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/shwd/', dataset_type='voc', datasets=['data/shwd/VOC2028/'], debug_steps=1, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=0, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2021-09-30 08:14:58 - Prepare training datasets.
2021-09-30 08:15:00 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-30 08:15:00 - Stored labels into file models/shwd/labels.txt.
2021-09-30 08:15:00 - Train dataset size: 6064
2021-09-30 08:15:00 - Prepare Validation datasets.
2021-09-30 08:15:01 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-30 08:15:01 - Validation dataset size: 1517
2021-09-30 08:15:01 - Build network.
2021-09-30 08:15:01 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2021-09-30 08:15:01 - Took 0.09 seconds to load the model.
2021-09-30 08:15:05 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-09-30 08:15:05 - Uses CosineAnnealingLR scheduler.
2021-09-30 08:15:05 - Start training from epoch 0.
/home/emsys/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
__getitem__  image_id=PartB_02053 
boxes=[[ 903. 1115.  991. 1223.]
 [2267. 1820. 2447. 2062.]
 [2127. 1967. 2415. 2293.]
 [1074. 1928. 1342. 2301.]
 [1420. 1728. 1669. 2030.]
 [1791. 1532. 1908. 1676.]
 [1720. 1425. 1830. 1550.]
 [1505. 1481. 1627. 1618.]
 [1452. 1369. 1562. 1496.]
 [1144. 1547. 1313. 1750.]
 [1208. 1440. 1327. 1569.]
 [1264. 1347. 1374. 1454.]
 [ 825. 1401.  939. 1525.]
 [ 844. 1513.  988. 1657.]
 [ 622. 1628.  818. 1837.]
 [ 269. 1503.  457. 1701.]
 [   3. 1715.  208. 2103.]
 [ 544. 1381.  686. 1542.]
 [ 581. 1320.  693. 1420.]
 [ 205. 1301.  313. 1411.]
 [  49. 1276.  135. 1376.]
 [ 203. 1252.  298. 1347.]
 [ 342. 1235.  430. 1340.]] 
labels=[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
/home/emsys/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
__getitem__  image_id=part2_001843 
boxes=[[713.  44. 945. 245.]] 
labels=[1]
2021-09-30 08:15:15 - Epoch: 0, Step: 1/6064, Avg Loss: 51.1179, Avg Regression Loss 32.6321, Avg Classification Loss: 18.4859
__getitem__  image_id=001290 
boxes=[[156.  74. 335. 283.]] 
labels=[2]
2021-09-30 08:15:15 - Epoch: 0, Step: 2/6064, Avg Loss: 14.1361, Avg Regression Loss 3.5483, Avg Classification Loss: 10.5878
__getitem__  image_id=001120 
boxes=[[254. 120. 343. 206.]] 
labels=[1]
2021-09-30 08:15:16 - Epoch: 0, Step: 3/6064, Avg Loss: 15.7471, Avg Regression Loss 8.0907, Avg Classification Loss: 7.6564
__getitem__  image_id=001501 
boxes=[[397.  71. 917. 780.]] 
labels=[1]

...

2021-09-30 08:09:57 - Epoch: 0, Step: 6063/6064, Avg Loss: 5.5368, Avg Regression Loss 2.7356, Avg Classification Loss: 2.8012
__getitem__  image_id=000002 
boxes=[[ 36.  31.  75.  83.]
 [164. 102. 207. 157.]
 [177.  70. 212. 112.]
 [220.  43. 250.  87.]
 [248.  60. 282. 111.]
 [334.  59. 375. 111.]
 [343. 106. 384. 162.]
 [371.  58. 401. 109.]
 [408.  76. 453. 135.]
 [  8.  74.  45. 123.]] 
labels=[1 1 1 1 1 1 1 1 1 2]
__getitem__  image_id=000005 
boxes=[[377. 108. 457. 178.]] 
labels=[1]
__getitem__  image_id=000019 
boxes=[[305. 133. 352. 188.]
 [420. 123. 469. 183.]
 [640.  44. 710. 118.]
 [549.  79. 564.  95.]
 [539.  64. 549.  79.]
 [564.  71. 578.  90.]
 [334.  72. 347.  90.]
 [403.  71. 418.  92.]
 [380.  68. 392.  85.]
 [283.  69. 295.  86.]
 [292.  71. 303.  87.]
 [306.  74. 316.  88.]
 [317.  74. 329.  90.]
 [260.  73. 273.  91.]
 [244.  73. 255.  88.]
 [231.  66. 242.  82.]
 [224.  70. 235.  86.]
 [214.  70. 225.  86.]
 [206.  67. 214.  77.]
 [613.  82. 624.  98.]
 [595.  75. 606.  88.]
 [606.  75. 617.  85.]
 [550.  65. 560.  79.]
 [560.  70. 569.  82.]
 [543.  76. 552.  91.]
 [493.  68. 502.  80.]
 [520.  68. 528.  82.]
 [367.  71. 381.  90.]
 [392.  70. 403.  85.]
 [256.  67. 267.  83.]
 [ 28.  52.  46.  73.]
 [  5.  66.  28.  90.]] 
labels=[1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 2 2 1]
__getitem__  image_id=000022 
boxes=[] 
labels=[]
Traceback (most recent call last):
  File "train_ssd.py", line 346, in <module>
    val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
  File "train_ssd.py", line 150, in test
    for _, data in enumerate(loader):
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 403, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 83, in __getitem__
    boxes, labels = self.target_transform(boxes, labels)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 155, in __call__
    self.corner_form_priors, self.iou_threshold)
  File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/utils/box_utils.py", line 167, in assign_priors
    best_target_per_prior, best_target_per_prior_index = ious.max(1)
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity
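
So the debug output makes the cause visible: image 000022 yields boxes=[] and labels=[], and assign_priors then builds an IoU matrix with a zero-sized dimension, where max() has no identity to reduce with. The error can be reproduced in isolation (the prior count below is just illustrative):

import torch

# assign_priors computes an IoU matrix of shape (num_priors, num_targets)
# and reduces it with ious.max(1). With no ground-truth boxes, num_targets == 0:
ious = torch.rand(3000, 0)  # 3000 priors (illustrative), zero target boxes
ious.max(1)
# RuntimeError: cannot perform reduction function max on tensor with no elements
# because the operation does not have an identity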

I removed image 000022 from the dataset, and it appears to work now. Training is still running, but the first epoch has already completed.

I wonder why this image causes the issue. The annotation file looks like this:

<annotation>
	<folder>hat01</folder>
	<filename>000022.jpg</filename>
	<path>D:\dataset\hat01\000022.jpg</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>690</width>
		<height>518</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>hat</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>1</difficult>
		<bndbox>
			<xmin>640</xmin>
			<ymin>105</ymin>
			<xmax>660</xmax>
			<ymax>128</ymax>
		</bndbox>
	</object>
</annotation>

Another file, which evidently works, looks like this:

<annotation>
	<folder>hat01</folder>
	<filename>000257.jpg</filename>
	<path>D:\dataset\hat01\000257.jpg</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>600</width>
		<height>441</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>hat</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>1</difficult>
		<bndbox>
			<xmin>363</xmin>
			<ymin>43</ymin>
			<xmax>397</xmax>
			<ymax>77</ymax>
		</bndbox>
	</object>
</annotation>

The two files appear identical in structure to me. They differ only in filename/path, image size, and bounding-box coordinates.

Regards
Florian

Hmm, I’m not sure why that would be; as you pointed out, both XML annotations look fine. One thing that might be worth checking (just a guess): both files mark their only object as <difficult>1</difficult>, and if I remember correctly the VOC loader drops difficult objects unless keep_difficult is enabled, which would leave an empty box list. That wouldn't explain why 000257 works, though, unless it simply isn't in the split that gets loaded. Glad you got it running!
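
For illustration, this is roughly the filtering I mean, as a standalone sketch (the variable names mimic the VOC loader, but it is not the repo's code verbatim):

import numpy as np

# The single object from 000022.xml, marked difficult.
boxes = np.array([[640., 105., 660., 128.]])
labels = np.array([1])          # 'hat'
is_difficult = np.array([1])    # <difficult>1</difficult>

keep_difficult = False          # the loader's default, as far as I know
if not keep_difficult:
    boxes = boxes[is_difficult == 0]
    labels = labels[is_difficult == 0]

print(boxes.shape)  # (0, 4): an empty box list, which later crashes assign_priors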

BTW you can comment that line of code back out again now, so your console isn’t flooded with messages when you are training.

I did that already and restarted the training (also with a higher batch size). Training finished, and inference works just fine.
Thanks for your support!
Thanks for your support!

No problem @Florian_Faltin, happy you got it working!