Hi there,
I retrained the SSD-Mobilenet network following the description here, using a set of images from the Open Images database:
That worked without any issues.
Now I am trying to do the same with this dataset:
I have already solved some issues to get the training started. But at the end of the first epoch, once all training images have been processed, I get an error:
python3 train_ssd.py --dataset-type=voc --data=data/shwd/VOC2028/ --model-dir=models/shwd --batch-size=4 --epochs=30
2021-09-29 10:39:22 - Using CUDA...
2021-09-29 10:39:22 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/shwd', dataset_type='voc', datasets=['data/shwd/VOC2028/'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2021-09-29 10:39:22 - Prepare training datasets.
2021-09-29 10:39:24 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-29 10:39:24 - Stored labels into file models/shwd/labels.txt.
2021-09-29 10:39:24 - Train dataset size: 6064
2021-09-29 10:39:24 - Prepare Validation datasets.
2021-09-29 10:39:25 - VOC Labels read from file: ('BACKGROUND', 'hat', 'person')
2021-09-29 10:39:25 - Validation dataset size: 1517
2021-09-29 10:39:25 - Build network.
2021-09-29 10:39:25 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2021-09-29 10:39:25 - Took 0.10 seconds to load the model.
2021-09-29 10:39:29 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-09-29 10:39:29 - Uses CosineAnnealingLR scheduler.
2021-09-29 10:39:29 - Start training from epoch 0.
/home/emsys/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/home/emsys/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
2021-09-29 10:39:40 - Epoch: 0, Step: 10/1516, Avg Loss: 15.1348, Avg Regression Loss 9.6635, Avg Classification Loss: 5.4713
2021-09-29 10:39:42 - Epoch: 0, Step: 20/1516, Avg Loss: 9.5630, Avg Regression Loss 5.6885, Avg Classification Loss: 3.8745
2021-09-29 10:39:44 - Epoch: 0, Step: 30/1516, Avg Loss: 9.4334, Avg Regression Loss 5.8865, Avg Classification Loss: 3.5469
2021-09-29 10:39:47 - Epoch: 0, Step: 40/1516, Avg Loss: 7.9035, Avg Regression Loss 4.2629, Avg Classification Loss: 3.6406
...
2021-09-29 10:45:50 - Epoch: 0, Step: 1500/1516, Avg Loss: 4.1119, Avg Regression Loss 1.7915, Avg Classification Loss: 2.3204
2021-09-29 10:45:52 - Epoch: 0, Step: 1510/1516, Avg Loss: 4.3096, Avg Regression Loss 2.0656, Avg Classification Loss: 2.2440
Traceback (most recent call last):
File "train_ssd.py", line 346, in <module>
val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
File "train_ssd.py", line 150, in test
for _, data in enumerate(loader):
File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
return self._process_data(data)
File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
data.reraise()
File "/home/emsys/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/emsys/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 83, in __getitem__
boxes, labels = self.target_transform(boxes, labels)
File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 155, in __call__
self.corner_form_priors, self.iou_threshold)
File "/home/emsys/jetson-inference/python/training/detection/ssd/vision/utils/box_utils.py", line 167, in assign_priors
best_target_per_prior, best_target_per_prior_index = ious.max(1)
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity
I couldn't find a solution for this error. Could somebody help me, please?
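In case it helps narrow things down: going by the last frame of the traceback, my guess is that at least one validation image ends up with no bounding boxes at all, so `ious` is empty when `assign_priors` calls `ious.max(1)`. That is only a guess. A minimal sketch for checking it, using just the Python standard library, could look like the following. The function name `find_empty_annotations` is made up for illustration, the `Annotations` subfolder is the usual VOC layout and an assumption here, and I am also assuming the loader only keeps objects whose name exactly matches one of the labels.

```python
import os
import xml.etree.ElementTree as ET


def find_empty_annotations(annotations_dir, class_names):
    """Return the sorted names of VOC XML files that contain no usable object.

    Assumption: an object is usable when its <name> text (whitespace
    stripped, exact match) is in class_names and it has a <bndbox> element.
    This is a hypothetical helper, not part of train_ssd.py.
    """
    wanted = set(class_names)
    empty = []
    for file_name in sorted(os.listdir(annotations_dir)):
        if not file_name.endswith(".xml"):
            continue  # skip anything that is not an annotation file
        root = ET.parse(os.path.join(annotations_dir, file_name)).getroot()
        usable = 0
        for obj in root.iter("object"):
            name = (obj.findtext("name") or "").strip()
            if name in wanted and obj.find("bndbox") is not None:
                usable += 1
        if usable == 0:
            empty.append(file_name)
    return empty
```

Calling `find_empty_annotations("data/shwd/VOC2028/Annotations", ["hat", "person"])` would list the annotation files that have no `hat` or `person` box, if there are any. Those would be the first images I would look at.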
Thanks
Florian