I have successfully trained SSD-Mobilenet (detectnet) with a small test portion of my dataset (the first 20 images), but when I try to train with the entire dataset (~1100 images) I get the error below:
Because I hand-labeled ~1100 images, I am very invested in figuring this out!
Background:
These images were labeled with labelImg (GitHub: heartexlabs/labelImg) in VOC output format.
The file names (for ImageSets/Main/…) were extracted with:
find * -type f -print | sed 's/\.[^.]*$//'
and then pasted into the test, train, trainval, and val text files.
(Be sure to remove whitespace at the end of the paste… ask me how I know. A scripted alternative is sketched below.)
The folder structure is Annotations, ImageSets/Main, and JPEGImages
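For what it's worth, here is a minimal Python sketch (not part of jetson-inference; the 80/20 split and reusing the val list for test are just assumptions for illustration) that writes the ImageSets/Main text files straight from JPEGImages, which avoids the find/sed + paste step and any stray trailing whitespace:

import os
import random

dataset_root = "/XavierSSD/datasets/TrapDet6"   # assumed path from the command below
jpeg_dir = os.path.join(dataset_root, "JPEGImages")
main_dir = os.path.join(dataset_root, "ImageSets", "Main")
os.makedirs(main_dir, exist_ok=True)

# Strip the extension from every image file name (same effect as the sed pipeline above).
stems = sorted(os.path.splitext(f)[0] for f in os.listdir(jpeg_dir)
               if f.lower().endswith((".jpg", ".jpeg", ".png")))

random.seed(0)
random.shuffle(stems)
cut = int(0.8 * len(stems))                     # assumed 80/20 train/val split
splits = {"train": stems[:cut], "val": stems[cut:],
          "trainval": stems, "test": stems[cut:]}

for name, ids in splits.items():
    with open(os.path.join(main_dir, name + ".txt"), "w") as f:
        f.write("\n".join(ids) + "\n")          # one id per line, no trailing whitespace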
Any suggestions as to why it runs fine with 20 images but crashes on the entire ~1100-image dataset?
Can bounding boxes not overlap? The first 20 images do not have any "label" overlap; the rest do.
I need some guidance…
Note that the first two warnings (with hyperlink) also appeared on the successful runs
Error Below:
python3 train_ssd.py --dataset-type=voc --epochs=100 --data=/XavierSSD/datasets/TrapDet6 --model-dir=TrapDet6
2020-10-12 20:39:07 - Using CUDA…
2020-10-12 20:39:07 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='TrapDet6', dataset_type='voc', datasets=['/XavierSSD/datasets/TrapDet6'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=100, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2020-10-12 20:39:07 - Prepare training datasets.
2020-10-12 20:39:07 - VOC Labels read from file: ('BACKGROUND', 'person')
2020-10-12 20:39:07 - Stored labels into file TrapDet6/labels.txt.
2020-10-12 20:39:07 - Train dataset size: 1084
2020-10-12 20:39:07 - Prepare Validation datasets.
2020-10-12 20:39:07 - VOC Labels read from file: ('BACKGROUND', 'person')
2020-10-12 20:39:07 - Validation dataset size: 1084
2020-10-12 20:39:07 - Build network.
2020-10-12 20:39:07 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2020-10-12 20:39:08 - Took 0.15 seconds to load the model.
2020-10-12 20:39:12 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2020-10-12 20:39:12 - Uses CosineAnnealingLR scheduler.
2020-10-12 20:39:12 - Start training from epoch 0.
/home/kel/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/home/kel/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Traceback (most recent call last):
File "train_ssd.py", line 343, in <module>
device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 113, in train
for i, data in enumerate(loader):
File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
return self._process_data(data)
File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
data.reraise()
File "/home/kel/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 207, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 69, in __getitem__
image, boxes, labels = self.transform(image, boxes, labels)
File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 34, in __call__
return self.augment(img, boxes, labels)
File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 55, in __call__
img, boxes, labels = t(img, boxes, labels)
File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 280, in __call__
if overlap.min() < min_iou and max_iou < overlap.max():
File "/home/kel/.local/lib/python3.6/site-packages/numpy/core/_methods.py", line 43, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
return umr_minimum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation minimum which has no identity
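In case it helps with debugging: the ValueError is raised when overlap.min() is called on a zero-size array inside the random-crop transform, which suggests at least one sample reaches the augmentation pipeline with an empty boxes array. Below is a minimal, standalone sketch (not part of train_ssd.py; the Annotations path is assumed from the command above) that scans the VOC XML files for annotations with no <object> entries or with degenerate boxes:

import os
import xml.etree.ElementTree as ET

ann_dir = "/XavierSSD/datasets/TrapDet6/Annotations"  # assumed path

for fname in sorted(os.listdir(ann_dir)):
    if not fname.endswith(".xml"):
        continue
    root = ET.parse(os.path.join(ann_dir, fname)).getroot()
    objects = root.findall("object")
    if not objects:
        # An image with no labeled objects yields an empty boxes array in the dataset loader.
        print(fname, "has no <object> entries")
        continue
    for obj in objects:
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        if xmax <= xmin or ymax <= ymin:
            print(fname, "has a degenerate box:", (xmin, ymin, xmax, ymax))

Any file flagged by a scan like this would be worth re-labeling, or removing from the ImageSets/Main split files, before training again.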