Successful training with "train_ssd.py" using small custom data set, but error on full data set

I have successfully trained SSD detectnet with a small test portion of my data set (the first 20 images), but when I try to train with the entire data set (~1100 images) I get the error below:

Because I hand labeled ~1100 images I am very invested in figuring this out!

Background:
These images were labeled with labelImg (https://github.com/tzutalin/labelImg) in VOC output format.

The file names (for ImageSets/Main/…) were extracted with:

find * -type f -print | sed 's/\.[^.]*$//'

and then pasted into the test, train, trainval, and val text files.
(be sure to remove whitespace at the end of the paste… ask me how I know)
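
A rough Python sketch that does the same thing and skips the paste step entirely (untested; the paths are just examples from my setup):

# rough sketch -- write the ImageSets/Main lists straight from JPEGImages
# (assumes the Annotations / ImageSets/Main / JPEGImages layout described below)
import os

dataset_root = '/XavierSSD/datasets/TrapDet6'   # adjust to your dataset path
image_dir = os.path.join(dataset_root, 'JPEGImages')
sets_dir = os.path.join(dataset_root, 'ImageSets', 'Main')
os.makedirs(sets_dir, exist_ok=True)

# strip the extension from every image file name (no trailing whitespace to worry about)
ids = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir)
             if f.lower().endswith(('.jpg', '.jpeg', '.png')))

# here every id goes into every list; split train/val however you prefer
for name in ('train.txt', 'val.txt', 'trainval.txt', 'test.txt'):
    with open(os.path.join(sets_dir, name), 'w') as fp:
        fp.write('\n'.join(ids) + '\n')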

The folder structure is Annotations, ImageSets/Main, and JPEGImages

Any suggestions why it runs fine with 20 images and crashes on the entire ~1100 image dataset?
Are overlapping boxes not allowed? The first 20 images do not have any overlapping labels; the others do.
I need some guidance…

Note that the first two warnings (with the hyperlink) also appeared on the successful runs.
Error below:

python3 train_ssd.py --dataset-type=voc --epochs=100 --data=/XavierSSD/datasets/TrapDet6 --model-dir=TrapDet6
2020-10-12 20:39:07 - Using CUDA…
2020-10-12 20:39:07 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='TrapDet6', dataset_type='voc', datasets=['/XavierSSD/datasets/TrapDet6'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=100, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2020-10-12 20:39:07 - Prepare training datasets.
2020-10-12 20:39:07 - VOC Labels read from file: ('BACKGROUND', 'person')
2020-10-12 20:39:07 - Stored labels into file TrapDet6/labels.txt.
2020-10-12 20:39:07 - Train dataset size: 1084
2020-10-12 20:39:07 - Prepare Validation datasets.
2020-10-12 20:39:07 - VOC Labels read from file: ('BACKGROUND', 'person')
2020-10-12 20:39:07 - Validation dataset size: 1084
2020-10-12 20:39:07 - Build network.
2020-10-12 20:39:07 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2020-10-12 20:39:08 - Took 0.15 seconds to load the model.
2020-10-12 20:39:12 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2020-10-12 20:39:12 - Uses CosineAnnealingLR scheduler.
2020-10-12 20:39:12 - Start training from epoch 0.
/home/kel/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/home/kel/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/kel/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 207, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 69, in __getitem__
    image, boxes, labels = self.transform(image, boxes, labels)
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 34, in __call__
    return self.augment(img, boxes, labels)
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 55, in __call__
    img, boxes, labels = t(img, boxes, labels)
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 280, in __call__
    if overlap.min() < min_iou and max_iou < overlap.max():
  File "/home/kel/.local/lib/python3.6/site-packages/numpy/core/_methods.py", line 43, in _amin
    return umr_minimum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation minimum which has no identity

Hi,

If you run the training job on a Nano, it is possible to hit memory issues when the dataset gets larger.

Would you mind checking whether you are running out of memory first?
This can be done by monitoring the system with tegrastats.

$ sudo tegrastats

Thanks.

Ah yes, good question. I forgot to mention that this job is actually training on a Xavier NX from an installed NVMe drive.
When executing, the script errors out at about 75% RAM usage.

The clues may be in the last few lines of the error message. I don't have a good understanding of how the program flows or where to begin searching for the hang-up. Is there a good way to have it print to screen which file it is currently loading, so I can compare what is different between the first files that load and whatever file it's loading when the error occurs?

That’s a good idea - in jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py under line 57 add a statement like:

print('loading image ' + image_id)

I haven’t seen this particular error before, but I wonder if some bounding box in one of the files is corrupted or missing.
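
For what it is worth, that final ValueError is simply what numpy raises when .min() is called on an empty array, so it looks like the boxes array for one of the samples came back empty. A tiny illustration of the failure mode (hypothetical values, not from the actual run):

import numpy as np

# IoU of a random crop against zero ground-truth boxes -> an empty array
overlap = np.zeros((0,), dtype=np.float32)
try:
    overlap.min()
except ValueError as e:
    print(e)   # zero-size array to reduction operation minimum which has no identity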

That was a hard one to diagnose…

It turns out that if you check the "difficult" box in the labelImg software, that annotation comes through as a null when loaded for training, and that was causing the crash.

DON'T CHECK THE DIFFICULT BOX in labelImg

Because I only checked the box about 20 times across the ~1100-image set, it was hard to isolate the problem files.
The file that caused the crash would sometimes be hidden 20 files up the list.
Welcome to the parallel-processing world?
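
For anyone else trying to track down files like this, a quick offline scan of the Annotations folder for files where every object is flagged difficult should turn them up. A rough sketch (untested; assumes labelImg's standard VOC XML output and my dataset path):

import glob
import xml.etree.ElementTree as ET

# list annotation files where every object is marked <difficult>1</difficult>
for xml_file in sorted(glob.glob('/XavierSSD/datasets/TrapDet6/Annotations/*.xml')):
    objects = ET.parse(xml_file).getroot().findall('object')
    if objects and all(obj.findtext('difficult', default='0').strip() == '1' for obj in objects):
        print('all objects marked difficult:', xml_file)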

Thanks so much for the help with printing the image name to the terminal.
That was instrumental in troubleshooting the problem.

Slightly below that line in voc_dataset.py there was a print statement that gave a bit more info, so I just uncommented it:
print('__getitem__ image_id=' + str(image_id) + ' \nboxes=' + str(boxes) + ' \nlabels=' + str(labels))

The incredibly fast support from this forum has allowed me to keep my learning momentum.
Thanks so much!

Thanks @clay1, glad to hear that you got it working. For anyone who hits a similar issue in the future: you could add a keep_difficult=True option to the VOCDataset constructor here:

https://github.com/dusty-nv/pytorch-ssd/blob/7e6caf263a6250ec2ea29d21a8c42ef84aa6d7d0/train_ssd.py#L213

I might need to enable that by default, or add a command line option for it. If you are training on the Pascal VOC dataset itself, you probably want this to be false - but for your own datasets, maybe true.
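
For reference, a rough sketch of what that could look like where the training dataset is created in train_ssd.py (untested; the surrounding arguments may differ between versions of the script):

# pass keep_difficult=True when building the VOC training dataset
dataset = VOCDataset(dataset_path, transform=train_transform,
                     target_transform=target_transform,
                     keep_difficult=True)   # keep objects flagged 'difficult' in labelImg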