Successful training with "train_ssd.py" using small custom data set, but error on full data set

I have successfully trained SSD detectnet with a small test portion of my data set (the first 20 images), but when I try to train with the entire data set (~1100 images) I get the error below:

Because I hand labeled ~1100 images I am very invested in figuring this out!

Background:
These images were labeled with labelImg (https://github.com/tzutalin/labelImg) in VOC output format.

The file names (for ImageSets/Main/…) were extracted with:

find * -type f -print | sed 's/\.[^.]*$//'

and then pasted into the test, train, trainval, and val text files.
(be sure to remove whitespace at the end of the paste… ask me how I know)
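
A rough Python sketch that does the same thing and skips the paste step entirely (untested; the paths are just examples from my setup):

# rough sketch -- write the ImageSets/Main lists straight from JPEGImages
# (assumes the Annotations / ImageSets/Main / JPEGImages layout described below)
import os

dataset_root = '/XavierSSD/datasets/TrapDet6'   # adjust to your dataset path
image_dir = os.path.join(dataset_root, 'JPEGImages')
sets_dir = os.path.join(dataset_root, 'ImageSets', 'Main')
os.makedirs(sets_dir, exist_ok=True)

# strip the extension from every image file name (no trailing whitespace to worry about)
ids = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir)
             if f.lower().endswith(('.jpg', '.jpeg', '.png')))

# here every id goes into every list; split train/val however you prefer
for name in ('train.txt', 'val.txt', 'trainval.txt', 'test.txt'):
    with open(os.path.join(sets_dir, name), 'w') as fp:
        fp.write('\n'.join(ids) + '\n')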

The folder structure is Annotations, ImageSets/Main, and JPEGImages

Any suggestions why it runs fine with 20 images and crashes on the entire ~1100 image dataset?
Are overlapping boxes not allowed? The first 20 images do not have any overlapping labels; the others do.
I need some guidance…

Note that the first two warnings (with the hyperlink) also appeared on the successful runs.
Error below:

python3 train_ssd.py --dataset-type=voc --epochs=100 --data=/XavierSSD/datasets/TrapDet6 --model-dir=TrapDet6
2020-10-12 20:39:07 - Using CUDA…
2020-10-12 20:39:07 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='TrapDet6', dataset_type='voc', datasets=['/XavierSSD/datasets/TrapDet6'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=100, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2020-10-12 20:39:07 - Prepare training datasets.
2020-10-12 20:39:07 - VOC Labels read from file: ('BACKGROUND', 'person')
2020-10-12 20:39:07 - Stored labels into file TrapDet6/labels.txt.
2020-10-12 20:39:07 - Train dataset size: 1084
2020-10-12 20:39:07 - Prepare Validation datasets.
2020-10-12 20:39:07 - VOC Labels read from file: ('BACKGROUND', 'person')
2020-10-12 20:39:07 - Validation dataset size: 1084
2020-10-12 20:39:07 - Build network.
2020-10-12 20:39:07 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2020-10-12 20:39:08 - Took 0.15 seconds to load the model.
2020-10-12 20:39:12 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2020-10-12 20:39:12 - Uses CosineAnnealingLR scheduler.
2020-10-12 20:39:12 - Start training from epoch 0.
/home/kel/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/home/kel/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/kel/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/kel/.local/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 207, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 69, in __getitem__
    image, boxes, labels = self.transform(image, boxes, labels)
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 34, in __call__
    return self.augment(img, boxes, labels)
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 55, in __call__
    img, boxes, labels = t(img, boxes, labels)
  File "/XavierSSD/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 280, in __call__
    if overlap.min() < min_iou and max_iou < overlap.max():
  File "/home/kel/.local/lib/python3.6/site-packages/numpy/core/_methods.py", line 43, in _amin
    return umr_minimum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation minimum which has no identity

Hi,

If you run the training job on a Nano, it is possible to hit memory issues when the dataset gets larger.

Would you mind checking whether you are running out of memory first?
This can be done by monitoring the system with tegrastats.

$ sudo tegrastats

Thanks.

Ah yes, good question. I forgot to mention that this job is actually training on a Xavier NX from an installed NVMe drive.
When executing, the script errors out at about 75% RAM usage.

The clues may be in the last few lines of the error message. I don't have a good understanding of how the program flows or where to begin searching for the hang-up. Is there a good way to have it print to screen which file it is currently loading, so I can compare what is different between the first files that load and whatever file it's loading when the error occurs?

That’s a good idea - in jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py under line 57 add a statement like:

print('loading image ' + image_id)

I haven’t seen this particular error before, but I wonder if some bounding box in one of the files is corrupted or missing.
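
For what it is worth, that final ValueError is simply what numpy raises when .min() is called on an empty array, so it looks like the boxes array for one of the samples came back empty. A tiny illustration of the failure mode (hypothetical values, not from the actual run):

import numpy as np

# IoU of a random crop against zero ground-truth boxes -> an empty array
overlap = np.zeros((0,), dtype=np.float32)
try:
    overlap.min()
except ValueError as e:
    print(e)   # zero-size array to reduction operation minimum which has no identity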

That was a hard one to diagnose…

It turns out that if you check the "difficult" box in the labelImg software, that annotation comes through as a null when loaded for training, and that was causing the crash.

DON'T CHECK THE DIFFICULT BOX in labelImg

Because I only checked the box about 20 times across the ~1100-image set, it was hard to isolate the problem files.
The file that caused the crash would sometimes be hidden 20 files up the list.
Welcome to the parallel-processing world?
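
For anyone else trying to track down files like this, a quick offline scan of the Annotations folder for files where every object is flagged difficult should turn them up. A rough sketch (untested; assumes labelImg's standard VOC XML output and my dataset path):

import glob
import xml.etree.ElementTree as ET

# list annotation files where every object is marked <difficult>1</difficult>
for xml_file in sorted(glob.glob('/XavierSSD/datasets/TrapDet6/Annotations/*.xml')):
    objects = ET.parse(xml_file).getroot().findall('object')
    if objects and all(obj.findtext('difficult', default='0').strip() == '1' for obj in objects):
        print('all objects marked difficult:', xml_file)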

Thanks so much for the help with printing the image name to the terminal.
That was instrumental in troubleshooting the problem.

Slightly below that line in voc_dataset.py there was a print statement that gave a bit more info, so I just uncommented it:
print('__getitem__ image_id=' + str(image_id) + ' \nboxes=' + str(boxes) + ' \nlabels=' + str(labels))

The incredibly fast support from this forum has allowed me to keep my learning momentum.
Thanks so much!

Thanks @clay1, glad to hear that you got it working. For anyone who hits a similar issue in the future: you could add a keep_difficult=True option to the VOCDataset constructor here:

https://github.com/dusty-nv/pytorch-ssd/blob/7e6caf263a6250ec2ea29d21a8c42ef84aa6d7d0/train_ssd.py#L213

I might need to enable that by default, or add a command line option for it. If you are training on the Pascal VOC dataset itself, you probably want this to be false - but for your own datasets, maybe true.
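
For reference, a rough sketch of what that could look like where the training dataset is created in train_ssd.py (untested; the surrounding arguments may differ between versions of the script):

# pass keep_difficult=True when building the VOC training dataset
dataset = VOCDataset(dataset_path, transform=train_transform,
                     target_transform=target_transform,
                     keep_difficult=True)   # keep objects flagged 'difficult' in labelImg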