Jetson Nano: "Segmentation fault (core dumped)" after training a detection model inside the Docker container

root@nano-desktop:/jetson-inference# python3 train_ssd.py --dataset-type=voc --data=data/myTrain --model-dir=myModel --batch-size=2 --workers=1 --epochs=1
python3: can't open file 'train_ssd.py': [Errno 2] No such file or directory
root@nano-desktop:/jetson-inference# cd python
root@nano-desktop:/jetson-inference/python# cd training
root@nano-desktop:/jetson-inference/python/training# cd detection
root@nano-desktop:/jetson-inference/python/training/detection# cd ssd
root@nano-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/myTrain --model-dir=myModel --batch-size=2 --workers=1 --epochs=1
2022-04-11 06:27:26 - Using CUDA...
2022-04-11 06:27:26 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='myModel', dataset_type='voc', datasets=['data/myTrain'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-04-11 06:27:26 - Prepare training datasets.
warning - image 20220411-141059 has no box/labels annotations, ignoring from dataset
2022-04-11 06:27:26 - VOC Labels read from file: ('BACKGROUND', 'A', 'B')
2022-04-11 06:27:26 - Stored labels into file myModel/labels.txt.
2022-04-11 06:27:26 - Train dataset size: 23
2022-04-11 06:27:26 - Prepare Validation datasets.
warning - image 20220411-141059 has no box/labels annotations, ignoring from dataset
2022-04-11 06:27:26 - VOC Labels read from file: ('BACKGROUND', 'A', 'B')
2022-04-11 06:27:26 - Validation dataset size: 23
2022-04-11 06:27:26 - Build network.
2022-04-11 06:27:26 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-04-11 06:27:27 - Took 0.40 seconds to load the model.
2022-04-11 06:27:36 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-04-11 06:27:36 - Uses CosineAnnealingLR scheduler.
2022-04-11 06:27:36 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2022-04-11 06:27:52 - Epoch: 0, Step: 10/12, Avg Loss: 10.0826, Avg Regression Loss 3.5129, Avg Classification Loss: 6.5696
2022-04-11 06:27:58 - Epoch: 0, Validation Loss: 11.0960, Validation Regression Loss 3.6998, Validation Classification Loss: 7.3962
2022-04-11 06:27:58 - Saved model myModel/mb1-ssd-Epoch-0-Loss-11.09600555896759.pth
2022-04-11 06:27:58 - Task done, exiting program.
Segmentation fault (core dumped)

The generated model file is locked

Hi,

First, please check whether adding more memory, as described below, resolves the segmentation fault:
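
Before changing anything, it can help to see how much RAM and swap the Nano actually has free when training starts. The snippet below is only a minimal sketch, assuming a Linux system that exposes /proc/meminfo; the helper names parse_meminfo and memory_summary are made up for illustration and are not part of jetson-inference.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Return (available RAM, free swap, total swap) in MiB."""
    def to_mib(kb):
        return kb // 1024
    return (to_mib(info.get("MemAvailable", 0)),
            to_mib(info.get("SwapFree", 0)),
            to_mib(info.get("SwapTotal", 0)))


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            avail, swap_free, swap_total = memory_summary(parse_meminfo(f.read()))
    except OSError:
        print("/proc/meminfo is not readable on this system")
    else:
        print(f"Available RAM: {avail} MiB")
        print(f"Swap: {swap_free} MiB free of {swap_total} MiB")
        if swap_total == 0:
            print("No swap is configured; training may run out of memory.")
```

Running it once before and once during training shows whether available memory drops close to zero, which would support the memory explanation.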

Because the file was written by the root account (inside Docker), please open it as root or change its owner first.
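
To confirm that ownership is what makes the file look locked, you can check who owns it and whether your current user may write to it. This is a minimal sketch using only the Python standard library; describe_owner is a hypothetical helper name, and the file path is whatever model file you pass on the command line.

```python
import os
import pwd
import sys


def describe_owner(path):
    """Return (owner name, True if the current user may write to the file)."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid with no passwd entry, e.g. created in a container
    return owner, os.access(path, os.W_OK)


if __name__ == "__main__" and len(sys.argv) > 1:
    owner, writable = describe_owner(sys.argv[1])
    print(f"{sys.argv[1]}: owned by {owner}, writable by you: {writable}")
    if not writable:
        print("Open it as root, or change the owner from the host with chown.")
```

If the owner is reported as root, running something like sudo chown -R $USER:$USER on the model directory from the host should make the files editable again (the exact directory depends on where the container volume is mounted).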
Thanks.

root@nano-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/fand --model-dir=models/fand --batch-size=2 --workers=1 --epochs=1
2022-04-11 10:39:35 - Using CUDA...
2022-04-11 10:39:35 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/fand', dataset_type='voc', datasets=['data/fand'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-04-11 10:39:35 - Prepare training datasets.
2022-04-11 10:39:35 - No labels file, using default VOC classes.
2022-04-11 10:39:35 - Stored labels into file models/fand/labels.txt.
2022-04-11 10:39:35 - Train dataset size: 20
2022-04-11 10:39:35 - Prepare Validation datasets.
2022-04-11 10:39:35 - No labels file, using default VOC classes.
2022-04-11 10:39:35 - Validation dataset size: 18
2022-04-11 10:39:35 - Build network.
2022-04-11 10:39:36 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-04-11 10:39:36 - Took 0.43 seconds to load the model.
2022-04-11 10:39:46 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-04-11 10:39:46 - Uses CosineAnnealingLR scheduler.
2022-04-11 10:39:46 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
warning - image 20220411-103513 has object with unknown class 'B'
warning - image 20220411-103556 has object with unknown class 'B'
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 354, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 980, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1005, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataset.py", line 207, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 81, in __getitem__
    image, boxes, labels = self.transform(image, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 34, in __call__
    return self.augment(img, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 55, in __call__
    img, boxes, labels = t(img, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 277, in __call__
    overlap = jaccard_numpy(boxes, rect)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 30, in jaccard_numpy
    inter = intersect(box_a, box_b)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 13, in intersect
    max_xy = np.minimum(box_a[:, 2:], box_b[2:])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

What is the reason for this error?

Hi,

It seems there is a problem loading the custom 'data/myTrain' dataset.
Do you see any errors when using the default 'Open Images' dataset?

Thanks.

It appears that one of the XML files in your dataset is invalid or has invalid bounding box data. To find out which one it is, uncomment this line of code inside the container (e.g. using the nano text editor):

https://github.com/dusty-nv/pytorch-ssd/blob/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/datasets/voc_dataset.py#L76

Then run train_ssd.py with these options: --batch-size=1 --num-workers=1 --debug-steps=1

The last image info printed before the exception occurs belongs to the image that is causing the problem.
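
As an alternative way to look for the offending file without editing voc_dataset.py, the annotations can be scanned offline. The sketch below is not part of pytorch-ssd; it assumes the usual Pascal VOC layout (an Annotations/ folder of XML files with object/name and object/bndbox entries) and, as a further assumption, a labels.txt in the dataset root listing one class per line. The function names check_annotation and scan_dataset are hypothetical. It flags files that fail to parse, objects whose class is not in labels.txt, boxes that are missing, non-numeric, or empty, and files left with no usable box at all. The last case may be relevant here, since the log above reports "No labels file, using default VOC classes" followed by "unknown class 'B'" warnings, and an image with no remaining boxes would be consistent with the 1-dimensional array in the IndexError.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def check_annotation(xml_path, known_classes=None):
    """Return a list of problems found in one Pascal VOC annotation file."""
    try:
        root = ET.parse(str(xml_path)).getroot()
    except ET.ParseError as exc:
        return [f"cannot parse XML: {exc}"]

    problems = []
    usable = 0
    for obj in root.findall("object"):
        name = (obj.findtext("name") or "").strip()
        if known_classes is not None and name not in known_classes:
            problems.append(f"unknown class '{name}'")
            continue
        box = obj.find("bndbox")
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (AttributeError, TypeError, ValueError):
            problems.append(f"object '{name}' has a missing or non-numeric bndbox")
            continue
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"object '{name}' has an empty box")
            continue
        usable += 1
    if usable == 0:
        problems.append("no usable boxes")
    return problems


def scan_dataset(dataset_root):
    """Map annotation file name -> problems, for files that have any."""
    root = Path(dataset_root)
    labels_file = root / "labels.txt"  # assumed location of the class list
    known = None
    if labels_file.exists():
        known = {ln.strip() for ln in labels_file.read_text().splitlines() if ln.strip()}
    report = {}
    for xml_path in sorted((root / "Annotations").glob("*.xml")):
        problems = check_annotation(xml_path, known)
        if problems:
            report[xml_path.name] = problems
    return report


if __name__ == "__main__" and len(sys.argv) > 1:
    for name, problems in scan_dataset(sys.argv[1]).items():
        print(f"{name}: {'; '.join(problems)}")
```

Saved under a name of your choice and run with the dataset directory as its argument (for example data/fand), it prints one line per suspicious annotation file. When labels.txt is absent, class names are not checked, so only parse errors and box problems are reported.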

If you continue having issues with it, you can send me your dataset and I can try it.
