Jetson Nano: Segmentation fault (core dumped) when training a detection model inside Docker

2648356432 · April 11, 2022, 6:50am

root@nano-desktop:/jetson-inference# python3 train_ssd.py --dataset-type=voc --data=data/myTrain --model-dir=myModel --batch-size=2 --workers=1 --epochs=1
python3: can't open file 'train_ssd.py': [Errno 2] No such file or directory
root@nano-desktop:/jetson-inference# cd python
root@nano-desktop:/jetson-inference/python# cd training
root@nano-desktop:/jetson-inference/python/training# cd detection
root@nano-desktop:/jetson-inference/python/training/detection# cd ssd
root@nano-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/myTrain --model-dir=myModel --batch-size=2 --workers=1 --epochs=1
2022-04-11 06:27:26 - Using CUDA...
2022-04-11 06:27:26 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='myModel', dataset_type='voc', datasets=['data/myTrain'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-04-11 06:27:26 - Prepare training datasets.
warning - image 20220411-141059 has no box/labels annotations, ignoring from dataset
2022-04-11 06:27:26 - VOC Labels read from file: ('BACKGROUND', 'A', 'B')
2022-04-11 06:27:26 - Stored labels into file myModel/labels.txt.
2022-04-11 06:27:26 - Train dataset size: 23
2022-04-11 06:27:26 - Prepare Validation datasets.
warning - image 20220411-141059 has no box/labels annotations, ignoring from dataset
2022-04-11 06:27:26 - VOC Labels read from file: ('BACKGROUND', 'A', 'B')
2022-04-11 06:27:26 - Validation dataset size: 23
2022-04-11 06:27:26 - Build network.
2022-04-11 06:27:26 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-04-11 06:27:27 - Took 0.40 seconds to load the model.
2022-04-11 06:27:36 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-04-11 06:27:36 - Uses CosineAnnealingLR scheduler.
2022-04-11 06:27:36 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
2022-04-11 06:27:52 - Epoch: 0, Step: 10/12, Avg Loss: 10.0826, Avg Regression Loss 3.5129, Avg Classification Loss: 6.5696
2022-04-11 06:27:58 - Epoch: 0, Validation Loss: 11.0960, Validation Regression Loss 3.6998, Validation Classification Loss: 7.3962
2022-04-11 06:27:58 - Saved model myModel/mb1-ssd-Epoch-0-Loss-11.09600555896759.pth
2022-04-11 06:27:58 - Task done, exiting program.
Segmentation fault (core dumped)

2648356432 · April 11, 2022, 7:04am

[Screenshot attached: QQ image 20220411150419]

2648356432 · April 11, 2022, 7:06am

The generated model file is locked.

AastaLLL · April 11, 2022, 8:38am

Hi,

First, please check whether adding more memory (e.g. by mounting swap) helps with the segmentation fault.
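On a Jetson Nano, adding memory in practice means mounting swap. Swap is configured by the kernel for the whole system, so it is normally set up on the host rather than inside the container. Before retrying, it can help to confirm how much RAM and swap the system reports. The sketch below parses `/proc/meminfo` with the standard library only; `parse_meminfo` and `read_meminfo` are hypothetical helpers, not part of jetson-inference. Inside a Docker container, `/proc/meminfo` typically reflects the host's memory.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value_in_kB}."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            # Most values are reported in kB; a few (e.g. HugePages_*) are plain counts.
            info[key.strip()] = int(fields[0])
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the kernel's memory report."""
    with open(path) as f:
        return parse_meminfo(f.read())


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    mem = read_meminfo()
    kb_per_gib = 1024 * 1024
    for field in ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree"):
        print(f"{field:>12}: {mem.get(field, 0) / kb_per_gib:.2f} GiB")
    if mem.get("SwapTotal", 0) == 0:
        print("No swap is active; consider mounting swap before training.")
```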

Since the file was written by the root account (inside Docker), please open it as root, or change its owner first.
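If the model files were created as root inside the container, one way to make them usable by a regular account is to hand them over recursively. On the host this is usually done with `sudo chown -R`; the Python sketch below does the same with `os.chown` and `os.walk`. `chown_tree` is a hypothetical helper, and giving files to another user requires sufficient privileges (for example, running it as root inside the container). The user and group IDs are assumptions to replace with your own (`id -u` and `id -g` print them).

```python
import os


def chown_tree(root, uid, gid):
    """Recursively set the owner and group of `root` and everything below it.

    Giving files to a different user normally requires root privileges.
    Symlinks are changed themselves rather than the files they point to.
    """
    os.chown(root, uid, gid)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            os.chown(os.path.join(dirpath, name), uid, gid, follow_symlinks=False)


# Example with hypothetical IDs (1000 is often the first user account on Ubuntu):
# chown_tree("myModel", 1000, 1000)
```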
Thanks.

2648356432 · April 11, 2022, 10:47am

root@nano-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/fand --model-dir=models/fand --batch-size=2 --workers=1 --epochs=1
2022-04-11 10:39:35 - Using CUDA...
2022-04-11 10:39:35 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/fand', dataset_type='voc', datasets=['data/fand'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-04-11 10:39:35 - Prepare training datasets.
2022-04-11 10:39:35 - No labels file, using default VOC classes.
2022-04-11 10:39:35 - Stored labels into file models/fand/labels.txt.
2022-04-11 10:39:35 - Train dataset size: 20
2022-04-11 10:39:35 - Prepare Validation datasets.
2022-04-11 10:39:35 - No labels file, using default VOC classes.
2022-04-11 10:39:35 - Validation dataset size: 18
2022-04-11 10:39:35 - Build network.
2022-04-11 10:39:36 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-04-11 10:39:36 - Took 0.43 seconds to load the model.
2022-04-11 10:39:46 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-04-11 10:39:46 - Uses CosineAnnealingLR scheduler.
2022-04-11 10:39:46 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
warning - image 20220411-103513 has object with unknown class 'B'
warning - image 20220411-103556 has object with unknown class 'B'
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 354, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 980, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1005, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataset.py", line 207, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 81, in __getitem__
    image, boxes, labels = self.transform(image, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 34, in __call__
    return self.augment(img, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 55, in __call__
    img, boxes, labels = t(img, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 277, in __call__
    overlap = jaccard_numpy(boxes, rect)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 30, in jaccard_numpy
    inter = intersect(box_a, box_b)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 13, in intersect
    max_xy = np.minimum(box_a[:, 2:], box_b[2:])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

What is causing this error?
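The log itself suggests one likely explanation for the final `IndexError` (an interpretation, not a confirmed diagnosis): `data/fand` has no labels file ("No labels file, using default VOC classes."), and class 'B' is not a VOC class, so those objects are skipped. If every object in an image is skipped, the list of boxes is empty, and NumPy turns it into a 1-dimensional array of shape `(0,)` instead of `(0, 4)`, which `box_a[:, 2:]` cannot index. The sketch below reproduces this with NumPy. `intersect` follows the code line shown at the bottom of the traceback; `boxes_from_objects` is a hypothetical stand-in for the VOC parser, and the class set is deliberately truncated.

```python
import numpy as np


def intersect(box_a, box_b):
    """Intersection areas between boxes box_a (N, 4) and a single box box_b (4,)."""
    max_xy = np.minimum(box_a[:, 2:], box_b[2:])  # the line that fails in the traceback
    min_xy = np.maximum(box_a[:, :2], box_b[:2])
    inter = np.clip(max_xy - min_xy, a_min=0, a_max=np.inf)
    return inter[:, 0] * inter[:, 1]


def boxes_from_objects(objects, class_names):
    """Hypothetical stand-in for the VOC parser: keep boxes whose class is known."""
    boxes = [bbox for name, bbox in objects if name in class_names]
    return np.array(boxes, dtype=np.float32)


# Only a few of the default VOC class names are listed here; 'B' is not one of them.
default_voc = {"BACKGROUND", "aeroplane", "bicycle", "bird"}
objects = [("B", [10, 20, 110, 220])]  # one object of class 'B': (xmin, ymin, xmax, ymax)
crop = np.array([0, 0, 300, 300])

empty = boxes_from_objects(objects, default_voc)
print(empty.shape)  # (0,): 1-dimensional, not (0, 4)
try:
    intersect(empty, crop)
except IndexError as err:
    print("IndexError:", err)

# With the dataset's own class list, the box is kept and the array has shape (1, 4).
kept = boxes_from_objects(objects, {"BACKGROUND", "A", "B"})
print(kept.shape, intersect(kept, crop))
```

If this is the cause, giving `data/fand` a `labels.txt` that lists its classes should let the 'B' boxes through, as the first dataset evidently had one (its log shows `VOC Labels read from file: ('BACKGROUND', 'A', 'B')`).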

AastaLLL · April 21, 2022, 6:44am

Hi,

It seems that there is an issue when loading the custom 'data/fand' dataset.
Do you see any errors when training with the default Open Images dataset from the tutorial below?

https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-ssd.md


# Re-training SSD-Mobilenet

Next, we'll train our own SSD-Mobilenet object detection model using PyTorch and the [Open Images](https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=detection&c=%2Fm%2F06l9r) dataset.  SSD-Mobilenet is a popular network architecture for realtime object detection on mobile and embedded devices that combines the [SSD-300](https://arxiv.org/abs/1512.02325) Single-Shot MultiBox Detector with a [Mobilenet](https://arxiv.org/abs/1704.04861) backbone.  

<a href="https://arxiv.org/abs/1512.02325"><img src="https://github.com/dusty-nv/jetson-inference/raw/dev/docs/images/pytorch-ssd-mobilenet.jpg"></a>

In the example below, we'll train a custom detection model that locates 8 different varieties of fruit, although you are welcome to pick from any of the [600 classes](https://github.com/dusty-nv/pytorch-ssd/blob/master/open_images_classes.txt) in the Open Images dataset to train your model on.  You can visually browse the dataset [here](https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=detection).

<img src="https://github.com/dusty-nv/jetson-inference/raw/dev/docs/images/pytorch-fruit.jpg">

To get started, first make sure that you have [JetPack 4.4](https://developer.nvidia.com/embedded/jetpack) or newer and [PyTorch installed](pytorch-transfer-learning.md#installing-pytorch) for **Python 3.6** on your Jetson.  JetPack 4.4 includes TensorRT 7.1, which is the minimum TensorRT version that supports loading SSD-Mobilenet via ONNX.  And the PyTorch training scripts used for training SSD-Mobilenet are for Python3, so PyTorch should be installed for Python 3.6.

## Setup

> **note:** first make sure that you have [JetPack 4.4](https://developer.nvidia.com/embedded/jetpack) or newer on your Jetson and [PyTorch installed](pytorch-transfer-learning.md#installing-pytorch) for **Python 3.6**


Thanks.

dusty_nv · April 21, 2022, 4:08pm

It appears that one of the XML files in your dataset is invalid, or has invalid bounding box data. To find out which one, uncomment this line of code inside the container (e.g. using the nano text editor):

https://github.com/dusty-nv/pytorch-ssd/blob/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/datasets/voc_dataset.py#L76

Then run train_ssd.py with these options: --batch-size=1 --num-workers=1 --debug-steps=1

The last image info printed before the exception occurs identifies the image that is causing the problem.
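As a complement to the debug print, the annotations can also be checked offline before training. The script below is a sketch that assumes the usual Pascal VOC layout (an `Annotations/` folder of XML files next to an optional `labels.txt` with one class per line). It flags XML parse errors, objects without a usable `<bndbox>`, boxes with `xmax <= xmin` or `ymax <= ymin`, and class names outside the class list. When `labels.txt` is missing, it checks against the 20 standard VOC classes, mirroring the "No labels file, using default VOC classes" fallback in the log. All function names are hypothetical, and the script does not reproduce every rule `train_ssd.py` applies (for example, its handling of `difficult` objects).

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

# The 20 Pascal VOC classes, used when the dataset has no labels.txt.
VOC_CLASSES = {
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
}


def load_labels(dataset_root):
    """Return the class names in labels.txt (one per line), or None if absent."""
    path = Path(dataset_root) / "labels.txt"
    if not path.exists():
        return None
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def check_annotation(xml_path, classes):
    """Return a list of human-readable problems found in one VOC XML file."""
    try:
        root = ET.parse(xml_path).getroot()
    except ET.ParseError as err:
        return [f"XML parse error: {err}"]

    problems = []
    objects = root.findall("object")
    if not objects:
        problems.append("no <object> entries")
    usable = 0
    for obj in objects:
        name = (obj.findtext("name") or "").strip()
        if name not in classes:
            problems.append(f"class '{name}' is not in the class list")
            continue
        box = obj.find("bndbox")
        if box is None:
            problems.append(f"object '{name}' has no <bndbox>")
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            problems.append(f"object '{name}' has a missing or non-numeric coordinate")
            continue
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"object '{name}' has an empty box {(xmin, ymin, xmax, ymax)}")
            continue
        usable += 1
    if objects and usable == 0:
        problems.append("no usable boxes left after filtering")
    return problems


def main(dataset_root):
    """Print problems for every XML file; return how many files have problems."""
    classes = load_labels(dataset_root)
    if classes is None:
        print(f"warning: no labels.txt in {dataset_root}; using the default VOC classes")
        classes = VOC_CLASSES
    bad_files = 0
    for xml_path in sorted(Path(dataset_root, "Annotations").glob("*.xml")):
        problems = check_annotation(xml_path, classes)
        for problem in problems:
            print(f"{xml_path.name}: {problem}")
        bad_files += 1 if problems else 0
    print(f"{bad_files} annotation file(s) with problems")
    return bad_files


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a hypothetical name such as `check_voc.py`, it would be run from the `ssd` directory as `python3 check_voc.py data/fand`.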

If you continue having issues with it, you can send me your dataset and I can try it.

system · May 11, 2022, 6:35am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
