Unexpected Segmentation Fault encountered in Worker

Dear Experts,
I am training on my custom dataset using a Jetson Xavier NX development board. I have two object classes.
For training, I use:
python3 train_ssd.py --dataset-type=voc --data=data/Food --model-dir=models/Food --batch-size=4 --epochs=100

At the 51st epoch, I got an "Unexpected segmentation fault encountered in worker" error:

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/gbewegung/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/home/gbewegung/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 13276) is killed by signal: Segmentation fault.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_ssd.py", line 396, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 133, in train
for i, data in enumerate(loader):
File "/home/gbewegung/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/home/gbewegung/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
idx, data = self._get_data()
File "/home/gbewegung/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
success, data = self._try_get_data()
File "/home/gbewegung/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 792, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))

Note: the Jetson Xavier NX trained for the full 100 epochs when I used 1,200 training images. But when I increased to 1,600 images, I got the segmentation fault.

Hi,

Which sample do you use?

How many workers are launched in the app?
Could you check if setting NUM_WORKERS=1 can help?
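For context, here is a minimal sketch of where that setting ends up in PyTorch. The toy dataset below is just a placeholder to illustrate the argument, not the actual dataset code in train_ssd.py:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset, only to illustrate the num_workers argument;
# train_ssd.py builds its own VOC dataset instead.
dataset = TensorDataset(torch.zeros(16, 3, 300, 300), torch.zeros(16, dtype=torch.long))

# num_workers=0 loads batches in the main process (no worker subprocesses);
# num_workers=1 spawns a single background worker process, as suggested above.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1)

for images, labels in loader:
    pass  # training step would go here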

Thanks.

Hi,
Which sample do you use?
If I understand correctly, I use images from a Google search.

I set num_workers=0, but I still got an error.

Thanks

Hi,

Please try NUM_WORKERS=1.
Do you use the train_ssd.py from our sample?

Thanks.

I tried NUM_WORKERS=1 and still got a segmentation fault. I use the jetson-inference SSD sample for object detection.

Hi @jaganathan.comm, I would recommend running train_ssd.py with --workers=1 --batch-size=1 --debug-steps=1 --log-level=debug; it will print out each image as it is loaded during training. This should allow you to see which one is causing the segfault (perhaps there is a corrupt image).
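Combining those flags with the command from your first post (assuming the same dataset paths) would look something like:

python3 train_ssd.py --dataset-type=voc --data=data/Food --model-dir=models/Food --batch-size=1 --workers=1 --debug-steps=1 --log-level=debug --epochs=100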


Yes, there are a lot of corrupted training images, and it takes a lot of time to remove them. I got those images from the Google search engine.
May I know how you check for corrupted images? What logic checks for a corrupted image, and in which line of the code is that done?

There isn't code built into train_ssd.py that detects corrupted images - that can be difficult to quantify (beyond the image simply failing to load, it can be too small, or the image itself can have aberrations, etc.). If needed, you could write a Python or bash script that loops over all the images in your directory and attempts to load each one.
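For example, here is a minimal sketch of such a checker using Pillow. It is a hypothetical helper, not part of jetson-inference; the file extensions and directory handling are assumptions, so point it at your own image folder (e.g. the JPEGImages directory of your VOC dataset):

#!/usr/bin/env python3
# Hypothetical helper script: walk a directory of images and report any
# that fail to open or fully decode with Pillow.
import os
import sys
from PIL import Image

IMAGE_EXTS = ('.jpg', '.jpeg', '.png', '.bmp', '.gif')

def find_corrupt_images(root):
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(IMAGE_EXTS):
                continue
            path = os.path.join(dirpath, name)
            try:
                # verify() checks the file structure without decoding pixels
                with Image.open(path) as img:
                    img.verify()
                # re-open and force a full decode to catch truncated files
                with Image.open(path) as img:
                    img.load()
            except Exception as err:
                bad.append((path, str(err)))
    return bad

if __name__ == '__main__':
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    for path, err in find_corrupt_images(root):
        print('CORRUPT: {} ({})'.format(path, err))

Running it over your dataset directory prints each file that Pillow cannot decode, so you can delete or replace those images before restarting training.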
