Problem Training the SSD-Mobilenet Model

When training SSD in jetson-inference/python/training/detection/ssd/

with

python3 train_ssd.py --data=data/gemi --model-dir=models/gemi --batch-size=4 --epochs=30

it shows me this error:

2020-10-09 14:28:09 - Epoch: 0, Step: 6470/6880, Avg Loss: 4.9857, Avg Regression Loss 2.5101, Avg Classification Loss: 2.4756
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/sadsavunma/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/sadsavunma/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 24483) is killed by signal: Segmentation fault. 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/home/sadsavunma/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/sadsavunma/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/home/sadsavunma/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/home/sadsavunma/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 792, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))

Can you try running train_ssd.py with the --workers=0 option? You may also want to try --batch-size=2 as well.
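
For example, something along these lines (reusing your paths from above):

python3 train_ssd.py --data=data/gemi --model-dir=models/gemi --batch-size=2 --workers=0 --epochs=30

--workers=0 loads the data in the main process instead of in separate DataLoader worker processes, which takes the segfaulting worker out of the picture, and the smaller batch size reduces memory pressure.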

If you check dmesg, do you see any out-of-memory messages? Can you keep an eye on the memory utilization with tegrastats while training runs? Also, you may want to mount swap if you have not already done so.
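
For example (the 4 GB size and the /mnt/4GB.swap path are just examples; use whatever fits your setup):

dmesg | grep -iE "out of memory|oom"     # look for OOM-killer messages
tegrastats                               # watch RAM/swap usage while training runs

sudo fallocate -l 4G /mnt/4GB.swap       # create and enable a swap file
sudo chmod 600 /mnt/4GB.swap
sudo mkswap /mnt/4GB.swap
sudo swapon /mnt/4GB.swap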

If you are training on a custom dataset, the other possibility is that one of the images or annotations is corrupt or malformed.
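
One way to narrow that down is to try opening every image and parsing every annotation yourself and see which file fails. Below is a minimal sketch, assuming a Pascal VOC-style layout (JPEGImages/ and Annotations/ under the dataset root); if your dataset is in a different format, the same idea applies to whatever image folders and annotation files it uses:

import os
import xml.etree.ElementTree as ET
from PIL import Image

DATASET = "data/gemi"   # adjust to your dataset root

# try to fully decode every image
images_dir = os.path.join(DATASET, "JPEGImages")
for name in sorted(os.listdir(images_dir)):
    path = os.path.join(images_dir, name)
    try:
        with Image.open(path) as img:
            img.load()   # force a full decode, not just the header
    except Exception as e:
        print("bad image:", path, e)

# try to parse every annotation and check that it contains at least one object
ann_dir = os.path.join(DATASET, "Annotations")
for name in sorted(os.listdir(ann_dir)):
    path = os.path.join(ann_dir, name)
    try:
        root = ET.parse(path).getroot()
        if root.find("object") is None:
            print("no objects in:", path)
    except Exception as e:
        print("bad annotation:", path, e)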

Flashed the Jetson.