I am getting "ERROR: Unexpected segmentation fault encountered in worker." in the middle of training a detection model.

This is the complete message I got, in the middle of epoch 25 (out of 30):

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
RuntimeError: DataLoader worker (pid 380) is killed by signal: Segmentation fault.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 792, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 380) exited unexpectedly

I do not know what could be happening. Thanks in advance.

This issue has two possible causes:

  1. If you were training in a Docker container, its shared memory was probably not enough: PyTorch DataLoader workers pass batches to the main process through shared memory, and Docker's default /dev/shm is only 64 MB. Recreate the container with --ipc=host (or a larger --shm-size) to lift the shared-memory limit.
  2. If you were training directly on the host, the OpenCV version you were using may be the problem: its internal threads can deadlock in the forked worker processes spawned by dataloader.py. Either install the latest PyTorch (1.9 or newer), or disable OpenCV's threading inside the dataset's __getitem__:

    def __getitem__(self, idx):
        import cv2
        cv2.setNumThreads(0)  # disable OpenCV's internal thread pool in this worker
        ...
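To check whether cause 1 applies, you can inspect the size of the shared-memory filesystem from inside the container. A minimal stdlib sketch (the function name and the 1 GiB warning threshold are my own choices):

```python
import os
import shutil

def shm_total_bytes(path="/dev/shm"):
    """Size of the shared-memory filesystem in bytes, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total

size = shm_total_bytes()
if size is not None and size < 1 << 30:  # Docker's default /dev/shm is only 64 MB
    print("/dev/shm is small (%d MB); consider --ipc=host or --shm-size" % (size >> 20))
```

If this prints a small number (64 MB is the Docker default), recreating the container as described in point 1 is the likely fix.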
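For cause 2, an alternative to editing __getitem__ is to disable OpenCV's threading once per worker through the DataLoader's worker_init_fn hook. A sketch, assuming OpenCV (cv2) is installed in the training environment; the function name is mine:

```python
def disable_cv2_threads(worker_id):
    # Runs once in each freshly spawned DataLoader worker process.
    import cv2
    cv2.setNumThreads(0)  # stop OpenCV from creating its own thread pool

# Usage (with your existing dataset):
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4,
#                                      worker_init_fn=disable_cv2_threads)
```

This keeps the dataset code untouched and guarantees the setting is applied after the worker process forks, which is when the deadlock would otherwise occur.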