I cannot train a detection model. I get the error RuntimeError: Error(s) in loading state_dict for SSD: Unexpected key(s) in state_dict.

This is the complete error:

RuntimeError: Error(s) in loading state_dict for SSD:
Unexpected key(s) in state_dict: "base_net.1.0.weight", "base_net.1.1.weight", "base_net.1.1.bias", "base_net.1.1.running_mean", "base_net.1.1.running_var", "base_net.1.3.weight", "base_net.1.4.weight", "base_net.1.4.bias", "base_net.1.4.running_mean", "base_net.1.4.running_var", "base_net.2.0.weight", "base_net.2.1.weight", "base_net.2.1.bias", "base_net.2.1.running_mean", "base_net.2.1.running_var", "base_net.2.3.weight", "base_net.2.4.weight", "base_net.2.4.bias", "base_net.2.4.running_mean", "base_net.2.4.running_var", "base_net.3.0.weight", "base_net.3.1.weight", "base_net.3.1.bias", "base_net.3.1.running_mean", "base_net.3.1.running_var", "base_net.3.3.weight", "base_net.3.4.weight", "base_net.3.4.bias", "base_net.3.4.running_mean", "base_net.3.4.running_var", "base_net.4.0.weight", "base_net.4.1.weight", "base_net.4.1.bias", "base_net.4.1.running_mean", "base_net.4.1.running_var", "base_net.4.3.weight", "base_net.4.4.weight", "base_net.4.4.bias", "base_net.4.4.running_mean", "base_net.4.4.running_var", "base_net.5.0.weight", "base_net.5.1.weight", "base_net.5.1.bias", "base_net.5.1.running_mean", "base_net.5.1.running_var", "base_net.5.3.weight", "base_net.5.4.weight", "base_net.5.4.bias", "base_net.5.4.running_mean", "base_net.5.4.running_var", "base_net.6.0.weight", "base_net.6.1.weight", "base_net.6.1.bias", "base_net.6.1.running_mean", "base_net.6.1.running_var", "base_net.6.3.weight", "base_net.6.4.weight", "base_net.6.4.bias", "base_net.6.4.running_mean", "base_net.6.4.running_var", "base_net.7.0.weight", "base_net.7.1.weight", "base_net.7.1.bias", "base_net.7.1.running_mean", "base_net.7.1.running_var", "base_net.7.3.weight", "base_net.7.4.weight", "base_net.7.4.bias", "base_net.7.4.running_mean", "base_net.7.4.running_var", "base_net.8.0.weight", "base_net.8.1.weight", "base_net.8.1.bias", "base_net.8.1.running_mean", "base_net.8.1.running_var", "base_net.8.3.weight", "base_net.8.4.weight", "base_net.8.4.bias", "base_net.8.4.running_mean", "base_net.8.4.running_var", "base_net.9.0.weight", "base_net.9.1.weight", "base_net.9.1.bias", "base_net.9.1.running_mean", "base_net.9.1.running_var", "base_net.9.3.weight", "base_net.9.4.weight", "base_net.9.4.bias", "base_net.9.4.running_mean", "base_net.9.4.running_var", "base_net.10.0.weight", "base_net.10.1.weight", "base_net.10.1.bias", "base_net.10.1.running_mean", "base_net.10.1.running_var", "base_net.10.3.weight", "base_net.10.4.weight", "base_net.10.4.bias", "base_net.10.4.running_mean", "base_net.10.4.running_var", "base_net.11.0.weight", "base_net.11.1.weight", "base_net.11.1.bias", "base_net.11.1.running_mean", "base_net.11.1.running_var", "base_net.11.3.weight", "base_net.11.4.weight", "base_net.11.4.bias", "base_net.11.4.running_mean", "base_net.11.4.running_var", "base_net.12.0.weight", "base_net.12.1.weight", "base_net.12.1.bias", "base_net.12.1.running_mean", "base_net.12.1.running_var", "base_net.12.3.weight", "base_net.12.4.weight", "base_net.12.4.bias", "base_net.12.4.running_mean", "base_net.12.4.running_var", "base_net.13.0.weight", "base_net.13.1.weight", "base_net.13.1.bias", "base_net.13.1.running_mean", "base_net.13.1.running_var", "base_net.13.3.weight", "base_net.13.4.weight", "base_net.13.4.bias", "base_net.13.4.running_mean", "base_net.13.4.running_var", "extras.0.0.weight", "extras.0.0.bias", "extras.0.2.weight", "extras.0.2.bias", "extras.1.0.weight", "extras.1.0.bias", "extras.1.2.weight", "extras.1.2.bias", "extras.2.0.weight", "extras.2.0.bias", "extras.2.2.weight", "extras.2.2.bias", "extras.3.0.weight", "extras.3.0.bias", "extras.3.2.weight", "extras.3.2.bias".

Thanks in advance.

PS: I am using the code from the dusty-nv repository called "jetson-inference". I am doing the same as the instructor in the tutorials for training detection models, but I get this error. I don't know why this is happening.
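If I understand the message correctly, the checkpoint contains parameters that the network is not expecting. Here is a tiny toy example, just to show the kind of mismatch PyTorch is complaining about; the layer names are made up and this is not the real SSD code:

import torch
import torch.nn as nn

# Toy model that only has a classification head
model = nn.Sequential()
model.add_module("classification_headers", nn.Linear(4, 2))

# Toy checkpoint that additionally contains a "base_net" entry the model does not have
checkpoint = {
    "classification_headers.weight": torch.zeros(2, 4),
    "classification_headers.bias": torch.zeros(2),
    "base_net.0.weight": torch.zeros(8, 4),  # extra key the model does not expect
}

try:
    model.load_state_dict(checkpoint)  # strict=True is the default
except RuntimeError as e:
    print(e)  # Unexpected key(s) in state_dict: "base_net.0.weight"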

Which Jetson platform are you using?

Hi,

This error might occur if net.to("cuda:0") is not working.
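You can run a quick sanity check like the one below (a minimal snippet, not part of train_ssd.py) to confirm that CUDA is visible to PyTorch:

import torch

print(torch.__version__)
print(torch.cuda.is_available())      # should print True with a GPU-enabled wheel

if torch.cuda.is_available():
    x = torch.zeros(1).to("cuda:0")   # same kind of call as net.to("cuda:0")
    print(x.device)                   # expect: cuda:0

If is_available() returns False, the installed wheel was likely built without GPU support.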

How did you install PyTorch for Jetson?
Could you try our package shared in the topic below?

Thanks.

I don't have that error anymore, but now, at epoch 25 out of 30, I am getting the following error:

2021-06-17 21:09:09 - Epoch: 25, Step: 370/1455, Avg Loss: 4.7977, Avg Regression Loss 2.1495, Avg Classification Loss: 2.6482
2021-06-17 21:09:14 - Epoch: 25, Step: 380/1455, Avg Loss: 4.2170, Avg Regression Loss 1.5167, Avg Classification Loss: 2.7003
2021-06-17 21:09:19 - Epoch: 25, Step: 390/1455, Avg Loss: 4.1998, Avg Regression Loss 1.4369, Avg Classification Loss: 2.7629
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 380) is killed by signal: Segmentation fault.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_ssd.py", line 343, in <module>
device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 113, in train
for i, data in enumerate(loader):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 974, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 941, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 792, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 380) exited unexpectedly

Hi @100375195, can you keep an eye on the memory usage (with tegrastats or jtop) to see if maybe it ran out of memory?

You may want to try running with --batch-size 2 --workers 1 to decrease the load.
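Those two flags map onto the underlying PyTorch DataLoader settings, roughly like this (a simplified sketch with a dummy dataset, not the actual train_ssd.py code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the detection dataset used by train_ssd.py
dataset = TensorDataset(torch.randn(16, 3, 300, 300), torch.zeros(16, dtype=torch.long))

# Fewer worker processes and a smaller batch size lower the peak memory usage
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=1)

for images, labels in loader:
    pass  # the training step would go here

The trade-off is slower epochs, but it keeps memory pressure down on the Jetson.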

Thanks for your answer!

OK, I will train with 1 worker and batch size = 2 and see if it works.

It worked!! Thanks very much!