Hello! I am getting a Segmentation fault (core dumped) error on a Jetson Nano when training ResNet-18. The training starts and runs for about 4 minutes, but after some epochs it crashes and the error shows up. Any idea?
Segmentation fault (core dumped) on Jetson Nano when training ResNet-18 on my small dataset of just 60 images using transfer learning!
A possible reason is out of memory.
Please note that the Nano is an embedded system with only 4 GB of memory, which is not well suited for training.
It’s recommended to use a desktop GPU for transfer learning.
But the same code was run by someone in a tutorial, and I have provided the link with this message. That person did not run into any issues. See the link below:
@farjadhaider3253 did you mount SWAP memory? I recommend running sudo tegrastats in another terminal while the training is running to keep an eye on the memory.
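If you would rather log memory from inside the training script itself, here is a minimal sketch that reads /proc/meminfo, which is the standard Linux interface for this. It is not a replacement for tegrastats, and the helper names parse_meminfo and memory_summary are made up for illustration. You could call memory_summary() once per epoch and print the result.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages counters) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(path="/proc/meminfo"):
    """Return available RAM and free swap in MB (hypothetical helper)."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {
        "available_mb": info.get("MemAvailable", 0) // 1024,
        "swap_free_mb": info.get("SwapFree", 0) // 1024,
    }

# Example use inside a training loop (not executed here):
#     print("epoch", epoch, memory_summary(), flush=True)
```

If available_mb and swap_free_mb both drop toward zero shortly before the crash, memory is a likely suspect.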
Also, close any other windows, web browsers, etc. that you may have open on your Nano at the same time.
Another way to reclaim more memory is to disable the GUI while you are training. You can shut it down like this, and then restart it after you are done training:
Also, it occurred to me that if your dataset is only 60 images - perhaps it’s possible that the training has actually finished, and the segfault happens when the script exits?
Try adding a print('done training') at the end of the script to check. If the segfault happens after done training is printed, then it is fine and you can ignore it.
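A rough sketch of what that check could look like is below. The train() function is only a hypothetical placeholder for the real training loop. The sketch also turns on Python's standard faulthandler module, which prints a Python-level traceback to stderr if the process receives a segfault, so you can see which line was running at the time.

```python
import faulthandler

# Print a Python traceback to stderr if the interpreter gets SIGSEGV.
faulthandler.enable()


def train(num_epochs):
    """Hypothetical placeholder for the real training loop."""
    for epoch in range(num_epochs):
        # ... forward/backward passes and checkpoint saving would go here ...
        print(f"epoch {epoch + 1}/{num_epochs} finished", flush=True)
    return num_epochs


def main(num_epochs=3):
    completed = train(num_epochs)
    # flush=True so the message reaches the terminal before Python shuts down.
    print("done training", flush=True)
    return completed


main()
```

If done training shows up and the segfault comes after it, the crash is happening while Python shuts down, not during training.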
Currently, there may be a crash in PyTorch when Python is shutting down:
Thank you very much @dusty_nv for your help. I had mounted SWAP memory before training. Secondly, since my dataset is small and the training might have already finished, I will definitely add a print statement and get back to you if I find any error. Also, please tell me what disabling the GUI means. I have seen the Stack Exchange link that @dusty_nv mentioned, but what is the GUI?
It means the Ubuntu desktop. When you disable it, all you will see is a full-screen terminal. However, it does save a lot of memory.
It sounds like you might not actually have run out of memory though - typically you would see a “killed” message if you ran out of memory (this is when the Linux Out-of-Memory killer “kills” your process to prevent the system from totally running out of memory).
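To double-check whether the Out-of-Memory killer was involved, one option is to search the kernel log after the crash. The sketch below is only an assumption about how you might do that: the marker strings are typical of OOM-killer messages, but the exact wording varies by kernel version, and the function names are made up for illustration. Reading dmesg may also require sudo on some systems.

```python
import subprocess

# Substrings commonly seen in OOM-killer kernel messages (wording varies by kernel).
OOM_MARKERS = ("out of memory", "oom-killer", "killed process")


def find_oom_lines(log_text):
    """Return log lines that look like Out-of-Memory killer activity."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line.lower() for marker in OOM_MARKERS)
    ]


def read_kernel_log():
    """Read the kernel ring buffer via dmesg (hypothetical helper)."""
    result = subprocess.run(
        ["dmesg"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6 as well
    )
    return result.stdout

# Example use after a crash (not executed here):
#     for line in find_oom_lines(read_kernel_log()):
#         print(line)
```

An empty result fits with a crash that was not caused by the OOM killer.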
My guess is PyTorch crashed on exit due to the issues I linked to above, and not actually during training. You should also be able to tell by seeing if it saved your trained model checkpoints or not.
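One quick way to check the checkpoints is to list them with their modification times and compare against when the segfault appeared. This is just a sketch: the file suffixes below are assumptions, so adjust them to whatever your training script actually writes, and list_checkpoints is a made-up helper name.

```python
import os
import time


def list_checkpoints(directory, suffixes=(".pth", ".pth.tar")):
    """Return (filename, modification time) pairs, oldest first.

    The default suffixes are assumptions, not taken from any specific script.
    """
    entries = []
    for name in os.listdir(directory):
        if name.endswith(suffixes):
            mtime = os.path.getmtime(os.path.join(directory, name))
            entries.append((mtime, name))
    entries.sort()
    return [
        (name, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime)))
        for mtime, name in entries
    ]

# Example use (not executed here):
#     for name, saved_at in list_checkpoints("."):
#         print(saved_at, name)
```

If the newest checkpoint was written right before the segfault message, training most likely ran to completion.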
Yes, the trained model checkpoints were saved. Two files were actually saved: the checkpoint and the best model checkpoint. They were saved after just 5 minutes, and then the Segmentation fault message was displayed in the terminal.