I am trying to design a system that can detect a human face wearing a face mask. However, after following the guide at ‘Hello AI World: Re-training SSD-Mobilenet’, I found that my Jetson Nano 2GB is not capable of re-training a model on a dataset downloaded from the ‘Open Images Dataset’, even when the dataset contains only 500 pictures.
jetson-inference/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub
In my opinion, this issue is due to the board's minimal hardware resources (2GB RAM). Most of the time, the training process fails, stops, or gets killed partway through, as shown in the picture I attached below.
I think my best option is to download an existing detection model and use it with the jetson-inference tools. However, I have no idea how to do so, or even whether it is possible. Hopefully there is a guide or solution for this problem.
Hi @faiz26, to reduce memory usage on your Nano 2GB, you may want to try these suggestions from the tutorial:
Alternatively, if you have a PC/laptop with a GPU, or a cloud instance, you can run the pytorch-ssd training code from the tutorial there instead. You would need to have PyTorch and the other Python dependencies installed on it.
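Before moving the training to a PC, a quick check like the one below can show whether the Python dependencies are importable. This is only a minimal sketch: `missing_modules` is a hypothetical helper, and the module list (`torch`, `torchvision`) is an assumption about what the training code needs, so adjust it to match the tutorial's requirements.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed dependency list for the pytorch-ssd training code; extend as needed.
required = ["torch", "torchvision"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules are importable.")
```

This only confirms that the modules can be found; it does not check versions or whether PyTorch can see the GPU.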
However, I trained the detection model on my own dataset (900 pictures), which I annotated in CVAT and downloaded from there. Anyway, I am more than happy because it works for my application :D
There are a few things that I did:
- Reflashed my USB drive (250GB SSD, boot from USB)
- Disabled ZRAM
- Mounted additional disk swap, for a total of 8.1GB of swap
- Disabled the desktop GUI and operated the Jetson via SSH (PuTTY)
- Ran the detection training program via Docker (inside the container). Previously I was not using the docker script, because I had built the project from source
- Trained on my own dataset. I think the image resolution/quality is much lower compared to the ‘Open Images’ ones.
- I intentionally added another cooling fan (powered by an external power source) to keep the Jetson at a lower temperature, because the A0 sensor can sometimes reach 50 degrees Celsius or more during a training session. The additional fan manages to keep the A0 sensor below 50 degrees Celsius.
- Able to train successfully at batch-size=2, workers=1, and epochs=30
- Most of the time, RAM usage fluctuates between 1.5GB and 1.9GB
- Swap utilization is quite low, less than 2GB most of the time
- GPU utilization is very high, staying at 99% most of the time. If it drops to 0 and stays there for a significant amount of time (2-3 minutes), the training session has usually already failed or been killed, based on my observation.
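For anyone wanting to reproduce the settings in the list above, here is a rough sketch of how the training command might be put together. `build_train_command` is a hypothetical helper, the `data/mask` and `models/mask` paths are placeholders, and `--dataset-type=voc` assumes the CVAT export was in Pascal VOC format, which the post does not state.

```python
import shlex


def build_train_command(data_dir, model_dir, batch_size=2, workers=1, epochs=30):
    """Build the argument list for train_ssd.py using the low-memory settings above."""
    return [
        "python3", "train_ssd.py",
        "--dataset-type=voc",          # assumption: dataset exported as Pascal VOC
        f"--data={data_dir}",          # placeholder dataset directory
        f"--model-dir={model_dir}",    # placeholder output directory
        f"--batch-size={batch_size}",
        f"--workers={workers}",
        f"--epochs={epochs}",
    ]


cmd = build_train_command("data/mask", "models/mask")

# Print the command so it can be reviewed and run inside the container.
print(shlex.join(cmd))
```

The sketch only prints the command rather than launching it, so the paths and dataset type can be checked first.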