Train_ssd.py error - Training Object Detection Models

Hello. I have a Jetson Nano 4GB and, I hate to admit it, have been trying to resolve this error for over a week. I have almost given up entirely and would appreciate any input. This is based on the tutorial video below:

Jetson AI Fundamentals - S3E5 - Training Object Detection Models - YouTube

At 16:45 in the video above, I am at the point of running the command:
python3 train_ssd.py --dataset-type=voc --data=data/stuff --model-dir=models/stuff --batch-size=2 --workers=1 --epochs=1

nvidia@ubuntu:~/Desktop/jetson-inference/python/training/detection/ssd$ python3 train_ssd.py --dataset-type=voc --data=data/stuff --model-dir=models/stuff --batch-size=2 --workers=1 --epochs=1
Traceback (most recent call last):
File "train_ssd.py", line 18, in <module>
from vision.ssd.vgg_ssd import create_vgg_ssd
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/ssd/vgg_ssd.py", line 6, in <module>
from .predictor import Predictor
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/ssd/predictor.py", line 4, in <module>
from .data_preprocessing import PredictionTransform
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 1, in <module>
from ..transforms.transforms import *
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 5, in <module>
from torchvision import transforms
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/__init__.py", line 6, in <module>
from torchvision import datasets
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/datasets/__init__.py", line 1, in <module>
from .lsun import LSUN, LSUNClass
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/datasets/lsun.py", line 2, in <module>
from PIL import Image
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
File "/usr/local/lib/python3.6/dist-packages/Pillow-9.2.0-py3.6-linux-aarch64.egg/PIL/Image.py", line 52, in <module>
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 951, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 894, in _find_spec
File "<frozen importlib._bootstrap_external>", line 1157, in find_spec
File "<frozen importlib._bootstrap_external>", line 1131, in _get_spec
File "<frozen importlib._bootstrap_external>", line 1112, in _legacy_get_spec
File "<frozen importlib._bootstrap>", line 441, in spec_from_loader
File "<frozen importlib._bootstrap_external>", line 544, in spec_from_file_location
File "/usr/local/lib/python3.6/dist-packages/Pillow-9.2.0-py3.6-linux-aarch64.egg/PIL/_deprecate.py", line 1
SyntaxError: future feature annotations is not defined
nvidia@ubuntu:~/Desktop/jetson-inference/python/training/detection/ssd$

Hi,

Could you reinstall the Pillow library with the below command and try it again?

$ pip3 install 'pillow<9'
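For background: Pillow 9.x dropped Python 3.6 support, and PIL/_deprecate.py now starts with "from __future__ import annotations", a future flag that only exists from Python 3.7 onward; that is exactly the SyntaxError in your traceback. After the reinstall you can sanity-check the result (quick checks, assuming pip3 installs to the default path):

$ python3 -c "from __future__ import annotations"   # reproduces the SyntaxError on Python 3.6
$ python3 -c "import PIL; print(PIL.__version__)"   # should now print an 8.x version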

Thanks.

Well that worked perfectly - thank you so much!

Now, moving forward one step to run train.py on the images I collected, I get the error below about not being able to allocate memory. I have a Seeed J1020 (Jetson Nano 4GB). I am almost there, and I would appreciate anything that gets me to my goal of making my own practice dataset.

I am also confused by another video that uses train_ssd.py; I wasn't sure what that was about.

nvidia@ubuntu:~/Desktop/jetson-inference/python/training/classification$ python3 train.py --model-dir=models/tools --batch-size=1 --workers=1 --epochs=1 data/tools/
Use GPU: 0 for training
=> dataset classes: 4 ['background', 'screwdriver', 'socket', 'visegrip']
=> using pre-trained model 'resnet18'
=> reshaped ResNet fully-connected layer with: Linear(in_features=512, out_features=4, bias=True)
Epoch: [0][ 0/20] Time 14.106 (14.106) Data 0.404 ( 0.404) Loss 2.1107e+00 (2.1107e+00) Acc@1 0.00 ( 0.00) Acc@5 100.00 (100.00)
Epoch: [0][10/20] Time 0.152 ( 1.426) Data 0.000 ( 0.038) Loss 5.4006e+01 (3.1639e+01) Acc@1 0.00 ( 18.18) Acc@5 100.00 (100.00)
Epoch: [0] completed, elapsed time 17.193 seconds
Traceback (most recent call last):
File "train.py", line 521, in <module>
main()
File "train.py", line 143, in main
main_worker(args.gpu, ngpus_per_node, args)
File "train.py", line 288, in main_worker
acc1 = validate(val_loader, model, criterion, epoch, num_classes, args)
File "train.py", line 383, in validate
for i, (images, target) in enumerate(val_loader):
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
return self._get_iterator()
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
w.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
nvidia@ubuntu:~/Desktop/jetson-inference/python/training/classification$

Hi,

OSError: [Errno 12] Cannot allocate memory

It looks like the Nano ran out of memory.
Could you please check the system status with tegrastats to see if the memory is fully occupied?

$ sudo tegrastats
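In the output, the RAM field reports used/total memory and SWAP reports swap usage, one line per sampling interval. For illustration only (these values are made up, yours will differ), a line looks roughly like:

RAM 3810/3964MB (lfb 4x2MB) SWAP 1263/1978MB (cached 20MB) CPU [52%@1428,38%@1428,45%@1428,40%@1428] ...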

Thanks.

Hi @azdan2007, if you haven't already, can you try mounting additional SWAP memory, disabling ZRAM, and, if needed, disabling the desktop GUI? You can find instructions for this here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap
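For convenience, the steps from that page boil down to roughly the following (a condensed sketch of the documented procedure; see the link for making the swap entry permanent in /etc/fstab):

$ sudo systemctl disable nvzramconfig      # disable ZRAM
$ sudo fallocate -l 4G /mnt/4GB.swap       # allocate a 4GB swap file
$ sudo mkswap /mnt/4GB.swap
$ sudo swapon /mnt/4GB.swap

If memory is still tight, sudo init 3 drops to a console session and frees the memory used by the desktop GUI (sudo init 5 brings it back).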

That fixed my error message, thank you so much. I built a model and ran imagenet, but it didn't detect my 2 objects. No worries though; I likely did something wrong and will retry later this week. I was a bit confused because I thought the images you get when using camera-capture are supposed to be in the data/train folder, which they were. It errored out and said there were 0 files found in the "val" folder. Do I copy the images from the train folder into the val folder?

The excellent video I am viewing here is Jetson AI Fundamentals - S3E3 - Training Image Classification Models, which is different from the one at the start of the thread.

nvidia@ubuntu:~/Desktop/classification$ python3 train.py --model-dir=models/funny2 --batch-size=1 --workers=1 --epochs=3 data/funny2/
Use GPU: 0 for training
=> dataset classes: 2 ['background', 'pen']
Traceback (most recent call last):
File "train.py", line 521, in <module>
main()
File "train.py", line 143, in main
main_worker(args.gpu, ngpus_per_node, args)
File "train.py", line 199, in main_worker
normalize,
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.9.0a0+01dfa8e-py3.6-linux-aarch64.egg/torchvision/datasets/folder.py", line 256, in __init__
is_valid_file=is_valid_file)
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.9.0a0+01dfa8e-py3.6-linux-aarch64.egg/torchvision/datasets/folder.py", line 132, in __init__
raise RuntimeError(msg)
RuntimeError: Found 0 files in subfolders of: data/funny2/val
Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif,.tiff,.webp
nvidia@ubuntu:~/Desktop/classification$

I actually tried copying the image folders from the train folder into the val folder, and everything ran. But in the end, when I ran imagenet, the background and pen did not seem to be detected. Hmmm. Almost there!

python3 onnx_export.py --model-dir=models/funny2

imagenet --model=models/funny2/resnet18.onnx --labels=data/funny2/labels.txt --input_blob=input_0 --output_blob=output_0 /dev/video0

Hi @azdan2007, glad you got it running! My guess is that you need to train it for more epochs - like 30 epochs instead of 3 epochs. PyTorch reports the accuracy during training. So you can stop the training when it reaches a satisfactory accuracy.
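In other words, the same training command with a higher epoch count, e.g.:

$ python3 train.py --model-dir=models/funny2 --batch-size=1 --workers=1 --epochs=30 data/funny2/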

I trained for 25 epochs and it worked! Fantastic, thank you! My accuracy is only about 65%, but I am OK with that; I only did 2 objects with 25 photos each. You all are great! Now on to more learning with the videos and online tutorial pages.

Not sure if you answered it above, but one final question and we can close this:
I thought the images you get when using camera-capture are supposed to be in the data/train folder, which they were. It errored out and said there were 0 files found in the "val" folder.

Do I always have to copy the images from the train folder into the val folder? It's not a problem; I just didn't see that in the instructions.

OK, great, glad you got it working! If you want to further increase the accuracy, I would recommend collecting more images for your dataset.

You should either collect different images for the val folder (if you are using the camera-capture tool, change the Current Set drop-down to val), or copy the train images to val. For an actual 'production' model, you would want to collect different images. However, if you are just playing around and testing things, you can just copy them over.
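For reference, train.py expects an ImageFolder-style layout with parallel train and val directories and one subfolder per class, along the lines of (a sketch based on this thread's dataset; a test/ set can be added the same way):

data/funny2/
  labels.txt
  train/
    background/
    pen/
  val/
    background/
    pen/

And if you just want to duplicate the training images for a quick test, something like:

$ cp -r data/funny2/train/* data/funny2/val/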
