Train_ssd.py error - Training Object Detection Models

Hello. I have a Jetson Nano 4GB and, I hate to admit it, have been trying to resolve this error for over a week. I have almost given up entirely and would appreciate any input. This is based on the tutorial video below:

Jetson AI Fundamentals - S3E5 - Training Object Detection Models - YouTube

At 16:45 in the video above, I am at the point of running the command:
python3 train_ssd.py --dataset-type=voc --data=data/stuff --model-dir=models/stuff --batch-size=2 --workers=1 --epochs=1

nvidia@ubuntu:~/Desktop/jetson-inference/python/training/detection/ssd$ python3 train_ssd.py --dataset-type=voc --data=data/stuff --model-dir=models/stuff --batch-size=2 --workers=1 --epochs=1
Traceback (most recent call last):
File "train_ssd.py", line 18, in <module>
from vision.ssd.vgg_ssd import create_vgg_ssd
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/ssd/vgg_ssd.py", line 6, in <module>
from .predictor import Predictor
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/ssd/predictor.py", line 4, in <module>
from .data_preprocessing import PredictionTransform
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 1, in <module>
from ..transforms.transforms import *
File "/home/nvidia/Desktop/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 5, in <module>
from torchvision import transforms
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/__init__.py", line 6, in <module>
from torchvision import datasets
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/datasets/__init__.py", line 1, in <module>
from .lsun import LSUN, LSUNClass
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/datasets/lsun.py", line 2, in <module>
from PIL import Image
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
File "/usr/local/lib/python3.6/dist-packages/Pillow-9.2.0-py3.6-linux-aarch64.egg/PIL/Image.py", line 52, in <module>
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 951, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 894, in _find_spec
File "<frozen importlib._bootstrap_external>", line 1157, in find_spec
File "<frozen importlib._bootstrap_external>", line 1131, in _get_spec
File "<frozen importlib._bootstrap_external>", line 1112, in _legacy_get_spec
File "<frozen importlib._bootstrap>", line 441, in spec_from_loader
File "<frozen importlib._bootstrap_external>", line 544, in spec_from_file_location
File "/usr/local/lib/python3.6/dist-packages/Pillow-9.2.0-py3.6-linux-aarch64.egg/PIL/_deprecate.py", line 1
SyntaxError: future feature annotations is not defined
nvidia@ubuntu:~/Desktop/jetson-inference/python/training/detection/ssd$

Hi,

Could you reinstall the Pillow library with the below command and try it again?

$ pip3 install 'pillow<9'
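For background: Pillow 9.x dropped Python 3.6 support, and PIL/_deprecate.py now starts with "from __future__ import annotations", a future flag that only exists from Python 3.7 onward; that is exactly the SyntaxError in your traceback. After the reinstall you can sanity-check the result (quick checks, assuming pip3 installs to the default path):

$ python3 -c "from __future__ import annotations"   # reproduces the SyntaxError on Python 3.6
$ python3 -c "import PIL; print(PIL.__version__)"   # should now print an 8.x version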

Thanks.

Well that worked perfectly - thank you so much!

Now, moving forward one step to run train.py on the images I collected, I get the error below about not being able to allocate memory. I have a Seeed J1020 (Jetson Nano 4GB). I am almost there, and I would appreciate anything that gets me to my goal of making my own practice dataset.

I am also confused by another video that uses train_ssd.py; I wasn't sure what that was about.

nvidia@ubuntu:~/Desktop/jetson-inference/python/training/classification$ python3 train.py --model-dir=models/tools --batch-size=1 --workers=1 --epochs=1 data/tools/
Use GPU: 0 for training
=> dataset classes: 4 ['background', 'screwdriver', 'socket', 'visegrip']
=> using pre-trained model 'resnet18'
=> reshaped ResNet fully-connected layer with: Linear(in_features=512, out_features=4, bias=True)
Epoch: [0][ 0/20] Time 14.106 (14.106) Data 0.404 ( 0.404) Loss 2.1107e+00 (2.1107e+00) Acc@1 0.00 ( 0.00) Acc@5 100.00 (100.00)
Epoch: [0][10/20] Time 0.152 ( 1.426) Data 0.000 ( 0.038) Loss 5.4006e+01 (3.1639e+01) Acc@1 0.00 ( 18.18) Acc@5 100.00 (100.00)
Epoch: [0] completed, elapsed time 17.193 seconds
Traceback (most recent call last):
File "train.py", line 521, in <module>
main()
File "train.py", line 143, in main
main_worker(args.gpu, ngpus_per_node, args)
File "train.py", line 288, in main_worker
acc1 = validate(val_loader, model, criterion, epoch, num_classes, args)
File "train.py", line 383, in validate
for i, (images, target) in enumerate(val_loader):
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
return self._get_iterator()
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
w.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
nvidia@ubuntu:~/Desktop/jetson-inference/python/training/classification$

Hi,

OSError: [Errno 12] Cannot allocate memory

It looks like the Nano ran out of memory.
Could you please check the system status with tegrastats to see if the memory is fully occupied?

$ sudo tegrastats
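In the output, the RAM field reports used/total memory and SWAP reports swap usage, one line per sampling interval. For illustration only (these values are made up, yours will differ), a line looks roughly like:

RAM 3810/3964MB (lfb 4x2MB) SWAP 1263/1978MB (cached 20MB) CPU [52%@1428,38%@1428,45%@1428,40%@1428] ...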

Thanks.

Hi @azdan2007, if you haven't already, can you try mounting additional SWAP memory, disabling ZRAM, and, if needed, disabling the desktop GUI? You can find instructions for this here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap
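For convenience, the steps from that page boil down to roughly the following (a condensed sketch of the documented procedure; see the link for making the swap entry permanent in /etc/fstab):

$ sudo systemctl disable nvzramconfig      # disable ZRAM
$ sudo fallocate -l 4G /mnt/4GB.swap       # allocate a 4GB swap file
$ sudo mkswap /mnt/4GB.swap
$ sudo swapon /mnt/4GB.swap

If memory is still tight, sudo init 3 drops to a console session and frees the memory used by the desktop GUI (sudo init 5 brings it back).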

That fixed my error message, thank you so much. I built a model and ran imagenet, but it didn't detect my 2 objects. No worries though; I likely did something wrong and will retry later this week. I was a bit confused because I thought the images you get when using camera-capture are supposed to be in the data/train folder, which they were. It errored out and said there were 0 files found in the "val" folder. Do I copy the images from the train folder into the val folder?

The excellent video I am viewing here is Jetson AI Fundamentals - S3E3 - Training Image Classification Models, which is different from the one at the start of the thread.

nvidia@ubuntu:~/Desktop/classification$ python3 train.py --model-dir=models/funny2 --batch-size=1 --workers=1 --epochs=3 data/funny2/
Use GPU: 0 for training
=> dataset classes: 2 ['background', 'pen']
Traceback (most recent call last):
File "train.py", line 521, in <module>
main()
File "train.py", line 143, in main
main_worker(args.gpu, ngpus_per_node, args)
File "train.py", line 199, in main_worker
normalize,
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.9.0a0+01dfa8e-py3.6-linux-aarch64.egg/torchvision/datasets/folder.py", line 256, in __init__
is_valid_file=is_valid_file)
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.9.0a0+01dfa8e-py3.6-linux-aarch64.egg/torchvision/datasets/folder.py", line 132, in __init__
raise RuntimeError(msg)
RuntimeError: Found 0 files in subfolders of: data/funny2/val
Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif,.tiff,.webp
nvidia@ubuntu:~/Desktop/classification$

I actually tried copying the image folders from the train folder into the val folder, and everything ran. But in the end, when I ran imagenet, the background and pen did not seem to be detected. Hmmm. Almost there!

python3 onnx_export.py --model-dir=models/funny2

imagenet --model=models/funny2/resnet18.onnx --labels=data/funny2/labels.txt --input_blob=input_0 --output_blob=output_0 /dev/video0

Hi @azdan2007, glad you got it running! My guess is that you need to train it for more epochs - like 30 epochs instead of 3 epochs. PyTorch reports the accuracy during training. So you can stop the training when it reaches a satisfactory accuracy.
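In other words, the same training command with a higher epoch count, e.g.:

$ python3 train.py --model-dir=models/funny2 --batch-size=1 --workers=1 --epochs=30 data/funny2/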

I trained for 25 epochs and it worked! Fantastic, thank you! My accuracy is only about 65%, but I am OK with that; I only did 2 objects with 25 photos each. You all are great! Now on to more learning with the videos and online tutorial pages.

Not sure if you answered it above, but one final question and we can close this:
I thought the images you get when using camera-capture are supposed to be in the data/train folder, which they were. It errored out and said there were 0 files found in the "val" folder.

Do I always have to copy the images from the train folder into the val folder? It's not a problem; I just didn't see that in the instructions.

OK, great, glad you got it working! If you want to further increase the accuracy, I would recommend collecting more images for your dataset.

You should either collect different images for the val folder (if you are using the camera-capture tool, change the Current Set drop-down to val), or copy the train images to val. For an actual 'production' model, you would want to collect different images. However, if you are just playing around and testing things, you can just copy them over.
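For reference, train.py expects an ImageFolder-style layout with parallel train and val directories and one subfolder per class, along the lines of (a sketch based on this thread's dataset; a test/ set can be added the same way):

data/funny2/
  labels.txt
  train/
    background/
    pen/
  val/
    background/
    pen/

And if you just want to duplicate the training images for a quick test, something like:

$ cp -r data/funny2/train/* data/funny2/val/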
