Error training and converting to onnx with custom dataset

Just starting to learn on the Jetson Nano 2GB and I’m having issues with the “Hello AI World” tutorial on collecting and creating a custom dataset for re-training ssd-mobilenet. Any help would be appreciated!

First, I get an error (Errno 21) when I try to convert to ONNX:

running on device cuda:0
found best checkpoint with loss 10000.000000 ()
creating network: ssd-mobilenet
num classes: 6
loading checkpoint: models/stuff3/
Traceback (most recent call last):
  File "onnx_export.py", line 86, in <module>
    net.load(args.input)
  File "/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 135, in load
    self.load_state_dict(torch.load(model, map_location=lambda storage, loc: storage))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 571, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 229, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 210, in __init__
    super(_open_file, self).__init__(open(name, mode))
IsADirectoryError: [Errno 21] Is a directory: 'models/stuff3/'

I see the directory on the desktop so I know it is there. Also, I noticed that the validation losses register “nan” toward the end of training:

2021-02-21 15:47:45 - Epoch: 1, Validation Loss: nan, Validation Regression Loss nan, Validation Classification Loss: 2.0592
2021-02-21 15:47:45 - Saved model models/stuff3/mb1-ssd-Epoch-1-Loss-nan.pth
2021-02-21 15:47:45 - Task done, exiting program.

This must be part of the problem? I captured my images via camera-capture and followed the “Training Object Detection Models” video tutorial to make the custom dataset. I had no issue with the earlier exercise of downloading the fruit images from Open Images dataset v6 and retraining ssd-mobilenet. But I also notice that there are more files and folders inside the “fruit” data folder than what I have in my custom dataset folder.

I’ve tried disabling the GUI when running training and the ONNX conversion. I also increased the swap size, as recommended. I still get the error when converting to ONNX, and still get “nan” for the validation loss and regression loss. Can someone point me to what I’m doing wrong?

Thank you!
Paul

Hi @operator,
I could be wrong, since I’m not seeing the code or the commands/configuration you are using.

From the error message it looks as if you are passing a directory (“models/stuff3/”) to something that expects a filename (like “models/stuff3/mb1-ssd-Epoch-1-Loss-nan.pth”).

Best Regards,
Juan Pablo.

Hi @operator, the issue is that the onnx_export.py script tries to parse the checkpoint file names to find the one with the lowest loss, but the only checkpoint in your folder has a nan loss.

https://github.com/dusty-nv/pytorch-ssd/blob/e7b5af50a157c50d3bab8f55089ce57c2c812f37/onnx_export.py#L39
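A minimal sketch of why that search falls through (illustrative only, not the actual onnx_export.py code — the regex and sentinel value are assumptions): every comparison against nan evaluates to False, so a nan-loss checkpoint can never beat the initial sentinel and the search yields an empty name, which leaves the bare directory path to be passed to torch.load().

```python
import re

def find_best_checkpoint(filenames):
    """Return (loss, name) of the lowest-loss checkpoint filename."""
    best_loss = 10000.0   # sentinel, matching the "loss 10000.000000" log line
    best_name = ""
    for name in filenames:
        match = re.search(r"Loss-([0-9.]+|nan)\.pth$", name)
        if match:
            loss = float(match.group(1))
            if loss < best_loss:   # nan < 10000.0 is False, so nan never wins
                best_loss = loss
                best_name = name
    return best_loss, best_name

# With only a nan checkpoint, no name is selected -- hence the empty "()" in
# the "found best checkpoint" log line and the IsADirectoryError that follows.
print(find_best_checkpoint(["mb1-ssd-Epoch-1-Loss-nan.pth"]))
```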

Instead you could try running it as:

$ python3 onnx_export.py --input=models/stuff3/mb1-ssd-Epoch-1-Loss-nan.pth --output=models/stuff3/ssd-mobilenet.onnx

However, since the loss is nan, the model is unlikely to have been trained correctly. When you train it with train_ssd.py, you may want to try lowering the learning rate with the --learning-rate=0.005 argument (the default learning rate is 0.01). Also, how many epochs are you training it for, and how many images are in your dataset?

Hi Dusty! Thanks for your help, I appreciate it. I have tried dropping the learning rate. I have tried different numbers of epochs, from 1 to 5, but “nan” still shows up. I have over 100 images in my dataset, for 5 different objects, all captured and annotated using camera-capture, following your video tutorial. I checked “merge sets” so I assume the images are copied between the train, val, and test folders. Would nan have anything to do with the number of images? Or something with the train, val, and test folders?

What was the lowest learning rate that you tried? If you went down to --learning-rate=0.001 or 0.0005 did it still make nan?

Are the objects in your dataset very small or otherwise challenging? You could also upload the dataset somewhere and I can inspect it and give it a try.

That’s so cool, thanks for the offer to look at the dataset! Here it is (69 MB compressed):

The objects are just stuffed animals. Images were captured with a Raspberry Pi camera module and a lightbox. I have tried taking the learning rate down to 0.0001 and “nan” still occurs. Also tried taking batch-size down to 1 and kept workers at 1. No luck. Any insights would be appreciated!

OK, so after adding some additional logging to print out the losses for each image, I found that two of the XML files had large negative y-coordinates in one of their bounding boxes:

20210118-214543
20210118-214556

Upon removing these from test.txt/train.txt/trainval.txt/val.txt under ImageSets/Main, the model has been training normally (without nan). So remove those two image IDs from the ImageSets and it should train for you.

Not sure how those large negative y-coordinates got there (you will see <ymax>-753862144</ymax> if you inspect those two XML files) - may be a bug in camera-capture tool, sorry about that. I should add more checks to the train_ssd scripts to avoid those conditions.

Thank you! I will remove those two image IDs and try training again. I really appreciate this!