Error training and converting to onnx with custom dataset

Just starting to learn on the Jetson Nano 2GB and I’m having issues with the “Hello AI World” tutorial on collecting and creating a custom dataset for re-training ssd-mobilenet. Any help would be appreciated!

First, I get an error (Errno 21) when I try to convert to ONNX:

running on device cuda:0
found best checkpoint with loss 10000.000000 ()
creating network: ssd-mobilenet
num classes: 6
loading checkpoint: models/stuff3/
Traceback (most recent call last):
  File "onnx_export.py", line 86, in <module>
    net.load(args.input)
  File "/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 135, in load
    self.load_state_dict(torch.load(model, map_location=lambda storage, loc: storage))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 571, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 229, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 210, in __init__
    super(_open_file, self).__init__(open(name, mode))
IsADirectoryError: [Errno 21] Is a directory: 'models/stuff3/'

I see the directory on the desktop so I know it is there. Also, I noticed that the validation losses register “nan” toward the end of training:

2021-02-21 15:47:45 - Epoch: 1, Validation Loss: nan, Validation Regression Loss nan, Validation Classification Loss: 2.0592
2021-02-21 15:47:45 - Saved model models/stuff3/mb1-ssd-Epoch-1-Loss-nan.pth
2021-02-21 15:47:45 - Task done, exiting program.

This must be part of the problem? I captured my images via camera-capture and followed the “Training Object Detection Models” video tutorial to make the custom dataset. I had no issue with the earlier exercise of downloading the fruit images from Open Images dataset v6 and retraining ssd-mobilenet. But I also notice that there are more files and folders inside the “fruit” data folder than what I have in my custom dataset folder.

I’ve tried disabling the GUI when running training and the ONNX conversion. I also increased the swap size, as recommended. I still get the error when converting to ONNX, and still get “nan” for the validation loss and regression loss. Can someone point me to what I’m doing wrong?

Thank you!
Paul

Hi @operator,
I could be wrong, since I’m not seeing the code or the commands/configuration you are using.

From the error message it looks as if you are passing a directory (“models/stuff3/”) to something that expects a filename (like “models/stuff3/mb1-ssd-Epoch-1-Loss-nan.pth”).

Best Regards,
Juan Pablo.

Hi @operator, the issue is that the onnx_export.py script tries to parse the checkpoint file names to find the one with the lowest loss, but the only checkpoint in your folder has a nan loss.

https://github.com/dusty-nv/pytorch-ssd/blob/e7b5af50a157c50d3bab8f55089ce57c2c812f37/onnx_export.py#L39
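A minimal sketch of why that search falls through (illustrative only, not the actual onnx_export.py code — the regex and sentinel value are assumptions): every comparison against nan evaluates to False, so a nan-loss checkpoint can never beat the initial sentinel and the search yields an empty name, which leaves the bare directory path to be passed to torch.load().

```python
import re

def find_best_checkpoint(filenames):
    """Return (loss, name) of the lowest-loss checkpoint filename."""
    best_loss = 10000.0   # sentinel, matching the "loss 10000.000000" log line
    best_name = ""
    for name in filenames:
        match = re.search(r"Loss-([0-9.]+|nan)\.pth$", name)
        if match:
            loss = float(match.group(1))
            if loss < best_loss:   # nan < 10000.0 is False, so nan never wins
                best_loss = loss
                best_name = name
    return best_loss, best_name

# With only a nan checkpoint, no name is selected -- hence the empty "()" in
# the "found best checkpoint" log line and the IsADirectoryError that follows.
print(find_best_checkpoint(["mb1-ssd-Epoch-1-Loss-nan.pth"]))
```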

Instead you could try running it as:

$ python3 onnx_export.py --input=models/stuff3/mb1-ssd-Epoch-1-Loss-nan.pth --output=models/stuff3/ssd-mobilenet.onnx

However, since the loss is nan, the model is unlikely to have been trained correctly. When you train it with train_ssd.py, you may want to try lowering the learning rate with the --learning-rate=0.005 argument (the default learning rate is 0.01). Also, how many epochs are you training it for, and how many images are in your dataset?

Hi Dusty! Thanks for your help, I appreciate it. I have tried dropping the learning rate. I have tried different numbers of epochs, from 1 to 5, but “nan” still shows up. I have over 100 images in my dataset, for 5 different objects, all captured and annotated using camera-capture, following your video tutorial. I checked “merge sets” so I assume the images are copied between the train, val, and test folders. Would nan have anything to do with the number of images? Or something with the train, val, and test folders?

What was the lowest learning rate that you tried? If you went down to --learning-rate=0.001 or 0.0005 did it still make nan?

Are the objects in your dataset very small or otherwise challenging? You could also upload the dataset somewhere and I can inspect it and give it a try.

That’s so cool, thanks for the offer to look at the dataset! Here it is (69 MB compressed):

The objects are just stuffed animals. Images were captured with a Raspberry Pi camera module and a lightbox. I have tried taking the learning rate down to 0.0001 and “nan” still occurs. Also tried taking batch-size down to 1 and kept workers at 1. No luck. Any insights would be appreciated!

OK, so after adding some additional logging to print out the losses for each image, I found that two of the XML files had large negative y-coordinates in one of their bounding boxes:

20210118-214543
20210118-214556

Upon removing these from test.txt/train.txt/trainval.txt/val.txt under ImageSets/Main, the model has been training normally (without nan). So remove those two image IDs from the ImageSets and it should train for you.

Not sure how those large negative y-coordinates got there (you will see <ymax>-753862144</ymax> if you inspect those two XML files) - may be a bug in camera-capture tool, sorry about that. I should add more checks to the train_ssd scripts to avoid those conditions.

Thank you! I will remove those two image IDs and try training again. I really appreciate this!