Re-training ResNet-18 model on Jetson Nano | How many epochs? The model shows only one class on everything after re-training

Hello people!

I am really new to Linux, image detection and Python, so please assume that I have no background; the answer might be really simple.

I am following the Jetson AI Fundamentals course (season 3, with Dusty) on my Jetson Nano 2GB. In his videos Dusty runs all the examples with 1 epoch and afterwards shows the re-trained result.

My question: the re-trained result that Dusty shows was not achieved with 1 epoch, right?
So, if I’m trying to re-train with my own dataset (or, for instance, with the cat_dog dataset), I need to run at least 30 epochs to get around 80% accuracy, correct?

Therefore, if I run only one epoch, is it normal that everything is recognized as a dog?

P.S.: I tried the cat_dog_100_epochs model and it recognizes cats and dogs pretty well.

Thanks!

Hi,

This looks like a Jetson issue. Please refer to the below samples in case they are useful.

For any further assistance, we will move this post to the Jetson-related forum.

Thanks!

Thanks, and sorry if I posted the topic in the wrong forum.

I followed the dusty-nv/jetson-inference GitHub tutorial step by step; the question above came up after running it several times on several examples.

Update: I trained the model for 30 epochs, and it still recognizes everything as a dog (with varying probabilities), even though I ran it on the cat test dataset.

I faced the same issue when I tried to build my own dataset: the model only ever output one class.

It seems that no matter what dataset I use or how many epochs I train for, the model shows only one class on everything…

This is what I did with the cat_dog dataset:

python3 train.py --model-dir=models/cat_dog data/cat_dog --batch-size=4 --workers=1 --epochs=30

Then I exported it to ONNX:

python3 onnx_export.py --model-dir=models/cat_dog

Then I tested it on the cat test images using the exported resnet18.onnx model, saving the output images to the data/cat_dog directory:

imagenet --model=models/cat_dog/resnet18.onnx --input_blob=input_0 --output_blob=output_0 --labels=data/cat_dog/labels.txt \
           data/cat_dog/test/cat data/cat_dog/cat_test_%i.jpg

I already tried wiping the Jetson and re-installing everything from scratch (latest JetPack image), with the same result…

Hi,

Did you follow the steps shared in the below document?
Was any error shown?

Thanks.

Hi @rita4ka, yes, as you found, you are correct: I only trained for 1 epoch in the video to save time. It depends on the size of your dataset, but in practice I typically train for at least 30 epochs.
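
As a rough illustration of why the epoch count depends on dataset size: the number of weight updates the model receives is approximately epochs × ceil(images / batch size). The sketch below assumes a hypothetical training set of 5,000 images (not a figure from this thread) and a batch size of 4; `training_steps` is an illustrative helper, not part of train.py.

```python
import math


def training_steps(num_images, batch_size, epochs):
    """Approximate number of optimizer updates over a whole training run."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # Hypothetical dataset size, chosen only for illustration.
    NUM_IMAGES = 5000
    for epochs in (1, 30):
        updates = training_steps(NUM_IMAGES, batch_size=4, epochs=epochs)
        print(f"{epochs} epoch(s): {updates} updates")
```

With a small dataset, a single epoch gives the network comparatively few updates, which is one reason a 1-epoch model may not have learned to separate the classes yet.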

When you trained your model with PyTorch, what was the accuracy that it reported at the end of the training process?

Can you try deleting the *.engine file from your model’s directory, and then running the imagenet program again?
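
For anyone who prefers to script this step, here is a minimal sketch that lists (and optionally deletes) any *.engine files in a model directory. The helper name `remove_cached_engines` and the default dry-run behaviour are assumptions made for illustration; the directory `models/cat_dog` is taken from the commands earlier in the thread.

```python
from pathlib import Path


def remove_cached_engines(model_dir, dry_run=True):
    """Return the *.engine files in model_dir, deleting them unless dry_run."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return []
    engines = sorted(model_dir.glob("*.engine"))
    for engine in engines:
        if not dry_run:
            engine.unlink()  # only the engine files are removed, not the .onnx
    return engines


if __name__ == "__main__":
    # Dry run first: print what would be deleted without touching anything.
    for path in remove_cached_engines("models/cat_dog", dry_run=True):
        print("would delete:", path)
```

Calling it again with `dry_run=False` removes the listed files, so that a fresh engine is built the next time imagenet loads the model.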

Can you also try my pre-trained cat/dog model to rule out any issue with your jetson-inference installation? https://nvidia.box.com/s/zlvb4y43djygotpjn6azjhwu0r3j0yxc

Hi all,

@AastaLLL @dusty_nv
Firstly, a big thank you for helping me figure this out, and for everything you created here to make it easier for people like me (with no background) to hop on the inference/DL/AI train :)

@AastaLLL - yes, I followed the steps to the letter (I only changed the batch size to make it easier on my 2GB Nano).
I did not get any errors.

@dusty_nv
The accuracy was around 67% after each epoch; however, it was exactly the same number every time, something like 67.513 (I don’t remember exactly), which I found weird.
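
One possible sanity check for an accuracy that never changes is to compare it with the accuracy of always guessing a single class. The sketch below is only an illustration under stated assumptions: it assumes a `<split>/<class>/<image>` folder layout (as in `data/cat_dog/test/cat`), the `val` folder name and the class counts in the example are hypothetical, and the helper names are made up for this post.

```python
from pathlib import Path


def count_images(split_dir):
    """Count files per class, assuming a <split>/<class>/<image> layout."""
    return {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in sorted(Path(split_dir).iterdir())
        if d.is_dir()
    }


def single_class_accuracy(class_counts):
    """Accuracy (in %) obtained by always predicting each class."""
    total = sum(class_counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in class_counts.items()}


def matches_single_class(reported_acc, class_counts, tol=0.05):
    """Return the class whose always-guess accuracy matches, or None."""
    for name, acc in single_class_accuracy(class_counts).items():
        if abs(acc - reported_acc) <= tol:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical counts for illustration only; for a real check, use
    # something like count_images("data/cat_dog/val") (folder name assumed).
    counts = {"cat": 325, "dog": 675}
    print(matches_single_class(67.5, counts))
```

If the reported accuracy matches one of these values, the model may be predicting that class for every image; if it does not match, the constant number has some other cause.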

I did run your 100-epoch ONNX file and it recognized the cats/dogs pretty well (except the Siamese cats, which it thought were dogs).

However, I have an update. Meanwhile I followed the ROS2 installation (here too, I am learning as I go), as my final goal is an autonomous robot. As part of that I also did the “building the project from source” steps, which I had not done until now because I had worked exclusively with the container. After that I successfully trained a 40-epoch cat_dog model, which does recognize cats and dogs pretty well (and presents both labels accordingly).

While building the project from source, I noticed that I was missing a lot of CUDA drivers (?) or something like that, and it took a good 40 minutes to install them. Maybe this has something to do with the fix.

Clarification: I did the ROS2 installation before I saw your reply about deleting the *.engine file. After I ran the training again (and it worked) I did not delete anything; however, I did create a brand new folder for this training because I gave the model a different name, so I believe the *.engine file was newly created for this run.

I thought that building the container was enough for the inference recognition projects, but maybe I misunderstood the guide/YouTube tutorials?

Soon I will collect my own dataset for a detection project; I hope I won’t run into the one-label problem again :)

Thank you all!!!