Re-training ResNet-18 model on Jetson nano | how many epochs? The model shows only one class on everything after re-training

rita4ka · May 7, 2022, 11:22am

Hello people!

I am really new to Linux, image detection and python, so assume that I don’t have any base and the answer might be really simple.

I am following Jetson AI Fundamentals on my Jetson Nano 2 GB, season 3 with Dusty. In his videos Dusty runs all the examples with 1 epoch and afterwards he shows the re-trained result.

My question: the re-trained result that Dusty shows is not achieved with 1 epoch, right?
So, if I’m trying to re-train with my own dataset (or, for instance, with the cat_dog dataset), I need to run at least 30 epochs to get around 80% accuracy, correct?

Therefore, if I am running only one epoch, is it normal that everything is recognized as a dog?

P.S: I tried to use the cat_dog_100_epochs and it recognizes pretty well.

Thanks!

NVES · May 7, 2022, 11:37am

Hi,

This looks like a Jetson issue. Please refer to the samples below in case they are useful.

For any further assistance, we will move this post to the Jetson-related forum.

Thanks!

rita4ka · May 7, 2022, 11:44am

Thanks, and sorry if I posted the topic in the wrong forum.

I followed the dusty-nv/jetson-inference GitHub step by step; the question above came up after running it several times on several examples.

rita4ka · May 8, 2022, 5:37pm

Update: I trained the model for 30 epochs, and it still recognizes everything as a dog (with various probabilities) when I test it on the cat test dataset.

I faced the same issue when I tried to build my own dataset: it only output one class.

It seems that no matter what dataset I use or how many epochs I train for, the model shows only one class for everything…

This is what I did with the cat_dog dataset:

python3 train.py --model-dir=models/cat_dog data/cat_dog --batch-size=4 --workers=1 --epochs=30

Then I exported it to ONNX:

python3 onnx_export.py --model-dir=models/cat_dog

Then I tested it on the cat test dataset using resnet18, and saved the output images to the cat_dog directory:

imagenet --model=models/cat_dog/resnet18.onnx --input_blob=input_0 --output_blob=output_0 --labels=data/cat_dog/labels.txt \
           data/cat_dog/test/cat data/cat_dog/cat_test_%i.jpg

I already tried to wipe the Jetson and re-install everything (latest Jetpack image) from scratch - same result…
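
When a classifier reports a single class for every image, one quick thing to rule out is a lopsided or incomplete dataset. The following is a minimal sketch, not part of jetson-inference, that counts the images per class in each split. It assumes the dataset is laid out as `<root>/<split>/<class>/<images>` (for example `data/cat_dog/train/cat`); the function name `count_images_per_class` and the list of image suffixes are hypothetical choices made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your own dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(dataset_root):
    """Return {split: {class_name: image_count}} for a dataset laid out as
    <dataset_root>/<split>/<class>/<image files>.

    Plain files directly under the root (such as labels.txt) are ignored.
    """
    counts = {}
    root = Path(dataset_root)
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        per_class = {}
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            # Count only regular files with a known image extension.
            per_class[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
        counts[split_dir.name] = per_class
    return counts
```

Calling `count_images_per_class("data/cat_dog")` and printing the result shows at a glance whether each split contains both classes in roughly equal numbers.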

AastaLLL · May 9, 2022, 3:17am

Hi,

Did you follow the steps shared in the document below?
Was any error shown?

github.com

dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md

# Re-training on the Cat/Dog Dataset

The first model that we'll be re-training is a simple model that recognizes two classes:  cat or dog.

<img src="https://github.com/dusty-nv/jetson-inference/raw/python/docs/images/pytorch-cat-dog.jpg" width="700">

Provided below is an 800MB dataset that includes 5000 training images, 1000 validation images, and 200 test images, each evenly split between the cat and dog classes.  The set of training images is used for transfer learning, while the validation set is used to evaluate classification accuracy during training, and the test images are to be used by us after training completes.  The network is never directly trained on the validation and test sets, only the training set.

The images from the dataset are made up of many different breeds of dogs and cats, including large felines like tigers and mountain lions since the amount of cat images available was a bit lower than dogs.  Some of the images also picture humans, which the detector is essentially trained to ignore as background and focus on the cat vs. dog content.

To get started, first make sure that you have [PyTorch installed](pytorch-transfer-learning.md#installing-pytorch) on your Jetson, then download the dataset below and kick off the training script.  After that, we'll test the re-trained model in TensorRT on some static images and a live camera feed. 

## Downloading the Data

During this tutorial, we'll store the datasets on the host device under `jetson-inference/python/training/classification/data`, which is one of the directories that is automatically [mounted into the container](aux-docker.md#mounted-data-volumes).  This way the dataset won't be lost when you shutdown the container.

This file has been truncated.

Thanks.

dusty_nv · May 9, 2022, 7:23pm

Hi @rita4ka, yes, as you found, you are correct: I trained for just 1 epoch in the video to keep it short. It depends on the size of your dataset, but in practice I typically train for at least 30 epochs.

When you trained your model with PyTorch, what was the accuracy that it reported at the end of the training process?

Can you try deleting the *.engine file from your model’s directory, and try running imagenet program again?

Can you also try my pre-trained cat/dog model to rule out any issue with your jetson-inference installation? https://nvidia.box.com/s/zlvb4y43djygotpjn6azjhwu0r3j0yxc
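
The suggestion to delete the *.engine file is there because a TensorRT engine cached next to the ONNX file may be reused on later runs, even after the model has been re-trained and re-exported. Below is a minimal sketch of that cleanup step in Python; the helper name `remove_cached_engines` and its `dry_run` flag are hypothetical, and it only looks at files directly inside the given model directory.

```python
from pathlib import Path


def remove_cached_engines(model_dir, dry_run=True):
    """Find files ending in .engine directly inside model_dir.

    With dry_run=True (the default) nothing is deleted; the matching
    paths are only returned so they can be reviewed first.
    """
    engines = sorted(p for p in Path(model_dir).glob("*.engine") if p.is_file())
    for path in engines:
        if not dry_run:
            path.unlink()  # delete the cached engine file
    return engines
```

For example, `remove_cached_engines("models/cat_dog")` lists the cached engines, and `remove_cached_engines("models/cat_dog", dry_run=False)` deletes them so the engine is rebuilt from the current ONNX file the next time imagenet runs.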

rita4ka · May 12, 2022, 5:40pm

Hi all,

@AastaLLL @dusty_nv
Firstly, a big thank you for helping me figure this out, and for everything you created here to make it easier for people like me (with no background) to hop on the inference/DL/AI train :)

@AastaLLL - yes, I followed the steps to the letter (I only changed the batch size to make it easier on my 2GB Nano).
I did not get any errors.

@dusty_nv
The accuracy was around 67% after each epoch; however, it was the exact same number every time, something like 67.513 (I don’t remember exactly), which I found weird.
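
A validation accuracy that is identical after every epoch can be a hint that the predictions are not changing between epochs, for instance because the network outputs the same class for every image. The sketch below is a hypothetical helper (not part of train.py) that flags this pattern in a list of per-epoch accuracies copied from the training log; the `tolerance` value is an arbitrary assumption.

```python
def accuracy_is_stuck(accuracies, tolerance=1e-6):
    """Return True if every per-epoch accuracy is equal within tolerance.

    A single epoch is never reported as stuck, since there is nothing
    to compare it against.
    """
    if len(accuracies) < 2:
        return False
    return max(accuracies) - min(accuracies) <= tolerance
```

With values like the ones described above, `accuracy_is_stuck([67.513, 67.513, 67.513])` returns `True`, while a run whose accuracy moves from epoch to epoch returns `False`.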

I did run your 100-epoch ONNX file and it recognized the cats and dogs pretty well (except the Siamese cats, which it thought were dogs).

However, I have an update. Meanwhile, I followed the ROS2 installation (here too, I am learning as I go), as my final goal is an autonomous robot. For that I also did the “building the project from source” steps, which I had not done until now because I had worked exclusively with the container. After that, I successfully trained a 40-epoch cat_dog model which does recognize cats and dogs pretty well (and presents both labels accordingly).

While building the project from source, I noticed that I was missing a lot of CUDA drivers (?) or something like that, and it took a good 40 minutes to install them - maybe this has something to do with the fix.

Clarification: I did the ROS2 installation before I saw your reply about deleting the *.engine file. After running the training again (when it worked), I did not delete anything; however, I did create a brand-new folder for this training, as I gave it a different name (so I believe the *.engine file was newly created for this run).

I thought that building the container was enough for the inference recognition projects, but maybe I misunderstood the guide/YouTube tutorials?

Soon I will collect my own dataset for a detection project; I hope I won’t run into the one-label problem again :)

Thank you all!!!

system · June 8, 2022, 2:41am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.