Jetson-inference: cannot train model with custom data set

I’m following the instructions here for training a custom object detection neural network using jetson-inference: https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-collect-detection.md

I created a custom dataset using CVAT and exported it in PASCAL-VOC format, with the images included. I am trying to get the network to identify pieces of wood.
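For reference, the exported tree looks roughly like this (this is the standard VOC layout, which is what train_ssd.py expects for --dataset-type=voc; the exact file names may vary by CVAT version):

data/wood/
├── Annotations/       # one .xml annotation file per image
├── ImageSets/Main/    # text files listing the train/val image IDs
└── JPEGImages/        # the images themselves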

I mounted a swap file and disabled the desktop GUI. Then I navigated to jetson-inference/python/training/detection/ssd and ran the following command:

python3 train_ssd.py --dataset-type=voc --data=data/wood --model-dir=models/wood --batch-size=2 --workers=0 --epochs=1

This results in the following output:

2022-02-08 11:43:48 - Using CUDA...
2022-02-08 11:43:48 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/wood', dataset_type='voc', datasets=['data/wood'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=0, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-02-08 11:43:48 - Prepare training datasets.
2022-02-08 11:43:48 - VOC Labels read from file: ('BACKGROUND', 'wood')
2022-02-08 11:43:48 - Stored labels into file models/wood/labels.txt.
2022-02-08 11:43:48 - Train dataset size: 119
2022-02-08 11:43:48 - Prepare Validation datasets.
2022-02-08 11:43:49 - VOC Labels read from file: ('BACKGROUND', 'wood')
2022-02-08 11:43:49 - Validation dataset size: 119
2022-02-08 11:43:49 - Build network.
2022-02-08 11:43:49 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-02-08 11:43:49 - Took 0.54 seconds to load the model.
2022-02-08 11:44:04 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-02-08 11:44:04 - Uses CosineAnnealingLR scheduler.
2022-02-08 11:44:04 - Start training from epoch 0.

After that, nothing happens for several minutes, and I usually get a UserWarning about lr_scheduler.step() being called before optimizer.step().

Then the program dies with a memory error, a segmentation fault, or no error message at all. Whatever the case, the result is the same: I don’t get a model.

I tried downloading my dataset again in case it was corrupted, but that did not fix it either. Any help with this would be greatly appreciated.

Hi @tomas, it seems likely that your Nano is running out of memory. Have you tried mounting swap, disabling ZRAM, and disabling the desktop GUI, as shown here:

https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap
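From memory, the steps on that page are roughly the following (double-check the doc for the exact commands):

sudo systemctl disable nvzramconfig    # disable ZRAM
sudo fallocate -l 4G /mnt/4GB.swap     # allocate a 4GB swap file
sudo mkswap /mnt/4GB.swap
sudo swapon /mnt/4GB.swap
sudo init 3                            # disable the desktop GUI until reboot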

Also running with --batch-size=1 will decrease the memory usage further.
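i.e. the same command as before, just with a smaller batch:

python3 train_ssd.py --dataset-type=voc --data=data/wood --model-dir=models/wood --batch-size=1 --workers=0 --epochs=1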

I tried all of that, and I got the following output:

2022-02-15 11:54:18 - Using CUDA...
2022-02-15 11:54:18 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/wood', dataset_type='voc', datasets=['data/wood'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=0, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-02-15 11:54:18 - Prepare training datasets.
2022-02-15 11:54:18 - VOC Labels read from file: ('BACKGROUND', 'wood')
2022-02-15 11:54:18 - Stored labels into file models/wood/labels.txt.
2022-02-15 11:54:18 - Train dataset size: 119
2022-02-15 11:54:18 - Prepare Validation datasets.
2022-02-15 11:54:19 - VOC Labels read from file: ('BACKGROUND', 'wood')
2022-02-15 11:54:19 - Validation dataset size: 119
2022-02-15 11:54:19 - Build network.
2022-02-15 11:54:19 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-02-15 11:54:19 - Took 0.54 seconds to load the model.
2022-02-15 11:54:34 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-02-15 11:54:34 - Uses CosineAnnealingLR scheduler.
2022-02-15 11:54:34 - Start training from epoch 0.
2022-02-15 12:00:43 - Epoch: 0, Step: 10/119, Avg Loss: 16.7561, Avg Regression Loss 7.7547, Avg Classification Loss: 9.0014

There is still nothing in the models/wood directory except for labels.txt. Is that all I’m supposed to be getting?

It will save a model to your models/wood directory after each epoch. Since it hasn’t completed a training epoch yet, no models have been saved. The first epoch sometimes takes longer to run because it has to load a bunch of kernels at the beginning. I would let it run for a while and see what happens.
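If you want to confirm it’s making progress, you can watch the checkpoint directory from a second terminal, e.g.:

watch -n 30 ls -lh models/wood    # re-list the directory every 30 seconds

The checkpoints get saved as .pth files with the epoch number and loss in the filename (something like mb1-ssd-Epoch-0-Loss-2.5.pth, going from memory on the exact pattern).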

I’d love to let it run a while, but after getting that output, the program just stops.

I think it just gets stuck there for a while because memory is low and it’s probably paging out to swap? Or does the program crash?

Can you keep an eye on memory/swap usage with sudo tegrastats when running this?
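For example, something like this (I believe tegrastats takes the sampling interval in milliseconds, and --logfile can capture the output to a file):

sudo tegrastats --interval 1000 --logfile tegra.txt    # sample once per second, log to tegra.txt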

Is it possible to do that while the GUI is disabled? I thought I could only run one process at a time

This time I ran the training program from the Nano itself. While it was running, I ran sudo tegrastats on another computer that was connected to the Nano via SSH.
I got the same output as before, and according to tegrastats it didn’t look like I ran out of swap. I will attach a text file of the results from tegrastats, since it’s over 700 lines long.
tegra.txt (207.7 KB)

OK, you got it, that’s what I typically do - although I believe you can press something like Ctrl+Alt+F2 (F3/F4/etc.) to switch terminal consoles on the device when the GUI is disabled.

Although you haven’t run out of swap, almost all physical memory is used and more than 2GB of swap is in use:

RAM 3834/3956MB (lfb 2x1MB) SWAP 2575/4096MB

So this could explain the slowness, as it’s swapping out a lot of memory (which can be rather slow when the swapfile is on an SD card). How long have you let it run for?

It may be worth noting that if you have a Linux PC or server, you can clone the pytorch-ssd repo to it and run the training there too. You would need to have PyTorch etc. installed on your PC, and ideally an NVIDIA GPU card in it.

I cloned the pytorch-ssd repo to my Jetson Nano. I also copied the wood data to the data folder in the repo. I ran the same command as before, and nothing happened.

I meant that if you had an x86-based Ubuntu PC with PyTorch installed, you could clone the pytorch-ssd repo to it and run it there after installing the requirements. If the PC had more compute resources and memory, it could go faster.
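Roughly, the steps on the PC would look like this (the batch size here is just a guess at what a desktop GPU can handle; check the repo’s README for the exact requirements):

git clone https://github.com/dusty-nv/pytorch-ssd
cd pytorch-ssd
pip3 install -r requirements.txt    # PyTorch, OpenCV, etc.
python3 train_ssd.py --dataset-type=voc --data=data/wood --model-dir=models/wood --batch-size=4 --epochs=30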

Although the Nano isn’t designed/intended for training DNNs, for educational/demo purposes it should be able to run train_ssd.py, so I’m not sure why it’s getting hung up on your system. How long have you let it run before terminating it? Were you able to run it on other datasets, like in the Hello AI World tutorial?

Also, have you tried using the jetson-inference Docker container, in case the issue is related to something particular to your install?
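You can launch it with the run script from the repo:

cd jetson-inference
docker/run.sh

The container comes with PyTorch pre-installed, and the run script mounts the data/ and models/ directories from the host, so your dataset and any trained models persist outside the container.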
