I am working on a project using a Jetson Nano 2GB. I downloaded a subset of images as you showed and started training. After 10 hours of training, it fails in epoch 1, and I am stuck. Can you please help? I could not open a new ticket on GitHub because the submission fails.
Here is the command I ran to download images, which works:
$ python3 open_images_downloader.py --max-images=5000 --class-names "Plastic bag,Bottle,Tin can,Fish,Drinking straw" --data=data/deepseas
Here is the command I ran to train the model, followed by its output and the error:
$ python3 train_ssd.py --data=data/deepseas --model-dir=models/deepseas10 --batch-size=4 --epochs=3
2022-01-25 07:54:23 - Using CUDA...
2022-01-25 07:54:23 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/deepseas10', dataset_type='open_images', datasets=['data/deepseas'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=10, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-01-25 07:54:23 - Prepare training datasets.
2022-01-25 07:54:23 - loading annotations from: data/deepseas/sub-train-annotations-bbox.csv
2022-01-25 07:54:23 - annotations loaded from: data/deepseas/sub-train-annotations-bbox.csv
num images: 4461
2022-01-25 07:54:41 - Dataset Summary:Number of Images: 4461
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 9975
Drinking straw: 81
Fish: 5647
Plastic bag: 267
Tin can: 756
2022-01-25 07:54:41 - Stored labels into file models/deepseas10/labels.txt.
2022-01-25 07:54:41 - Train dataset size: 4461
2022-01-25 07:54:41 - Prepare Validation datasets.
2022-01-25 07:54:41 - loading annotations from: data/deepseas/sub-test-annotations-bbox.csv
2022-01-25 07:54:41 - annotations loaded from: data/deepseas/sub-test-annotations-bbox.csv
num images: 398
2022-01-25 07:54:43 - Dataset Summary:Number of Images: 398
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 305
Drinking straw: 3
Fish: 343
Plastic bag: 4
Tin can: 41
2022-01-25 07:54:43 - Validation dataset size: 398
2022-01-25 07:54:43 - Build network.
2022-01-25 07:54:43 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-01-25 07:54:44 - Took 0.54 seconds to load the model.
2022-01-25 07:55:01 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-01-25 07:55:01 - Uses CosineAnnealingLR scheduler.
2022-01-25 07:55:01 - Start training from epoch 0.
/home/nikil/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
2022-01-25 07:57:34 - Epoch: 0, Step: 10/1116, Avg Loss: 12.0591, Avg Regression Loss 4.3237, Avg Classification Loss: 7.7354
2022-01-25 07:58:59 - Epoch: 0, Step: 20/1116, Avg Loss: 7.4997, Avg Regression Loss 3.2166, Avg Classification Loss: 4.2832
2022-01-25 08:02:37 - Epoch: 0, Step: 30/1116, Avg Loss: 8.4262, Avg Regression Loss 3.9077, Avg Classification Loss: 4.5185
...
2022-01-26 06:31:04 - Epoch: 1, Step: 30/1116, Avg Loss: 5.3783, Avg Regression Loss 2.1821, Avg Classification Loss: 3.1962
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
TypeError: __init__() missing 1 required positional argument: 'dtype'
It fails in exactly the same place. Please help.
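
In case it helps narrow this down, here is my current reading of the final TypeError. The last frame of the traceback is `raise self.exc_type(msg)`, so the DataLoader seems to rebuild a worker's exception by calling its class with a single message string. If that class needs more than one constructor argument, the rebuild itself fails and the original error is hidden. The sketch below reproduces this without PyTorch. `TwoArgError` and `reraise_like_dataloader` are hypothetical stand-ins, not code from train_ssd.py or PyTorch, and the log does not show which exception the worker actually raised. One plausible candidate is an out-of-memory error from NumPy, whose internal exception takes a shape and a dtype, but that is only an assumption.

```python
class TwoArgError(Exception):
    """Hypothetical stand-in for a worker exception that needs two constructor arguments."""

    def __init__(self, shape, dtype):
        super().__init__(f"failed for shape={shape}, dtype={dtype}")
        self.shape = shape
        self.dtype = dtype


def reraise_like_dataloader(exc):
    """Mimic the `raise self.exc_type(msg)` line from the traceback (assumed behaviour)."""
    msg = f"Caught {type(exc).__name__} in a worker process.\n{exc}"
    # Only one positional argument is passed, so a two-argument
    # constructor fails here before the real error can be re-raised.
    raise type(exc)(msg)


def demo():
    """Return the message of the error that masks the original exception."""
    try:
        try:
            raise TwoArgError((300, 300, 3), "float32")
        except TwoArgError as original:
            reraise_like_dataloader(original)
    except TypeError as masked:
        return str(masked)


if __name__ == "__main__":
    print(demo())
```

If this reading is right, the TypeError is not the real problem, and running with `num_workers` set to 0 (the Namespace above shows `num_workers=2`) should surface the underlying exception in the main process instead.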