PLEASE HELP: nvidia Jetson 2GB training fails - TypeError: __init__() missing 1 required positional argument: 'dtype'

I am working on a project using a Jetson Nano 2GB. I downloaded a subset of images like you showed and started training. After training for 10 hours in epoch 1, it fails. I am stuck. Can you please help? I cannot submit new tickets on GitHub. It fails.

Here is the command I ran to download images, which works:

$ python3 open_images_downloader.py --max-images=5000 --class-names "Plastic bag,Bottle,Tin can,Fish,Drinking straw" --data=data/deepseas

Here is the command I ran to train the model, followed by the error:

$ python3 train_ssd.py --data=data/deepseas --model-dir=models/deepseas10 --batch-size=4 --epochs=3

2022-01-25 07:54:23 - Using CUDA...
2022-01-25 07:54:23 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/deepseas10', dataset_type='open_images', datasets=['data/deepseas'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=10, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-01-25 07:54:23 - Prepare training datasets.
2022-01-25 07:54:23 - loading annotations from: data/deepseas/sub-train-annotations-bbox.csv
2022-01-25 07:54:23 - annotations loaded from: data/deepseas/sub-train-annotations-bbox.csv
num images: 4461
2022-01-25 07:54:41 - Dataset Summary:Number of Images: 4461
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 9975
Drinking straw: 81
Fish: 5647
Plastic bag: 267
Tin can: 756
2022-01-25 07:54:41 - Stored labels into file models/deepseas10/labels.txt.
2022-01-25 07:54:41 - Train dataset size: 4461
2022-01-25 07:54:41 - Prepare Validation datasets.
2022-01-25 07:54:41 - loading annotations from: data/deepseas/sub-test-annotations-bbox.csv
2022-01-25 07:54:41 - annotations loaded from: data/deepseas/sub-test-annotations-bbox.csv
num images: 398
2022-01-25 07:54:43 - Dataset Summary:Number of Images: 398
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 305
Drinking straw: 3
Fish: 343
Plastic bag: 4
Tin can: 41
2022-01-25 07:54:43 - Validation dataset size: 398
2022-01-25 07:54:43 - Build network.
2022-01-25 07:54:43 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-01-25 07:54:44 - Took 0.54 seconds to load the model.
2022-01-25 07:55:01 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-01-25 07:55:01 - Uses CosineAnnealingLR scheduler.
2022-01-25 07:55:01 - Start training from epoch 0.
/home/nikil/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at torch.optim — PyTorch 1.10.1 documentation", UserWarning)
2022-01-25 07:57:34 - Epoch: 0, Step: 10/1116, Avg Loss: 12.0591, Avg Regression Loss 4.3237, Avg Classification Loss: 7.7354
2022-01-25 07:58:59 - Epoch: 0, Step: 20/1116, Avg Loss: 7.4997, Avg Regression Loss 3.2166, Avg Classification Loss: 4.2832
2022-01-25 08:02:37 - Epoch: 0, Step: 30/1116, Avg Loss: 8.4262, Avg Regression Loss 3.9077, Avg Classification Loss: 4.5185
...

2022-01-26 06:31:04 - Epoch: 1, Step: 30/1116, Avg Loss: 5.3783, Avg Regression Loss 2.1821, Avg Classification Loss: 3.1962
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
TypeError: __init__() missing 1 required positional argument: 'dtype'

It fails in exactly the same place. Please help.
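
A note on reading this traceback: the last frame is `raise self.exc_type(msg)`, where the DataLoader re-raises an exception that came from a worker by rebuilding it from a single message string. If the original exception class has a constructor that needs more than one argument, the re-raise itself fails with a `TypeError` like the one above, and the real error is hidden. The sketch below reproduces that pattern in plain Python. `AllocationError` and `reraise_with_message` are hypothetical names made up for illustration (the class is loosely modeled on an out-of-memory array error that takes a shape and a dtype); this is one plausible reading of the message, not a confirmed diagnosis of this run.

```python
class AllocationError(MemoryError):
    """Hypothetical error whose constructor needs two arguments."""

    def __init__(self, shape, dtype):
        super().__init__(shape, dtype)
        self.shape = shape
        self.dtype = dtype

    def __str__(self):
        return "unable to allocate array with shape {} and dtype {}".format(
            self.shape, self.dtype)


def reraise_with_message(exc):
    """Rebuild the exception from one message string, similar in spirit
    to the `raise self.exc_type(msg)` frame shown in the traceback."""
    msg = "Caught {} in worker:\n{}".format(type(exc).__name__, exc)
    # Calling the two-argument constructor with a single argument raises
    # TypeError before the original exception can be re-raised.
    raise type(exc)(msg)


def demo():
    """Return the original error and the TypeError that masks it."""
    try:
        raise AllocationError((300, 300, 3), "float32")
    except AllocationError as original:
        try:
            reraise_with_message(original)
        except TypeError as masked:
            return original, masked
```

Calling `demo()` returns both exceptions; the second is a `TypeError` complaining about a missing `dtype` argument, even though the underlying problem in this sketch is a memory error.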

Hi @user152363, there are two ideas that come to mind:

  1. It’s possible there is some data corruption in the dataset, although I think that would make a different error in the dataloader

  2. Since you are on Nano 2GB, the board is low on memory

My guess is that it’s the latter issue. Can you try running it with --batch-size=1 --workers=0 ?

Also, have you tried mounting additional swap, disabling ZRAM, and disabling the desktop UI to save memory?
https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap
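
To see how much RAM and swap the board actually has before starting a long training run, the kernel's `/proc/meminfo` file can be read directly. The following is a minimal sketch using only the Python standard library; the function names `parse_meminfo` and `memory_report` are made up for this example and are not part of jetson-inference or pytorch-ssd.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is ignored here.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_report(info):
    """Summarize available RAM and swap in whole megabytes."""
    def to_mb(kb):
        return kb // 1024

    return {
        "ram_available_mb": to_mb(info.get("MemAvailable", 0)),
        "swap_total_mb": to_mb(info.get("SwapTotal", 0)),
        "swap_free_mb": to_mb(info.get("SwapFree", 0)),
    }
```

On the Jetson this could be used as `memory_report(parse_meminfo(open("/proc/meminfo").read()))`, once before and once after mounting the extra swap, to confirm that `swap_total_mb` went up.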

1. It’s possible there is some data corruption in the dataset, although I think that would make a different error in the dataloader
This is a possibility because two times it failed in exactly the same location.

2. Since you are on Nano 2GB, the board is low on memory
Yeah. This could be the issue. I just ran it on a small dataset for 4 iterations and it ran without any issues. I also noticed that while it was running, the Jetson would pop up messages saying it was low on memory. Let me change the swap and run it again. If it still fails, I will bring the number of threads down. By the way, I tried once with --workers=0 and it still failed.

I will keep you updated.
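
For the data-corruption possibility discussed in this reply, a quick sanity pass over the downloaded images can rule out the most common case, a truncated download. The sketch below is a heuristic using only the standard library: it checks that each JPEG is non-empty and carries the standard start and end markers. It does not decode the image, so a file that passes can still be unreadable, and a valid file with trailing bytes after the end marker would be flagged. The function names are made up for this example, and `data/deepseas` is simply the dataset path used earlier in this thread.

```python
import os

JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def looks_truncated(path):
    """Return True if a JPEG file is too small or lacks its markers."""
    if os.path.getsize(path) < 4:
        return True
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head != JPEG_START or tail != JPEG_END


def find_suspect_images(root):
    """Walk a dataset directory and list JPEG files that look damaged."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith((".jpg", ".jpeg")):
                path = os.path.join(dirpath, name)
                if looks_truncated(path):
                    suspects.append(path)
    return suspects
```

Running `find_suspect_images("data/deepseas")` returns a list of paths to inspect or re-download; an empty list means no file failed this particular check.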

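OK gotcha - note that reducing the batch size will reduce the amount of memory used, because fewer images (and their intermediate activations) are held in memory at each step; the model itself stays the same size. Reducing the number of workers reduces the number of data-loading worker processes running.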
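
To put rough numbers on the batch-size trade-off, here is a small arithmetic sketch. It assumes 300x300 RGB inputs stored as 32-bit floats, which is an assumption about the mb1-ssd configuration rather than something stated in this thread, and it only counts the input tensor; activations and gradients use far more memory but scale with batch size in the same direction. The image count of 4461 comes from the training log above.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


def input_batch_bytes(batch_size, channels=3, height=300, width=300,
                      bytes_per_value=4):
    """Size of one input batch, assuming float32 values (assumption)."""
    return batch_size * channels * height * width * bytes_per_value


def compare(num_images, batch_sizes):
    """Return (batch_size, steps, input_bytes) for each batch size."""
    return [
        (b, steps_per_epoch(num_images, b), input_batch_bytes(b))
        for b in batch_sizes
    ]
```

With `compare(4461, [4, 1])`, a batch size of 4 gives 1116 steps per epoch, which matches the `Step: 10/1116` lines in the log, while a batch size of 1 gives 4461 steps with a quarter of the per-step input memory. The cost of the smaller batch is more steps per epoch.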

Dusty,

SUCCESS! I simply set the swap the way you suggested in https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swa and IT WORKED. Problem resolved. NO MORE ISSUES

Having said that, is there any way to train the model on a faster computer? I am going to post a question about this in a new thread.

Let’s close this thread.
Thank you Dustin!

Hi @user152363, glad to hear that you got it working! Yes, you can run the pytorch-ssd repo on a GPU-enabled Linux PC or server that has PyTorch/CUDA/cuDNN installed on it. I run it on my Ubuntu laptop that has a GeForce card in it using the NGC PyTorch container (using this container all you need to have installed in Ubuntu is the NVIDIA driver and docker).
