PLEASE HELP: nvidia Jetson 2GB training fails - TypeError: __init__() missing 1 required positional argument: 'dtype'

user152363 · January 26, 2022, 4:51pm

I am working on a project using Jetson Nano 2GB. I downloaded a subset of images like you showed and started training. After training for 10 hours in epoch 1, it fails. I am stuck. Can you please help? I cannot submit new tickets at github. It fails.

Here is the command I ran to download images, which works:

$ python3 open_images_downloader.py --max-images=5000 --class-names "Plastic bag,Bottle,Tin can,Fish,Drinking straw" --data=data/deepseas

Here is the command I run to train the model, followed by the output and the error:

$ python3 train_ssd.py --data=data/deepseas --model-dir=models/deepseas10 --batch-size=4 --epochs=3

2022-01-25 07:54:23 - Using CUDA...
2022-01-25 07:54:23 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/deepseas10', dataset_type='open_images', datasets=['data/deepseas'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=10, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-01-25 07:54:23 - Prepare training datasets.
2022-01-25 07:54:23 - loading annotations from: data/deepseas/sub-train-annotations-bbox.csv
2022-01-25 07:54:23 - annotations loaded from:  data/deepseas/sub-train-annotations-bbox.csv
num images: 4461
2022-01-25 07:54:41 - Dataset Summary:Number of Images: 4461
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 9975
Drinking straw: 81
Fish: 5647
Plastic bag: 267
Tin can: 756
2022-01-25 07:54:41 - Stored labels into file models/deepseas10/labels.txt.
2022-01-25 07:54:41 - Train dataset size: 4461
2022-01-25 07:54:41 - Prepare Validation datasets.
2022-01-25 07:54:41 - loading annotations from: data/deepseas/sub-test-annotations-bbox.csv
2022-01-25 07:54:41 - annotations loaded from:  data/deepseas/sub-test-annotations-bbox.csv
num images: 398
2022-01-25 07:54:43 - Dataset Summary:Number of Images: 398
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 305
Drinking straw: 3
Fish: 343
Plastic bag: 4
Tin can: 41
2022-01-25 07:54:43 - Validation dataset size: 398
2022-01-25 07:54:43 - Build network.
2022-01-25 07:54:43 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-01-25 07:54:44 - Took 0.54 seconds to load the model.
2022-01-25 07:55:01 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-01-25 07:55:01 - Uses CosineAnnealingLR scheduler.
2022-01-25 07:55:01 - Start training from epoch 0.
/home/nikil/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at torch.optim — PyTorch 1.12 documentation
  "torch.optim — PyTorch 1.12 documentation", UserWarning)
2022-01-25 07:57:34 - Epoch: 0, Step: 10/1116, Avg Loss: 12.0591, Avg Regression Loss 4.3237, Avg Classification Loss: 7.7354
2022-01-25 07:58:59 - Epoch: 0, Step: 20/1116, Avg Loss: 7.4997, Avg Regression Loss 3.2166, Avg Classification Loss: 4.2832
2022-01-25 08:02:37 - Epoch: 0, Step: 30/1116, Avg Loss: 8.4262, Avg Regression Loss 3.9077, Avg Classification Loss: 4.5185
...

2022-01-26 06:31:04 - Epoch: 1, Step: 30/1116, Avg Loss: 5.3783, Avg Regression Loss 2.1821, Avg Classification Loss: 3.1962
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
TypeError: __init__() missing 1 required positional argument: 'dtype'

It fails in exactly the same place. Please help.

dusty_nv · January 26, 2022, 6:16pm

Hi @user152363, there are two ideas that come to mind:

1. It’s possible there is some data corruption in the dataset, although I think that would produce a different error in the dataloader.
2. Since you are on Nano 2GB, the board is low on memory.

My guess is that it’s the latter issue. Can you try running it with --batch-size=1 --workers=0 ?

Also, have you tried mounting additional swap, disabling ZRAM, and disabling the desktop UI to save memory?
https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap
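
One plausible reason a memory shortage could surface as the `TypeError` in the traceback above, rather than as an out-of-memory message: the last frame, `raise self.exc_type(msg)`, rebuilds the worker's exception from its type and a single message string. If the original exception class requires more than one constructor argument, that rebuild itself fails and hides the real error. The sketch below reproduces the effect in plain Python. `ArrayAllocationError` and `reraise_like_worker_wrapper` are hypothetical stand-ins written for illustration, not PyTorch or NumPy code, and the thread does not confirm that this is what happened here.

```python
class ArrayAllocationError(MemoryError):
    """Hypothetical stand-in for an allocation error whose constructor
    requires two arguments (shape and dtype) instead of one message."""

    def __init__(self, shape, dtype):
        super().__init__(
            f"Unable to allocate array with shape {shape} and data type {dtype}"
        )
        self.shape = shape
        self.dtype = dtype


def reraise_like_worker_wrapper(exc):
    """Rebuild an exception from its type and ONE message string,
    roughly mirroring the `raise self.exc_type(msg)` frame in the traceback."""
    msg = f"Caught {type(exc).__name__} in a worker process.\n{exc}"
    # This call passes a single positional argument, so a two-argument
    # constructor raises TypeError before the real error can be re-raised.
    raise type(exc)(msg)


def demo():
    """Return the message of the error that actually reaches the caller."""
    try:
        try:
            raise ArrayAllocationError((4, 3, 300, 300), "float32")
        except ArrayAllocationError as original:
            reraise_like_worker_wrapper(original)
    except TypeError as masked:
        return str(masked)


print(demo())
```

If this is the mechanism, the `'dtype'` in the message is a hint that the hidden error was an array allocation failure, which would fit the low-memory guess.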

user152363 · January 26, 2022, 7:25pm

1. It’s possible there is some data corruption in the dataset, although I think that would make a different error in the dataloader
This is a possibility because two times it failed in exactly the same location.

2. Since you are on Nano 2GB, the board is low on memory
Yeah. This could be the issue. I just ran a small data set for 4 iterations and it ran without any issues. I also noticed that while it was running, the Jetson would pop up messages saying low memory. Let me change the swap and run it again. If it still fails, then I will bring the number of threads down. By the way, I tried once with --workers=0 and it still failed.

I will keep you updated.
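
For the data corruption idea, a crude way to rule out truncated downloads is to check that each JPEG starts and ends with the standard JPEG start-of-image and end-of-image markers. The sketch below assumes the images are `.jpg` files somewhere under `data/deepseas`; the function names are made up for illustration, and the check is only a heuristic (a file with extra trailing bytes is flagged even if it decodes fine, and a file can pass and still be damaged in the middle).

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"  # SOI marker at the start of every JPEG
JPEG_END = b"\xff\xd9"    # EOI marker at the end of a complete JPEG


def looks_complete(data):
    """Return True if the bytes begin with SOI and end with EOI."""
    return len(data) >= 4 and data[:2] == JPEG_START and data[-2:] == JPEG_END


def find_suspect_jpegs(root):
    """Return paths of .jpg files under root that fail the marker check."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return [
        path
        for path in sorted(root_path.rglob("*.jpg"))
        if not looks_complete(path.read_bytes())
    ]
```

Calling `find_suspect_jpegs("data/deepseas")` returns a list of files worth opening by hand before blaming the dataset.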

dusty_nv · January 26, 2022, 8:35pm

OK gotcha - note that reducing the batch size will reduce the amount of memory used, because each batch of images (and the activations computed from it) is smaller. Reducing the number of workers reduces the number of dataloader threads running.
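
To give a rough sense of how batch size scales memory, the arithmetic below counts only the bytes of one input batch. It assumes 3-channel 300x300 float32 inputs, which is my assumption for the `mb1-ssd` network and is not stated in this thread; `input_batch_bytes` is a hypothetical helper. Intermediate activations and gradients also grow with batch size and are not counted, while the model weights stay the same size.

```python
def input_batch_bytes(batch_size, channels=3, height=300, width=300,
                      bytes_per_element=4):
    """Bytes needed for one input batch tensor (float32 by default).

    The 3x300x300 shape is an assumed input size, not taken from the thread.
    """
    return batch_size * channels * height * width * bytes_per_element


for batch_size in (4, 1):
    size = input_batch_bytes(batch_size)
    print(f"batch_size={batch_size}: {size} bytes ({size / 2**20:.2f} MiB)")
```

The input tensor itself is small, so on a 2GB board most of the saving from `--batch-size=1` would come from the per-batch activations, which scale the same way.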

user152363 · January 27, 2022, 3:10pm

Dusty,

SUCCESS! I simply set the swap the way you suggested in https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swa and IT WORKED. Problem resolved. NO MORE ISSUES
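
For anyone repeating this, a small sketch for confirming that the swap is actually active before starting a long training run. It parses text in the format of Linux's `/proc/meminfo`; the function names and the 4 GB threshold are assumptions chosen for illustration, so adjust the threshold to whatever swap size you mounted.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_report(text, min_swap_kb=4 * 1024 * 1024):
    """Summarize swap and available memory; min_swap_kb is an assumed target."""
    info = parse_meminfo(text)
    swap_total = info.get("SwapTotal", 0)
    return {
        "swap_total_kb": swap_total,
        "swap_free_kb": info.get("SwapFree", 0),
        "mem_available_kb": info.get("MemAvailable", 0),
        "swap_ok": swap_total >= min_swap_kb,
    }


# Usage on the Jetson (Linux only):
#   with open("/proc/meminfo") as f:
#       print(swap_report(f.read()))
```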

Having said that, is there any way to train the model on a faster computer? I am going to post a question about this in a new thread.

Let’s close this thread.
Thank you Dustin!

dusty_nv · January 27, 2022, 4:41pm

Hi @user152363, glad to hear that you got it working! Yes, you can run the pytorch-ssd repo on a GPU-enabled Linux PC or server that has PyTorch/CUDA/cuDNN installed on it. I run it on my Ubuntu laptop that has a GeForce card in it using the NGC PyTorch container (using this container all you need to have installed in Ubuntu is the NVIDIA driver and docker).

system · March 2, 2022, 2:40am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
