Error during re-training SSD-Mobilenet using Jetson Nano 2GB

guruh2000 · April 12, 2025, 8:00am

Hi. I am following this documentation (jetson-inference/docs/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub) on re-training the SSD model. When I execute the command below…

python3 train_ssd.py --data=data/fruit --model-dir=models/fruit --batch-size=4 --epochs=30

…the output is…

2025-04-12 06:00:02 - Using CUDA...
2025-04-12 06:00:02 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/fruit', dataset_type='open_images', datasets=['data/fruit'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2025-04-12 06:01:06 - model resolution 300x300
2025-04-12 06:01:07 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2025-04-12 06:01:09 - Prepare training datasets.
2025-04-12 06:01:09 - loading annotations from: data/fruit/sub-train-annotations-bbox.csv
2025-04-12 06:01:11 - annotations loaded from: data/fruit/sub-train-annotations-bbox.csv num images: 10
2025-04-12 06:01:11 - Dataset Summary:Number of Images: 10
Minimum Number of Images for a Class: -1
Label Distribution:
    Apple: 3
    Grape: 1
    Orange: 86
    Strawberry: 3
    Watermelon: 8
2025-04-12 06:01:11 - Stored labels into file models/fruit/labels.txt.
2025-04-12 06:01:11 - Train dataset size: 10
2025-04-12 06:01:11 - Prepare Validation datasets.
2025-04-12 06:01:11 - loading annotations from: data/fruit/sub-test-annotations-bbox.csv
2025-04-12 06:01:11 - annotations loaded from: data/fruit/sub-test-annotations-bbox.csv num images: 930
2025-04-12 06:01:14 - Dataset Summary:Number of Images: 930
Minimum Number of Images for a Class: -1
Label Distribution:
    Apple: 329
    Banana: 132
    Grape: 446
    Orange: 826
    Pear: 107
    Pineapple: 105
    Strawberry: 754
    Watermelon: 125
2025-04-12 06:01:14 - Validation dataset size: 930
2025-04-12 06:01:14 - Build network.
2025-04-12 06:01:16 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
models/mobilenet-v1-ssd-mp-0_6 100%[===================================================>] 36.23M 3.46MB/s in 13s
2025-04-12 06:01:34 - Took 17.71 seconds to load the model.
2025-04-12 06:01:34 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2025-04-12 06:01:34 - Uses CosineAnnealingLR scheduler.
2025-04-12 06:01:34 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2025-04-12 06:04:32 - Epoch: 0, Training Loss: 15.0934, Training Regression Loss 5.5559, Training Classification Loss: 9.5375
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [18,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train_ssd.py", line 410, in <module>
    val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
  File "train_ssd.py", line 206, in test
    regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 43, in forward
    predicted_locations = predicted_locations[pos_mask, :].reshape(-1, 4)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

…I got the error “CUDA error: device-side assert triggered”. Did I miss something?

Jetpack Version: 4.6-b199
Python version: Python 3.6.9

Thank you

AastaLLL · April 14, 2025, 2:40am

Hi,

Please make sure you have checked out the branch that corresponds to your JetPack version.
For example, use the L4T-R32.7.1 branch for JetPack 4.6.1.

Thanks

guruh2000 · April 15, 2025, 7:48am

Hi, AastaLLL

I followed your link, and I think the content is the same as in my original link.

Then I deleted my dataset (and its folder) in “/jetson-inference/python/training/detection/ssd/data”, rebooted the Jetson Nano, downloaded the dataset again from Open Images (this time with three times as many pictures), and ran the “train_ssd.py” command again. The training ran flawlessly for 2 hours and 30 minutes before I killed the script with Ctrl-C.
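
A note for anyone who hits the same assertion: in the log above, the training split lists only 5 classes from 10 images, while the validation split lists 8 classes. That mismatch could put validation labels outside the range the network was built for, which would be consistent with `t >= 0 && t < n_classes` failing during the validation pass. Below is a minimal sketch for checking the two splits before training. It assumes the annotation CSVs have a `ClassName` column (an assumption about the downloader's output; change `column` if your files differ). `class_names` and `compare_splits` are hypothetical helper names, not part of jetson-inference.

```python
import csv
from pathlib import Path


def class_names(csv_path, column="ClassName"):
    """Return the sorted, de-duplicated class names found in one annotation CSV.

    The column name is an assumption; adjust it to match your CSV header.
    """
    with open(csv_path, newline="") as f:
        return sorted({row[column] for row in csv.DictReader(f)})


def compare_splits(data_dir, column="ClassName"):
    """Compare the train and test annotation files inside data_dir.

    Returns (train_classes, test_classes, missing), where `missing` lists
    classes that appear in the test split but not in the train split.
    """
    data_dir = Path(data_dir)
    train = class_names(data_dir / "sub-train-annotations-bbox.csv", column)
    test = class_names(data_dir / "sub-test-annotations-bbox.csv", column)
    missing = sorted(set(test) - set(train))
    return train, test, missing
```

Calling `compare_splits("data/fruit")` and finding a non-empty third value would suggest the training split is incomplete, in which case downloading the dataset again, as described above, is a reasonable next step.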

Thank you AastaLLL for the idea.

AastaLLL · April 16, 2025, 7:31am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks ~0507

Hi,

Just want to double-check.
Are you able to run the transfer learning on the Nano 2GB?

Thanks.

system · May 21, 2025, 1:03am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
