Error during re-training SSD-Mobilenet using Jetson Nano 2GB

Hi. I am following this documentation (jetson-inference/docs/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub) on re-training the SSD model. When I execute the command below…

python3 train_ssd.py --data=data/fruit --model-dir=models/fruit --batch-size=4 --epochs=30

…the output is…

2025-04-12 06:00:02 - Using CUDA...
2025-04-12 06:00:02 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/fruit', dataset_type='open_images', datasets=['data/fruit'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2025-04-12 06:01:06 - model resolution 300x300
2025-04-12 06:01:07 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2025-04-12 06:01:09 - Prepare training datasets.
2025-04-12 06:01:09 - loading annotations from: data/fruit/sub-train-annotations-bbox.csv
2025-04-12 06:01:11 - annotations loaded from: data/fruit/sub-train-annotations-bbox.csv num images: 10
2025-04-12 06:01:11 - Dataset Summary:Number of Images: 10
Minimum Number of Images for a Class: -1
Label Distribution:
    Apple: 3
    Grape: 1
    Orange: 86
    Strawberry: 3
    Watermelon: 8
2025-04-12 06:01:11 - Stored labels into file models/fruit/labels.txt.
2025-04-12 06:01:11 - Train dataset size: 10
2025-04-12 06:01:11 - Prepare Validation datasets.
2025-04-12 06:01:11 - loading annotations from: data/fruit/sub-test-annotations-bbox.csv
2025-04-12 06:01:11 - annotations loaded from: data/fruit/sub-test-annotations-bbox.csv num images: 930
2025-04-12 06:01:14 - Dataset Summary:Number of Images: 930
Minimum Number of Images for a Class: -1
Label Distribution:
    Apple: 329
    Banana: 132
    Grape: 446
    Orange: 826
    Pear: 107
    Pineapple: 105
    Strawberry: 754
    Watermelon: 125
2025-04-12 06:01:14 - Validation dataset size: 930
2025-04-12 06:01:14 - Build network.
2025-04-12 06:01:16 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
models/mobilenet-v1-ssd-mp-0_6 100%[===================================================>] 36.23M 3.46MB/s in 13s
2025-04-12 06:01:34 - Took 17.71 seconds to load the model.
2025-04-12 06:01:34 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2025-04-12 06:01:34 - Uses CosineAnnealingLR scheduler.
2025-04-12 06:01:34 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2025-04-12 06:04:32 - Epoch: 0, Training Loss: 15.0934, Training Regression Loss 5.5559, Training Classification Loss: 9.5375
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [18,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train_ssd.py", line 410, in <module>
    val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
  File "train_ssd.py", line 206, in test
    regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 43, in forward
    predicted_locations = predicted_locations[pos_mask, :].reshape(-1, 4)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

…I got the error “CUDA error: device-side assert triggered”. Did I miss something?

JetPack version: 4.6-b199
Python version: Python 3.6.9

Thank you

Hi,

Please make sure you have checked out the branch that corresponds to your JetPack release.
For example, use L4T-R32.7.1 for JetPack 4.6.1.

Thanks

Hi, AastaLLL

I followed your link, and I think the content is the same as my original link.

Then I deleted my dataset (and its folder) in “/jetson-inference/python/training/detection/ssd/data”, rebooted the Jetson Nano, downloaded the dataset again from Open Images (this time with three times as many pictures), and ran the “train_ssd.py” command again. Training ran flawlessly for 2 hours and 30 minutes before I killed the script with Ctrl-C.

Thank you AastaLLL for the idea.
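
A possible explanation for why re-downloading helped, judging only from the log in the first post: the training subset had 10 images covering 5 classes, while the validation subset covered 8 classes (Banana, Pear and Pineapple appear only there). If the loader builds its class list from each annotation file, validation labels could then fall outside the range the network was built for, which would be consistent with the `t >= 0 && t < n_classes` assertion firing during the first validation pass. The sketch below is one way to check for that mismatch before training. It assumes the annotation CSVs have a `ClassName` column; that column name and the helper names `class_names` and `unseen_classes` are assumptions for illustration, not part of train_ssd.py.

```python
import csv


def class_names(csv_path, column="ClassName"):
    """Return the set of distinct class names in an annotation CSV.

    The default column name "ClassName" is an assumption about the
    Open Images style annotation files; adjust it if yours differs.
    """
    with open(csv_path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}


def unseen_classes(train_csv, val_csv, column="ClassName"):
    """List classes present in the validation CSV but absent from the training CSV."""
    missing = class_names(val_csv, column) - class_names(train_csv, column)
    return sorted(missing)
```

For the dataset in this thread, one would call `unseen_classes("data/fruit/sub-train-annotations-bbox.csv", "data/fruit/sub-test-annotations-bbox.csv")`. A non-empty result suggests the training subset is too small to cover every class, in which case downloading more images per class (as done above) is a reasonable fix.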

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks
~0507

Hi,

Just want to double-check.
Are you now able to run transfer learning on the Nano 2GB?

Thanks.
