Hi. I am following this documentation (jetson-inference/docs/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub) on re-training the SSD model. When I run the command below…
python3 train_ssd.py --data=data/fruit --model-dir=models/fruit --batch-size=4 --epochs=30
…the output is…
2025-04-12 06:00:02 - Using CUDA...
2025-04-12 06:00:02 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/fruit', dataset_type='open_images', datasets=['data/fruit'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2025-04-12 06:01:06 - model resolution 300x300
2025-04-12 06:01:07 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2025-04-12 06:01:07 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2025-04-12 06:01:09 - Prepare training datasets.
2025-04-12 06:01:09 - loading annotations from: data/fruit/sub-train-annotations-bbox.csv
2025-04-12 06:01:11 - annotations loaded from: data/fruit/sub-train-annotations-bbox.csv
num images: 10
2025-04-12 06:01:11 - Dataset Summary:Number of Images: 10
Minimum Number of Images for a Class: -1
Label Distribution:
Apple: 3
Grape: 1
Orange: 86
Strawberry: 3
Watermelon: 8
2025-04-12 06:01:11 - Stored labels into file models/fruit/labels.txt.
2025-04-12 06:01:11 - Train dataset size: 10
2025-04-12 06:01:11 - Prepare Validation datasets.
2025-04-12 06:01:11 - loading annotations from: data/fruit/sub-test-annotations-bbox.csv
2025-04-12 06:01:11 - annotations loaded from: data/fruit/sub-test-annotations-bbox.csv
num images: 930
2025-04-12 06:01:14 - Dataset Summary:Number of Images: 930
Minimum Number of Images for a Class: -1
Label Distribution:
Apple: 329
Banana: 132
Grape: 446
Orange: 826
Pear: 107
Pineapple: 105
Strawberry: 754
Watermelon: 125
2025-04-12 06:01:14 - Validation dataset size: 930
2025-04-12 06:01:14 - Build network.
2025-04-12 06:01:16 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
models/mobilenet-v1-ssd-mp-0_6 100%[===================================================>] 36.23M 3.46MB/s in 13s
2025-04-12 06:01:34 - Took 17.71 seconds to load the model.
2025-04-12 06:01:34 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2025-04-12 06:01:34 - Uses CosineAnnealingLR scheduler.
2025-04-12 06:01:34 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
2025-04-12 06:04:32 - Epoch: 0, Training Loss: 15.0934, Training Regression Loss 5.5559, Training Classification Loss: 9.5375
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [18,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "train_ssd.py", line 410, in <module>
val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
File "train_ssd.py", line 206, in test
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 43, in forward
predicted_locations = predicted_locations[pos_mask, :].reshape(-1, 4)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
…I get the error “CUDA error: device-side assert triggered” during the first validation pass, right after epoch 0 finishes training. Did I miss something?
JetPack version: 4.6-b199
Python version: Python 3.6.9
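One thing that stands out in the log above: the training summary lists 5 classes (Apple, Grape, Orange, Strawberry, Watermelon), while the validation summary lists 8. The assertion `t >= 0 && t < n_classes` fails when a target label index falls outside the number of classes the loss was given, so a validation set with classes the training set never saw could plausibly trigger it. Below is a minimal sketch for checking this, not a confirmed diagnosis. It assumes the annotation CSVs have a `ClassName` column; the function names `read_class_names` and `missing_from_train` are hypothetical helpers written for this post.

```python
import csv


def read_class_names(csv_path, column="ClassName"):
    """Return the set of distinct class names found in one annotation CSV.

    The column name "ClassName" is an assumption about the CSV layout.
    """
    with open(csv_path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}


def missing_from_train(train_csv, val_csv, column="ClassName"):
    """Return the classes present in val_csv but absent from train_csv, sorted."""
    train_classes = read_class_names(train_csv, column)
    val_classes = read_class_names(val_csv, column)
    return sorted(val_classes - train_classes)
```

Calling `missing_from_train("data/fruit/sub-train-annotations-bbox.csv", "data/fruit/sub-test-annotations-bbox.csv")` should return an empty list if every validation class also appears in the training annotations. Going by the label distributions in the log, I would expect it to report Banana, Pear and Pineapple here.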
Thank you