Problems with train_ssd.py

I’m trying to follow this page on transfer learning for SSD-Mobilenet: https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-collect-detection.md

I’ve created a new folder in /home/james called net1 for my dataset. camera-capture runs OK, and I’ve captured some images, drawn bounding boxes on them and saved them, mostly with the ‘merge sets’ button ticked. That seems to have worked: it created folders called Annotations, ImageSets, JPEGImages, test, train and val. They all look like they contain what they should, except that test, train and val are empty, which I assume get populated during training.
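A quick way to sanity-check the captured dataset before training is to list the annotation files that contain no boxes. This is only a minimal sketch, assuming the Annotations folder holds Pascal VOC style XML files with one <object> element per labelled box; the function name find_unannotated is made up for illustration and is not part of jetson-inference.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_unannotated(dataset_dir):
    """Return the base names of annotation files that contain no <object> entries."""
    annotations = Path(dataset_dir) / "Annotations"
    empty = []
    for xml_path in sorted(annotations.glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        # Assumption: each labelled box is stored as one <object> element (VOC layout).
        if not root.findall("object"):
            empty.append(xml_path.stem)
    return empty
```

Calling find_unannotated("/home/james/net1") should return the image IDs that have an annotation file but no boxes in it, which can then be compared against the images I thought I had labelled.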

I then ran:

$ cd jetson-inference/python/training/detection/ssd
$ python3 train_ssd.py --dataset-type=voc --data=/home/james/net1 --model-dir=/home/james/net1

I’m assuming it’s OK for my data and model paths to both point at the same directory. These are the input and the output, right?

It runs, although there are warnings that some images don’t have annotations (not sure why; in any case I captured many more images than the ones listed). It then appears to fail and drops back to the command prompt with a bunch of ‘failed’ assertion messages, a Python traceback through a couple of scripts, and finally a runtime error. Output is as follows:

$ python3 train_ssd.py --dataset-type=voc --data=/home/james/net1 --model-dir=/home/james/net1
2020-09-16 10:55:23 - Using CUDA...
2020-09-16 10:55:23 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='/home/james/net1', dataset_type='voc', datasets=['/home/james/net1'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2020-09-16 10:55:24 - Prepare training datasets.
warning - image 20200916-101518 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101701 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101703 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101811 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101816 has no box/labels annotations, ignoring from dataset
warning - image 20200916-102235 has no box/labels annotations, ignoring from dataset
2020-09-16 10:55:24 - VOC Labels read from file: ('BACKGROUND', 'BACKGROUND', 'BACKGROUND', 'triangle', 'square', 'circle')
2020-09-16 10:55:24 - Stored labels into file /home/james/net1/labels.txt.
2020-09-16 10:55:24 - Train dataset size: 68
2020-09-16 10:55:24 - Prepare Validation datasets.
warning - image 20200916-101518 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101701 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101703 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101811 has no box/labels annotations, ignoring from dataset
warning - image 20200916-101816 has no box/labels annotations, ignoring from dataset
2020-09-16 10:55:24 - VOC Labels read from file: ('BACKGROUND', 'BACKGROUND', 'BACKGROUND', 'BACKGROUND', 'triangle', 'square', 'circle')
2020-09-16 10:55:24 - Validation dataset size: 22
2020-09-16 10:55:24 - Build network.
2020-09-16 10:55:24 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2020-09-16 10:55:24 - Took 0.51 seconds to load the model.
2020-09-16 10:55:40 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2020-09-16 10:55:40 - Uses CosineAnnealingLR scheduler.
2020-09-16 10:55:40 - Start training from epoch 0.
/home/james/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/home/james/.local/lib/python3.6/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2020-09-16 10:56:54 - Epoch: 0, Step: 10/17, Avg Loss: 10.6713, Avg Regression Loss 3.0818, Avg Classification Loss: 7.5895
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train_ssd.py", line 346, in <module>
    val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
  File "train_ssd.py", line 159, in test
    regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
  File "/home/james/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/james/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 43, in forward
    predicted_locations = predicted_locations[pos_mask, :].reshape(-1, 4)
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
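My reading of the log, as an assumption rather than something confirmed: the training dataset read 6 labels while the validation dataset read 7, with one more BACKGROUND at the front. If the network is built for 6 classes and a class index is simply the name’s position in the label list, then circle would get index 6 during validation, which is outside 0..5 and would trip the `t >= 0 && t < n_classes` assertion. A small sketch of that arithmetic follows; the helper name is hypothetical and this is not the actual train_ssd.py code.

```python
def out_of_range_labels(train_labels, val_labels):
    """Return validation class names whose index does not fit the training class count."""
    n_classes = len(train_labels)
    # Assumption: a class index is the name's position in the list;
    # a repeated name ends up with its last position.
    val_index = {name: i for i, name in enumerate(val_labels)}
    return {name: i for name, i in val_index.items() if i >= n_classes}


# Label tuples copied from the two "VOC Labels read from file" lines in the log.
train_labels = ('BACKGROUND', 'BACKGROUND', 'BACKGROUND',
                'triangle', 'square', 'circle')
val_labels = ('BACKGROUND', 'BACKGROUND', 'BACKGROUND', 'BACKGROUND',
              'triangle', 'square', 'circle')

print(out_of_range_labels(train_labels, val_labels))  # prints {'circle': 6}
```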

I already updated my repo (JetPack 4.4) as per my other thread: 10 lines of code example out of date?

Ignore this for now: I used a separate folder for the model output and I think it’s running OK.

I know the guide mentions pointing to the right labels.txt when you run the model, because training adds the background class to it, but it’s possibly worth stating explicitly in the guide that the model output should go to a separate location.
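For anyone hitting the same thing, a pre-flight check along these lines might catch it before training starts. It is a minimal sketch under two assumptions: that the dataset’s own labels.txt is meant to list only my classes, and that BACKGROUND is added by the training script in the copy it writes to the model folder. The function name check_training_paths is hypothetical.

```python
from pathlib import Path


def check_training_paths(data_dir, model_dir):
    """Return a list of warnings; an empty list means nothing was spotted."""
    problems = []

    # Sharing one folder for input and output is what I did in the failing run.
    if Path(data_dir).resolve() == Path(model_dir).resolve():
        problems.append("--data and --model-dir are the same folder")

    # Assumption: the dataset's labels.txt should not contain BACKGROUND at all.
    labels_file = Path(data_dir) / "labels.txt"
    if labels_file.is_file():
        names = [line.strip() for line in labels_file.read_text().splitlines()]
        count = names.count("BACKGROUND")
        if count:
            problems.append(f"labels.txt in the dataset has {count} BACKGROUND line(s)")

    return problems
```

With my original command, check_training_paths("/home/james/net1", "/home/james/net1") would at least flag the shared folder; whether it also flags labels.txt depends on what is in the file at that point.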