Dusty-nv Jetson: training custom datasets, generating labels

Thank you for clearing this up! I spent all morning getting tf2onnx working on the Jetson, only to find that it didn't work with my API-trained model either, lol!

I will sort out these labels and just re-train with your detectnet. Hopefully I will be able to get this deployed!

Thanks again; I have been pulling my hair out trying to solve this.

Oh, one question: when you say to train on a PC with a discrete GPU, what exactly do you mean by discrete? Is that an argument? I don't see it as an argument in train_ssd.py.

I already have your repo set up and running on my PC in PyCharm, with the same environment libraries as my Jetson.

OK, great, so I got it running; however, I think it is not liking the .pth file I feed it. I am using MobileNet-v2 SSD-Lite from https://storage.googleapis.com/models-hao/mb2-ssd-lite-mp-0_686.pth

and this is the output:

2021-04-16 15:45:12 - Init from pretrained ssd models/mb2-ssd-lite-mp-0_686.pth
2021-04-16 15:45:12 - Took 0.04 seconds to load the model.
2021-04-16 15:45:12 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-04-16 15:45:12 - Uses CosineAnnealingLR scheduler.
2021-04-16 15:45:12 - Start training from epoch 0.
C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
  File "train_ssd.py", line 344, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 114, in train
    for i, data in enumerate(loader):
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>'

(jetson_dev) F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

So close! hehe

I just mean that you have an NVIDIA GPU in your PC. A discrete GPU is a PCIe card that you have plugged into the PC (whereas Jetson has an integrated GPU). In theory you could train on your PC with just CPU, but it may take longer.

Oh OK, yes, I have a Quadro P220 in the PC.

Use the mobilenet-v1-ssd-mp-0_675.pth from the docs. The last I checked, the ONNX export was not working with mb2-ssd-lite either. MB1 and MB2 get similar performance on Jetson. The main thing with MB2 is separable convolutions for resource-limited mobile platforms (think phones), and the Jetson’s GPU has no problem with MB1.

I will do that, as I don't want to waste any more time than I have to. Thanks!

For the first time at least, I would follow how it is done in the tutorial (with ssd-mobilenet-v1), as I haven't verified that the other models with MB2 and VGG backbones work later in the pipeline (during the ONNX export and TensorRT import). That pytorch-ssd code was forked, and I use it mainly for ssd-mobilenet-v1 because that seems to work well.

After you train your model, you can use the eval_ssd.py script from the pytorch-ssd repo to test your output .pth checkpoint on a test image. This will confirm that the PyTorch model itself is good. Then you can convert it to ONNX.
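
Something along these lines (the paths are placeholders for your dataset and checkpoint; the flag names are from the upstream pytorch-ssd repo, so double-check them against the script):

python eval_ssd.py --net mb1-ssd --dataset data/<your-dataset> --trained_model models/<your-model>/<checkpoint>.pth --label_file models/<your-model>/labels.txt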

Thanks a lot, Dusty. I have tried with the model .pth from the docs, and I am still getting this output:

2021-04-16 16:03:22 - Start training from epoch 0.
C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\nn\_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2021-04-16 16:03:40 - Epoch: 0, Step: 10/147, Avg Loss: 8.6335, Avg Regression Loss 3.5030, Avg Classification Loss: 5.1305
Traceback (most recent call last):
  File "train_ssd.py", line 344, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 114, in train
    for i, data in enumerate(loader):
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 435, in __next__
    data = self._next_data()
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataset.py", line 218, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 69, in __getitem__
    image, boxes, labels = self.transform(image, boxes, labels)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\ssd\data_preprocessing.py", line 34, in __call__
    return self.augment(img, boxes, labels)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\transforms\transforms.py", line 55, in __call__
    img, boxes, labels = t(img, boxes, labels)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\transforms\transforms.py", line 280, in __call__
    if overlap.min() < min_iou and max_iou < overlap.max():
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
    return umr_minimum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation minimum which has no identity

At least it's working to an extent :p

It might be that there is an annotation XML in your dataset that is malformed/corrupted, has no bounding boxes, or names a class that is not in labels.txt.

What I recommend is uncommenting this line of code (it will print out image IDs and bounding-box info as they are loaded):

https://github.com/dusty-nv/pytorch-ssd/blob/e7b5af50a157c50d3bab8f55089ce57c2c812f37/vision/datasets/voc_dataset.py#L76

And then run train_ssd.py with --batch-size=1 --workers=1 --debug-steps=1

Then, when the exception happens, look above to see the most recent image ID that was printed out.
Then inspect that image's XML file to see what is different about it (or just remove it from your ImageSet lists).
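
In other words, something like this (the --data and --model-dir paths are placeholders for whatever you normally pass):

python train_ssd.py --dataset-type=voc --data=data/<your-dataset> --model-dir=models/<your-model> --batch-size=1 --workers=1 --debug-steps=1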

I had a look, and the image exists, as does its corresponding XML file. I searched for the error on Google and found a GitHub conversation that I now cannot find again! Typical. The person mentioned that if they have one label and it is set to difficult, they can recreate the error every time. They said you need to set keep_difficult to true. I don't know where this needs to be set, and I cannot find the conversation again to ask!

I checked the XML file and, lo and behold, it had difficult set in it. Do you know where I need to turn this keep_difficult setting on in the code?

Thanks :)

EDIT: I set difficult to 0 in the XML, and it processed that file fine until the next one with difficult. Does the difficult setting really have that big an impact on training, do you think? If not, I can write a quick script to simply go through all the XMLs and change it to 0. But I don't want to do that if it will really impact the training of hard-to-discern labels.
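
Something like this quick sketch is what I had in mind (assuming the usual VOC Annotations/ folder layout):

import glob
import xml.etree.ElementTree as ET

# set <difficult> to 0 in every annotation XML
for path in glob.glob("Annotations/*.xml"):
    tree = ET.parse(path)
    changed = False
    for node in tree.iter("difficult"):
        if node.text and node.text.strip() != "0":
            node.text = "0"
            changed = True
    if changed:
        tree.write(path)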

EDIT: found your answer to this here: https://forums.developer.nvidia.com/t/successful-training-with-train-ssd-py-using-small-custom-data-set-but-error-on-full-data-set/156921/6
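
For anyone else hitting this: the flag is passed where the VOC training dataset is created in train_ssd.py, i.e.

dataset = VOCDataset(dataset_path, keep_difficult=True, transform=train_transform,
                     target_transform=target_transform)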

It now trains; thank you so much for your help. Is there any info generated by this trainer that can be loaded into TensorBoard, or would I have to code that into it myself?

Once again, thank you!!

EDIT: FYI, for future reference, I had to set the number of workers to 0 for it to run; otherwise, the multiprocessing error would happen.
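
(On Windows, the DataLoader workers are spawned as new processes and have to pickle the dataset, and the lambdas inside TrainAugmentation can't be pickled; that is the AttributeError further up. --workers=0 keeps data loading in the main process, e.g.:)

python train_ssd.py --dataset-type=voc --data=data/<your-dataset> --model-dir=models/<your-model> --workers=0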

OK, gotcha. Glad you were able to get it working by setting keep_difficult=True. Regarding TensorBoard, you would need to install/integrate it yourself (presumably using torch.utils.tensorboard). The training metrics that you would send to TensorBoard would probably be the same ones that get printed out to the console during training (the average loss, regression loss, and classification loss).
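
A minimal sketch of what that might look like (the log directory, tag names, and log_metrics helper are placeholders of my own; feed it the loss values computed where train_ssd.py prints its console metrics):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/ssd-mobilenet")  # log directory is arbitrary

# call this wherever the console metrics get printed during training
def log_metrics(avg_loss, avg_reg_loss, avg_clf_loss, step):
    writer.add_scalar("Loss/avg", avg_loss, step)
    writer.add_scalar("Loss/regression", avg_reg_loss, step)
    writer.add_scalar("Loss/classification", avg_clf_loss, step)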

Top notch! Thank you!

Hey, sorry to bother you again (I can make this a new post if need be), but I had further issues with difficult in the validation set, then realised I needed to put that into the val code too.

However, when I put keep_difficult into the validation prep, it throws an error. So I removed keep_difficult (I thought I would just edit the XML files, since the val set is not huge!), but even after removing keep_difficult I am still getting the error!

It was working fine before, doh!

I cannot for the life of me figure out what broke. I even copied back the original train_ssd.py from GitHub to ensure the code was the same, and still the error happens! Baffled.

I am testing on a smaller, controlled training set for now, so I can iron out any errors before the proper training begins!

Output:

2021-04-22 15:02:01 - Prepare Validation datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 244, in <module>
    target_transform=target_transform, is_test=True)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 24, in __init__
    self.ids = self._read_image_ids(image_sets_file)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 99, in _read_image_ids
    if self._get_num_annotations(image_id) > 0:
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 108, in _get_num_annotations
    objects = ET.parse(annotation_file).findall("object")
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\xml\etree\ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: mismatched tag: line 20, column 13

Code:

    # load datasets (could be multiple)
    logging.info("Prepare training datasets.")
    datasets = []
    for dataset_path in args.datasets:
        if args.dataset_type == 'voc':
            dataset = VOCDataset(dataset_path, keep_difficult=True, transform=train_transform,
                                 target_transform=target_transform)
            label_file = os.path.join(args.checkpoint_folder, "labels.txt")
            store_labels(label_file, dataset.class_names)
            num_classes = len(dataset.class_names)
        elif args.dataset_type == 'open_images':
            dataset = OpenImagesDataset(dataset_path,
                                        transform=train_transform, target_transform=target_transform,
                                        dataset_type="train", balance_data=args.balance_data)
            label_file = os.path.join(args.checkpoint_folder, "labels.txt")
            store_labels(label_file, dataset.class_names)
            logging.info(dataset)
            num_classes = len(dataset.class_names)

        else:
            raise ValueError(f"Dataset type {args.dataset_type} is not supported.")
        datasets.append(dataset)
        
    # create training dataset
    logging.info(f"Stored labels into file {label_file}.")
    train_dataset = ConcatDataset(datasets)
    logging.info("Train dataset size: {}".format(len(train_dataset)))
    train_loader = DataLoader(train_dataset, args.batch_size,
                              num_workers=args.num_workers,
                              shuffle=True)

    # create validation dataset                           
    logging.info("Prepare Validation datasets.")
    if args.dataset_type == "voc":
        val_dataset = VOCDataset(dataset_path, transform=test_transform,
                                 target_transform=target_transform, is_test=True)
    elif args.dataset_type == 'open_images':
        val_dataset = OpenImagesDataset(dataset_path,
                                        transform=test_transform, target_transform=target_transform,
                                        dataset_type="test")
        logging.info(val_dataset)
    logging.info("Validation dataset size: {}".format(len(val_dataset)))

    val_loader = DataLoader(val_dataset, args.batch_size,
                            num_workers=args.num_workers,
                            shuffle=False)

EDIT: Never mind, it was corrupt data in an XML file. Fixed it :D It appears to be training now; fingers crossed it goes the distance!

OK, great! Yeah, that was my first thought when I saw the mismatched-tag error: typically that means a malformed XML file.
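
If it crops up again, a quick way to find every malformed annotation up front (a sketch, assuming the standard VOC Annotations/ folder):

import glob
import xml.etree.ElementTree as ET

# report any annotation XML that fails to parse
for path in glob.glob("Annotations/*.xml"):
    try:
        ET.parse(path)
    except ET.ParseError as e:
        print(path, e)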

Oh man :/ So I finished training a test model. I used onnx_export on both the PC and the Jetson and tried to load the ONNX into detectnet… unfortunately, it throws the same error as when I tried with the ONNX from my API model :/

So back to square one… sigh.

Any suggestions, please? Cheers.

Error:

[TRT]    TensorRT version 7.1.3
[TRT]    loading NVIDIA plugins...
[TRT]    Registered plugin creator - ::GridAnchor_TRT version 1
[TRT]    Registered plugin creator - ::NMS_TRT version 1
[TRT]    Registered plugin creator - ::Reorg_TRT version 1
[TRT]    Registered plugin creator - ::Region_TRT version 1
[TRT]    Registered plugin creator - ::Clip_TRT version 1
[TRT]    Registered plugin creator - ::LReLU_TRT version 1
[TRT]    Registered plugin creator - ::PriorBox_TRT version 1
[TRT]    Registered plugin creator - ::Normalize_TRT version 1
[TRT]    Registered plugin creator - ::RPROI_TRT version 1
[TRT]    Registered plugin creator - ::BatchedNMS_TRT version 1
[TRT]    Could not register plugin creator -  ::FlattenConcat_TRT version 1
[TRT]    Registered plugin creator - ::CropAndResize version 1
[TRT]    Registered plugin creator - ::DetectionLayer_TRT version 1
[TRT]    Registered plugin creator - ::Proposal version 1
[TRT]    Registered plugin creator - ::ProposalLayer_TRT version 1
[TRT]    Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TRT]    Registered plugin creator - ::ResizeNearest_TRT version 1
[TRT]    Registered plugin creator - ::Split version 1
[TRT]    Registered plugin creator - ::SpecialSlice_TRT version 1
[TRT]    Registered plugin creator - ::InstanceNormalization_TRT version 1
[TRT]    detected model format - custom  (extension '.')
[TRT]    model format 'custom' not supported by jetson-inference
[TRT]    detectNet -- failed to initialize.
jetson.inference -- detectNet failed to load network
Traceback (most recent call last):
  File "/home/nvidia/Highlight/infer_onnx.py", line 8, in <module>
    net = jetson.inference.detectNet(argv=["--model=model_path", "--labels=labels_path", "--threshold=0.5", "--input-blob=input_0", "--output-cvg=scores", "--output-bbox=boxes"])
Exception: jetson.inference -- detectNet failed to load network

Code:

import jetson.inference
import jetson.utils
import pathlib


model_path = pathlib.Path("first_test.onnx")
labels_path = pathlib.Path("object-detection.pbtxt")
net = jetson.inference.detectNet(argv=["--model=model_path", "--labels=labels_path", "--threshold=0.5", "--input-blob=input_0", "--output-cvg=scores", "--output-bbox=boxes"])

camera = jetson.utils.videoSource("file://test1.mp4")
display = jetson.utils.videoOutput("display://0")

while display.IsStreaming():
	img = camera.Capture()
	detections = net.Detect(img)
	display.Render(img)
	display.SetStatus("Object Detection | Network {:.0f} FPS".format(net.GetNetworkFPS()))

Try running it with just detectnet.py first on a test image (and then on your test video).
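
For example (the model/labels paths are placeholders for your exported ONNX and the labels.txt from training):

detectnet.py --model=models/<your-model>/first_test.onnx --labels=models/<your-model>/labels.txt --input-blob=input_0 --output-cvg=scores --output-bbox=boxes "images/test.jpg" result.jpg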

I think your paths aren't getting formatted into the arg strings correctly. First, try hardcoding the actual paths in there. Then try something like:

net = jetson.inference.detectNet(argv=["--model={:s}".format(str(model_path)), "--labels={:s}".format(str(labels_path)), "--threshold=0.5", "--input-blob=input_0", "--output-cvg=scores", "--output-bbox=boxes"])
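
Or, equivalently, with f-strings:

net = jetson.inference.detectNet(argv=[f"--model={model_path}", f"--labels={labels_path}", "--threshold=0.5", "--input-blob=input_0", "--output-cvg=scores", "--output-bbox=boxes"])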

Also, what is the object-detection.pbtxt file? The labels file is expected to be a simple text file with one class per line, for example:

BACKGROUND
car
bike
person

The labels.txt that train_ssd.py outputs should always have BACKGROUND as the first class (it adds this background class), and you should be using that labels.txt file with detectnet.

Awesome, it loads! And it works with detectnet.py.

One last question, please, and then I can leave you alone and get on with it :D I need to be able to control this training a bit more. The model I trained using the API was configured using the config files from the model zoo; I had done 180k training steps, which took a long time.

I noticed this trainer does not specify steps but epochs, and I am not sure how to translate my setup into this trainer, as clearly 30 epochs is nowhere near enough to train this. It should take more like 20 hours, not 2.

This was my config file previously:

model {
  ssd {
    num_classes: 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 12
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "legacy/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
  fine_tune_checkpoint_type:  "detection"
  num_steps: 180000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "legacy/data/train.record"
  }
  label_map_path: "legacy/training/object-detection.pbtxt"
}

eval_config: {
  # num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  num_visualizations: 45
  min_score_threshold: 0.35
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "legacy/data/test.record"
  }
  label_map_path: "legacy/training/object-detection.pbtxt"
  shuffle: false
  num_readers: 1
}

I will be integrating TensorBoard and will post the code when done, for anyone else who wants it.

I'm not sure what the parlance of 'steps' is in TensorFlow; here is a post about it, though:

https://stackoverflow.com/questions/38340311/what-is-the-difference-between-steps-and-epochs-in-tensorflow

It sounds like one 'step' in TF is one batch. In your config, you have a batch size of 12, so 180K steps * 12 = 2,160K images trained. Divide 2,160K by the number of images in your training set, and that will give you the approximate number of epochs that would be equivalent.
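
For example, with a hypothetical training set of 5,000 images:

steps = 180000       # num_steps from your train_config
batch_size = 12      # batch_size from your train_config
dataset_size = 5000  # hypothetical; substitute your actual image count

images_seen = steps * batch_size      # 2,160,000 images
epochs = images_seen / dataset_size   # 432 epochs in this example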

I trained my models for 100 epochs and that seemed to be good.

Superb! Thank you so much for all your help. I have learned a lot in this exchange; you have saved me!

Thanks Dusty!