Dusty-nv Jetson training: custom datasets, generating labels

Hey, sorry to bother you again (I can make this a new post if need be), but I had further issues with difficult objects in the validation set, and then realised I need to pass that option into the validation code as well.

However, when I put keep_difficult into the validation prep it throws an error. So I removed keep_difficult (I thought I would just edit the XML files, since the val set is not huge!), but even after removing keep_difficult I am still getting the error!

It was working fine before, doh!

I cannot for the life of me figure out what broke. I even copied back the original train_ssd.py from the GitHub repo to make sure the code was the same, and the error still happens! Baffled.

I am testing on a smaller, controlled training set for now so I can iron out any errors before the proper training begins!

Output:

2021-04-22 15:02:01 - Prepare Validation datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 244, in <module>
    target_transform=target_transform, is_test=True)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 24, in __init__
    self.ids = self._read_image_ids(image_sets_file)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 99, in _read_image_ids
    if self._get_num_annotations(image_id) > 0:
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 108, in _get_num_annotations
    objects = ET.parse(annotation_file).findall("object")
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\xml\etree\ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: mismatched tag: line 20, column 13

code:

    # load datasets (could be multiple)
    logging.info("Prepare training datasets.")
    datasets = []
    for dataset_path in args.datasets:
        if args.dataset_type == 'voc':
            dataset = VOCDataset(dataset_path, keep_difficult=True, transform=train_transform,
                                 target_transform=target_transform)
            label_file = os.path.join(args.checkpoint_folder, "labels.txt")
            store_labels(label_file, dataset.class_names)
            num_classes = len(dataset.class_names)
        elif args.dataset_type == 'open_images':
            dataset = OpenImagesDataset(dataset_path,
                                        transform=train_transform, target_transform=target_transform,
                                        dataset_type="train", balance_data=args.balance_data)
            label_file = os.path.join(args.checkpoint_folder, "labels.txt")
            store_labels(label_file, dataset.class_names)
            logging.info(dataset)
            num_classes = len(dataset.class_names)

        else:
            raise ValueError(f"Dataset type {args.dataset_type} is not supported.")
        datasets.append(dataset)
        
    # create training dataset
    logging.info(f"Stored labels into file {label_file}.")
    train_dataset = ConcatDataset(datasets)
    logging.info("Train dataset size: {}".format(len(train_dataset)))
    train_loader = DataLoader(train_dataset, args.batch_size,
                              num_workers=args.num_workers,
                              shuffle=True)

    # create validation dataset                           
    logging.info("Prepare Validation datasets.")
    if args.dataset_type == "voc":
        val_dataset = VOCDataset(dataset_path, transform=test_transform,
                                 target_transform=target_transform, is_test=True)
    elif args.dataset_type == 'open_images':
        val_dataset = OpenImagesDataset(dataset_path,
                                        transform=test_transform, target_transform=target_transform,
                                        dataset_type="test")
        logging.info(val_dataset)
    logging.info("Validation dataset size: {}".format(len(val_dataset)))

    val_loader = DataLoader(val_dataset, args.batch_size,
                            num_workers=args.num_workers,
                            shuffle=False)

EDIT: Never mind, it was corrupt data in an XML file. Fixed it :D It appears to be training now; fingers crossed it goes the distance!

OK great! Yeah, that was my first thought when I saw the mismatched tag error - typically that means a malformed XML file.
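
If it helps for next time, a quick way to find the offending file is just to try parsing every annotation up front. Here is a rough sketch, assuming a standard VOC-style Annotations/ folder (the directory name is a placeholder):

import glob
import os
import xml.etree.ElementTree as ET

# hypothetical path -- point this at your dataset's Annotations folder
annotations_dir = "Annotations"

for xml_file in sorted(glob.glob(os.path.join(annotations_dir, "*.xml"))):
    try:
        ET.parse(xml_file)
    except ET.ParseError as e:
        # prints the file name plus the line/column of the bad tag
        print("malformed XML: {} ({})".format(xml_file, e))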

Oh man :/ So I finished training a test model. I used onnx_export on both the PC and the Jetson and tried to load the ONNX into detectnet… unfortunately it throws the same error as when I tried with the ONNX from my API-trained model :/

So back to square one… sigh.

Any suggestions, please? Cheers.

error:

[TRT]    TensorRT version 7.1.3
[TRT]    loading NVIDIA plugins...
[TRT]    Registered plugin creator - ::GridAnchor_TRT version 1
[TRT]    Registered plugin creator - ::NMS_TRT version 1
[TRT]    Registered plugin creator - ::Reorg_TRT version 1
[TRT]    Registered plugin creator - ::Region_TRT version 1
[TRT]    Registered plugin creator - ::Clip_TRT version 1
[TRT]    Registered plugin creator - ::LReLU_TRT version 1
[TRT]    Registered plugin creator - ::PriorBox_TRT version 1
[TRT]    Registered plugin creator - ::Normalize_TRT version 1
[TRT]    Registered plugin creator - ::RPROI_TRT version 1
[TRT]    Registered plugin creator - ::BatchedNMS_TRT version 1
[TRT]    Could not register plugin creator -  ::FlattenConcat_TRT version 1
[TRT]    Registered plugin creator - ::CropAndResize version 1
[TRT]    Registered plugin creator - ::DetectionLayer_TRT version 1
[TRT]    Registered plugin creator - ::Proposal version 1
[TRT]    Registered plugin creator - ::ProposalLayer_TRT version 1
[TRT]    Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TRT]    Registered plugin creator - ::ResizeNearest_TRT version 1
[TRT]    Registered plugin creator - ::Split version 1
[TRT]    Registered plugin creator - ::SpecialSlice_TRT version 1
[TRT]    Registered plugin creator - ::InstanceNormalization_TRT version 1
[TRT]    detected model format - custom  (extension '.')
[TRT]    model format 'custom' not supported by jetson-inference
[TRT]    detectNet -- failed to initialize.
jetson.inference -- detectNet failed to load network
Traceback (most recent call last):
  File "/home/nvidia/Highlight/infer_onnx.py", line 8, in <module>
    net = jetson.inference.detectNet(argv=["--model=model_path", "--labels=labels_path", "--threshold=0.5", "--input-blob=input_0", "--output-cvg=scores", "--output-bbox=boxes"])
Exception: jetson.inference -- detectNet failed to load network

code:

import jetson.inference
import jetson.utils
import pathlib


model_path = pathlib.Path("first_test.onnx")
labels_path = pathlib.Path("object-detection.pbtxt")
net = jetson.inference.detectNet(argv=["--model=model_path", "--labels=labels_path", "--threshold=0.5", "--input-blob=input_0", "--output-cvg=scores", "--output-bbox=boxes"])

camera = jetson.utils.videoSource("file://test1.mp4")
display = jetson.utils.videoOutput("display://0")

while display.IsStreaming():
	img = camera.Capture()
	detections = net.Detect(img)
	display.Render(img)
	display.SetStatus("Object Detection | Network {:.0f} FPS".format(net.GetNetworkFPS()))

Try running it with just detectnet.py first on a test image (and then your test video).

I think your paths aren’t getting formatted into the arg strings correctly. First try hardcoding the actual paths in there. Then try something like:

net = jetson.inference.detectNet(argv=["--model={:s}".format(str(model_path)), "--labels={:s}".format(str(labels_path)), "--threshold=0.5", "--input-blob=input_0", "--output-cvg=scores", "--output-bbox=boxes"])

Also, what is the object-detection.pbtxt file? The labels file is expected to be a simple text file with one class per line - for example:

BACKGROUND
car
bike
person

The labels.txt that train_ssd.py outputs should always have BACKGROUND as the first class (it adds this background class), and you should be using that labels.txt file with detectnet.
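
Putting those two suggestions together, the loading part of the script would look roughly like this (the model and labels filenames here are just placeholders; use your exported ONNX and the labels.txt that train_ssd.py wrote):

import jetson.inference

# placeholder paths -- substitute your exported model and the labels.txt
# generated by train_ssd.py (with BACKGROUND on the first line)
model_path = "first_test.onnx"
labels_path = "labels.txt"

# format the actual path strings into the argv entries
net = jetson.inference.detectNet(argv=[
    "--model={:s}".format(model_path),
    "--labels={:s}".format(labels_path),
    "--threshold=0.5",
    "--input-blob=input_0",
    "--output-cvg=scores",
    "--output-bbox=boxes"])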

Awesome, it loads! And it works with detectnet.py.

One last question, please, and then I can leave you alone and get on with it :D I need to be able to control this training a bit more. The model I trained using the API was configured with the config files from the model zoo; I did 180k training steps, which took a long time.

I noticed this trainer does not specify steps but epochs. I am not sure how to translate my setup into this trainer, as clearly 30 epochs is nowhere near enough to train this. It should take more like 20 hours, not 2.

This was my config file previously:

model {
  ssd {
    num_classes: 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 12
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "legacy/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
  fine_tune_checkpoint_type:  "detection"
  num_steps: 180000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "legacy/data/train.record"
  }
  label_map_path: "legacy/training/object-detection.pbtxt"
}

eval_config: {
  # num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  num_visualizations: 45
  min_score_threshold: 0.35
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "legacy/data/test.record"
  }
  label_map_path: "legacy/training/object-detection.pbtxt"
  shuffle: false
  num_readers: 1
}

I will be integrating TensorBoard and will post the code when done, for anyone else who wants it.

I’m not sure what the parlance of ‘steps’ is in TensorFlow; here is a post about it, though:

https://stackoverflow.com/questions/38340311/what-is-the-difference-between-steps-and-epochs-in-tensorflow

It sounds like one ‘step’ in TF is one batch. In your config, you have a batch size of 12. So, 180K steps * 12 = 2,160K images trained. Divide 2,160K by the number of images in your training set, and that will give you the approximate number of epochs that would be equivalent.
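
As a rough back-of-the-envelope conversion (the training-set size below is just an example number, not from your dataset):

steps = 180000
batch_size = 12
num_train_images = 5000            # example value -- use your own image count
images_seen = steps * batch_size   # 2,160,000 images processed in total
epochs = images_seen / num_train_images
print(epochs)                      # 432.0 epochs for this example dataset size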

I trained my models for 100 epochs and that seemed to be good.

Superb, thank you so much for all your help. I have learned a lot in this exchange; you have saved me!

Thanks Dusty!