Label files generated by tlt-infer

I have pruned and retrained a Detectnet_v2 model using my own data for detecting the level of liquid in a tank, and ended up with these precision results for my classes:

Validation cost: 0.000003
Mean average_precision (in %): 99.2907

class name      average precision (in %)
------------  --------------------------
high                             98.9247
low                              99.6139
medium                           99.3334

I then ran tlt-infer on the testing data set, and most of the generated images had bounding boxes for two or more of the classes even though there is only one tank in each image. I was expecting a single bounding box to be generated. Here is an example of a label file generated for an image where the tank level is clearly high:

low 0.00 0 0.00 53.525 146.285 469.457 436.788 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.626
medium 0.00 0 0.00 69.721 73.574 467.763 365.560 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11.854
high 0.00 0 0.00 55.007 144.368 469.227 441.838 0.00 0.00 0.00 0.00 0.00 0.00 0.00 275.845

When tlt-infer generates the label files, what does the value at the end of the line for each class refer to? It seems to be a confidence value of sorts, but I am unsure, as it varies quite a bit across the other label files I checked and I was expecting a value between 0 and 1.0. How can I configure tlt-infer so that it only generates the bounding box for the single class with the highest confidence? Here is my inference spec sheet:

inferencer_config{
  target_classes: "low"
  target_classes: "medium"
  target_classes: "high"
  image_width: 704
  image_height: 480
  image_channels: 3
  batch_size: 8
  gpu_index: 0
  tlt_config{
    model: "/workspace/tlt-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt"
  }
}
bbox_handler_config{
  kitti_dump: true
  disable_overlay: false
  overlay_linewidth: 2
  classwise_bbox_handler_config{
    key:"low"
    value: {
      confidence_model: "aggregate_cov"
      output_map: "low"
      confidence_threshold: 0.9
      bbox_color{
        R: 255
        G: 0
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: 0.3
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  classwise_bbox_handler_config{
    key:"medium"
    value: {
      confidence_model: "aggregate_cov"
      output_map: "medium"
      confidence_threshold: 0.9
      bbox_color{
        R: 0
        G: 255
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: 0.7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  classwise_bbox_handler_config{
    key:"high"
    value: {
      confidence_model: "aggregate_cov"
      output_map: "high"
      confidence_threshold: 0.9
      bbox_color{
        R: 0
        G: 0
        B: 255
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: 0.7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  classwise_bbox_handler_config{
    key:"default"
    value: {
      confidence_model: "aggregate_cov"
      confidence_threshold: 0.9
      bbox_color{
        R: 255
        G: 255
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: 0.7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
}

There are two confidence modes mentioned in the TLT user guide. See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#bbox_handler

In aggregate_cov mode, the final confidence is the sum of the coverage values of all the bboxes in the cluster, so it is not bounded to 0.0 - 1.0; that is why you can see values like 275.845.

In mean_cov mode, the final confidence is the mean coverage of all the bboxes in the cluster, and it falls in the range 0.0 - 1.0.
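As far as I know there is no tlt-infer option to keep only the single best class, but since you have kitti_dump enabled, one workaround is to post-process the generated label files and keep only the highest-confidence line per image. A minimal sketch, assuming (as in your example) that the confidence is the last field on each line:

```python
import os

def keep_top_detection(label_path):
    """Return the detection line with the highest confidence
    (tlt-infer appends the confidence as the last field)."""
    with open(label_path) as f:
        lines = [line.split() for line in f if line.strip()]
    if not lines:
        return None
    best = max(lines, key=lambda fields: float(fields[-1]))
    return " ".join(best)

def filter_label_dir(label_dir):
    """Rewrite every dumped KITTI label file in place, keeping
    only the top-confidence detection."""
    for name in os.listdir(label_dir):
        if name.endswith(".txt"):
            path = os.path.join(label_dir, name)
            top = keep_top_detection(path)
            if top is not None:
                with open(path, "w") as f:
                    f.write(top + "\n")
```

Running filter_label_dir on the directory where tlt-infer wrote the KITTI dumps would, for your example label file, keep only the "high" line (confidence 275.845).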

Thank you for pointing me to where that was in the TLT documentation. It makes a lot more sense now where that value comes from.

I also have a question about the parameters for the cost function in the training spec sheet. The TLT documentation recommends not changing the values used for the classes in the examples, but how should we set these parameters when training on our own data set with different classes, specifically the class_weight and the initial/target weights? I tried to find information on this, but the best I could come up with was to set the weight to represent each class's share of the data set. What is your recommendation for this?

You can find some info in NVIDIA Metropolis Documentation

Thank you, that link does clarify how I should define the class weights. I am still a little confused about how to choose the initial and target weights for the objectives within each class, as I'm unsure what those parameters represent. From other people's spec sheets, it seems these are commonly used settings:


    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }

Is it best to just stick with those settings? Also, I tried to export the model in both fp16 and int8 data types and got this error:


AttributeError: Specified FP16 but not supported on platform.

and had to settle for fp32. I plan to deploy this to a Jetson Nano and am worried about compatibility/performance issues with an fp32 model. Is it possible to run the tlt-export command on the Jetson Nano with the .tlt model to create the .etlt model there?

For initial_weight and weight_target, see this reference: How to set initial_weight and weight_target at detectnet_v2 spec file?

For “AttributeError: Specified FP16 but not supported on platform.”, I am afraid the GPU inside your host PC does not support fp16. See more in Support Matrix :: NVIDIA Deep Learning TensorRT Documentation and CUDA GPUs - Compute Capability | NVIDIA Developer
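As a quick sanity check before exporting with fp16, one can compare the GPU's compute capability against the FP16-capable set; this sketch uses the capability list as given in the TensorRT support matrix linked above, which is worth re-verifying for newer GPUs:

```python
# Compute capabilities with native FP16 support per the TensorRT
# support matrix: 5.3 (Jetson TX1/Nano), 6.0 (Pascal P100),
# 6.2 (Jetson TX2), and 7.x and above (Volta/Turing).
# Notably, 6.1 (GTX 10-series desktop cards) is not on the list.
FP16_CAPABLE = {(5, 3), (6, 0), (6, 2)}

def supports_native_fp16(major, minor):
    """Return True if a GPU with this compute capability can
    build FP16 TensorRT engines."""
    return (major, minor) in FP16_CAPABLE or major >= 7
```

For example, a GTX 10-series desktop GPU (compute capability 6.1) would hit the "Specified FP16 but not supported on platform" error, while a Jetson Nano (5.3) or a Turing card (7.5) would not.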

Note that the .etlt model is always fp32, even if you set another mode when you run tlt-export. You can run tlt-export on the host PC to generate the .etlt model.

The Nano supports fp32 and fp16, so you can deploy the .etlt model on it.
Or, you can use tlt-converter to generate a TensorRT engine for deployment.

You are right, my GPU doesn't support either of those formats, so I've purchased a more up-to-date GPU so that I can experiment with the other two formats. In the meantime, I've gone ahead, imported the model to the Jetson Nano, and converted it into the .engine file, but I'm having trouble getting the same detections in DeepStream as I did in the Jupyter notebook. I'll move over to the DeepStream forums now to bring up my issues there.

Thank you for your help Morganh.

I’ve gone back and trained another model (a ResNet18 model using the Detectnet_v2 object detection architecture), modifying the spec sheets using the information you shared, Morganh, but I’m still having problems with inference. My new model has a mean average precision similar to my first model’s, but it still performs badly in inference using the mean_cov confidence model. Here is some information on my data set that may help diagnose the problem. The data set contains images of a tank holding various levels of a liquid, captured throughout the day and night. All of the images used for training and testing have the same angle on the tank. The data set is split up into the following classes:

Low: 2165
Medium: 1379
High: 1276
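Incidentally, the class_weight values in the spec below are consistent with the inverse-frequency heuristic discussed earlier (this is my reading, not something the docs mandate): weight = total / class_count, truncated to two decimal places.

```python
import math

# Class counts from the data set above.
counts = {"low": 2165, "medium": 1379, "high": 1276}
total = sum(counts.values())  # 4820

# Inverse-frequency weights, truncated to two decimals; these
# reproduce the class_weight values 2.22, 3.49, and 3.77.
class_weights = {c: math.floor(100 * total / n) / 100
                 for c, n in counts.items()}
```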

Also here is my new training spec sheet:


random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "png"
  target_class_mapping {
    key: "low"
    value: "low"
  }
  target_class_mapping {
    key: "medium"
    value: "medium"
  }
  target_class_mapping {
    key: "high"
    value: "high"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 704
    output_image_height: 480
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 0.0
    translate_max_y: 0.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "low"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps: .7
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 75
      }
    }
  }
  target_class_config {
    key: "medium"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps: .7
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 75
      }
    }
  }
  target_class_config {
    key: "high"
    value {
      clustering_config {
        coverage_threshold: 0.00749999983236
        dbscan_eps: .7
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 75
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/pretrained_resnet18/tlt_pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
  num_layers: 18
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 30
  minimum_detection_ground_truth_overlap {
    key: "low"
    value: 0.80
  }
  minimum_detection_ground_truth_overlap {
    key: "medium"
    value: 0.80
  }
  minimum_detection_ground_truth_overlap {
    key: "high"
    value: 0.80
  }
  evaluation_box_config {
    key: "low"
    value {
      minimum_height: 10
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "medium"
    value {
      minimum_height: 10
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "high"
    value {
      minimum_height: 10
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "low"
    class_weight: 2.22
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "medium"
    class_weight: 3.49
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "high"
    class_weight: 3.77
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "low"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "medium"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "high"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

And here is my new inference spec sheet:


inferencer_config{
  # defining target class names for the experiment.
  # Note: This must be mentioned in order of the networks classes.
  target_classes: "low"
  target_classes: "medium"
  target_classes: "high"
  # Inference dimensions.
  image_width: 704
  image_height: 480
  # Must match what the model was trained for.
  image_channels: 3
  batch_size: 8
  gpu_index: 0
  # model handler config
  tlt_config{
    model: "/workspace/tlt-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt"
  }
}
bbox_handler_config{
  kitti_dump: true
  disable_overlay: false
  overlay_linewidth: 2
  classwise_bbox_handler_config{
    key:"low"
    value: {
      confidence_model: "mean_cov"
      output_map: "low"
      confidence_threshold: 0.4
      bbox_color{
        R: 255
        G: 0
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: .7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  classwise_bbox_handler_config{
    key:"medium"
    value: {
      confidence_model: "mean_cov"
      output_map: "medium"
      confidence_threshold: 0.4
      bbox_color{
        R: 0
        G: 255
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: .7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  classwise_bbox_handler_config{
    key:"high"
    value: {
      confidence_model: "mean_cov"
      output_map: "high"
      confidence_threshold: 0.4
      bbox_color{
        R: 0
        G: 0
        B: 255
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: .7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  classwise_bbox_handler_config{
    key:"default"
    value: {
      confidence_model: "mean_cov"
      confidence_threshold: 0.4
      bbox_color{
        R: 255
        G: 255
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.00
        dbscan_eps: .7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
}

A confidence threshold of 0.4 is the best the model could achieve, and even then there were images that were not annotated at all. Is this a case of a too-small data set, or is there something I can do to improve the parameters? I’m at a loss as to why the model can’t achieve better inference performance when the mean average precision is so high. Looking forward to hearing your perspective on this.

Firstly, could you attach a screenshot illustrating what you mean by “it still performs badly”?

Also, from your training spec, I can see that minimum_bounding_box_height is set to 75. What are the average width and height of the objects in your training images? If a bbox’s height is lower than 75, the bbox will be ignored. So please set a smaller minimum_bounding_box_height, and run tlt-evaluate directly to check what mAP your model can get.
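To check this quickly, one could scan the KITTI training labels and compare the bbox height distribution against the 75-pixel threshold. A sketch (the label directory path is a placeholder for wherever your KITTI label files live; in the KITTI format the bbox is fields 4-7: left, top, right, bottom):

```python
import glob
import os

def bbox_heights(label_dir):
    """Collect bbox heights (bottom - top) from all KITTI label
    files in a directory."""
    heights = []
    for path in glob.glob(os.path.join(label_dir, "*.txt")):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 8:
                    heights.append(float(fields[7]) - float(fields[5]))
    return heights

if __name__ == "__main__":
    # Placeholder path; point this at your training label directory.
    hs = bbox_heights("/workspace/tlt-experiments/data/training/labels")
    if hs:
        print(f"min={min(hs):.1f} mean={sum(hs) / len(hs):.1f} max={max(hs):.1f}")
        print(f"below 75 px: {sum(h < 75 for h in hs)} of {len(hs)}")
```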