Unable to detect object after training

Hi,
I have trained my model with resnet18 for a single class (person). The training spec file detectnet_v2_train_resnet18_kitti.txt is:

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "person"
    value: "person"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1280
    output_image_height: 720
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "person"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/pretrained_resnet18/tlt_resnet18_detectnet_v2_v1/resnet18.hdf5"
  num_layers: 18
  use_batch_norm: true
  activation {
    activation_type: "relu"
  }
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.699999988079
  }
  evaluation_box_config {
    key: "person"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "person"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 12
  num_epochs: 3500
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "person"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

Logs after training completed:

INFO:tensorflow:Saving checkpoints for step-399000.
2020-02-02 14:23:51,160 [INFO] tensorflow: Saving checkpoints for step-399000.
2020-02-02 14:23:51,496 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 18, 0.00s/step
2020-02-02 14:23:53,434 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 18, 0.19s/step
Matching predictions to ground truth, class 1/1.: 100%|#| 319/319 [00:00<00:00, 9364.71it/s]
Epoch 3500/3500
=========================

Validation cost: 0.000086
Mean average_precision (in %): 82.3062

class name      average precision (in %)
------------  --------------------------
person                           82.3062

Median Inference Time: 0.013935
2020-02-02 14:23:55,099 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 21.359
Time taken to run iva.detectnet_v2.scripts.train:main: 1 day, 23:34:05.290584.

The JSON file detectnet_v2_clusterfile_kitti.json used for inference is:

{
    "dbscan_criterion": "IOU",
    "dbscan_eps": {
        "person": 0.25,
        "default": 0.15
    },
    "dbscan_min_samples": {
        "person": 0.05,
        "default": 0.0
    },
    "min_cov_to_cluster": {
        "person": 0.005,
        "default": 0.005
    },
    "min_obj_height": {
        "person": 4,
        "default": 2
    },
    "target_classes": ["person"],
    "confidence_th": {
        "person": 0.8
    },
    "confidence_model": {
        "person": { "kind": "aggregate_cov"},
        "default": { "kind": "aggregate_cov"}
    },
    "output_map": {
        "person" : "person"
    },
    "color": {
        "person": "green",
        "default": "blue"
    },
    "postproc_classes": ["perosn"],
    "image_height": 720,
    "image_width": 720,
    "stride": 16
}

I am not able to see even a single bounding box on the test images. I have also generated the following files and tested them on DS-3, but I am not getting a bounding box on a single frame:
resnet18_detector.etlt
calibration.bin
calibration.tensor
Please help me find where I am going wrong.
I also trained a model using resnet18 last time, and that time I was getting results even though the training precision was 58. This time the precision is 82+, but I am not getting any results.

Thanks.

Hi pritam,
I looked at your training spec:

output_image_width: 1280
output_image_height: 720

But the JSON file detectnet_v2_clusterfile_kitti.json that you use for inference has:

"image_height": 720,
"image_width": 720,

Could you please make these match and check again?
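
A quick way to catch this kind of mismatch is to compare the clusterfile against the training spec before running inference. The sketch below is only an illustration: check_clusterfile is a hypothetical helper, not part of TLT, and the key names are simply the ones that appear in the JSON posted above. It also compares the class lists, since a misspelled class name is easy to miss.

```python
import json


def check_clusterfile(cluster_cfg, train_width, train_height, train_classes):
    """Return a list of mismatches between a clusterfile dict and the training spec values."""
    problems = []
    # The image size used at inference should match the preprocessing size used in training.
    for key, expected in (("image_width", train_width), ("image_height", train_height)):
        actual = cluster_cfg.get(key)
        if actual != expected:
            problems.append("%s is %r, training spec uses %r" % (key, actual, expected))
    # Every class listed in the clusterfile should be a class the model was trained on.
    for key in ("target_classes", "postproc_classes"):
        for name in cluster_cfg.get(key, []):
            if name not in train_classes:
                problems.append("%s contains unknown class %r" % (key, name))
    return problems


# Relevant subset of the clusterfile posted in this thread.
cluster_cfg = json.loads("""
{
    "target_classes": ["person"],
    "postproc_classes": ["perosn"],
    "image_height": 720,
    "image_width": 720
}
""")

# Values taken from the training spec above: 1280x720, single class "person".
problems = check_clusterfile(cluster_cfg, 1280, 720, {"person"})
for problem in problems:
    print(problem)
```

On the files posted in this thread, this reports the image_width mismatch and the "perosn" entry in postproc_classes. Whether that entry affects tlt-infer is not confirmed here, but it is worth correcting along with the image size.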

Actually, morganh, I had also tested with 1280x720, but since I was not getting results I tried changing it to 720x720 and other sizes.

But even leaving the JSON aside, it should work on DS, and there too it is not working: nothing is detected.

First, you should try the “tlt-infer” command instead of DS.
Run “tlt-infer” to see whether you get bounding boxes for the images.

Yes, I tried that, but in the tlt_infer_testing folder all the images were without bounding boxes.

Could you please check your tlt-infer command again?

Also, could you please change confidence_th and try?

Yes, Morgan, I have tried it now.

# Running inference for detection on n images
!tlt-infer detectnet_v2 -i $USER_EXPERIMENT_DIR/data/testing/image_2 \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -m $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.tlt \
                        -cp $SPECS_DIR/detectnet_v2_clusterfile_kitti.json \
                        -k $KEY \
                        --kitti_dump \
                        -lw 3 \
                        -g 0 \
                        -bs 64

But I am getting nothing, no bounding boxes.
Is there a problem with the training, or is it something else?
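
One way to confirm that no boxes were produced at all, rather than just not drawn on the images, is to count the lines in the label files written by --kitti_dump. This is a minimal sketch, assuming the labels are plain-text KITTI files with one box per line; count_boxes and summarize are hypothetical helpers, and the location of the label folder (for example a labels subfolder under the -o directory) is an assumption to adjust to wherever the files actually are.

```python
import os


def count_boxes(label_dir):
    """Map each .txt label file in label_dir to the number of non-empty lines (boxes) in it."""
    counts = {}
    for name in sorted(os.listdir(label_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(label_dir, name)) as handle:
            counts[name] = sum(1 for line in handle if line.strip())
    return counts


def summarize(counts):
    """Return (total number of boxes, number of label files with no boxes)."""
    total = sum(counts.values())
    empty = sum(1 for n in counts.values() if n == 0)
    return total, empty
```

Calling summarize(count_boxes(path_to_labels)) gives the total number of boxes and the number of empty label files. A total of zero across all files would point at the model or the clustering settings rather than at the drawing step.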

Yes, I changed confidence_th from 0.8 to 0.5, but I get the same issue.

Could you please change the image folder you run inference on?
Just change the folder below to the one you used for training.

$USER_EXPERIMENT_DIR/data/testing/image_2

Yes, I did this as well, but I am not getting results even on the training images.

You also need to check whether $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.tlt

is the exact model you just trained and got 82% mAP with.

Yes, it is the same.
I also had doubts about it, so I ran pruning three times and checked, but got the same issue.

When you mention 82%, did you run training or retraining?

It was training.

OK, so please check your training log and use the exact tlt model you have trained to do inference.
By default, the tlt model should be experiment_dir_unpruned/weights/resnet18_detector.tlt

Please try with it again.

If you do tlt-prune, you will get $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.tlt .
But as mentioned in the TLT user guide, it is necessary to run retraining against the pruned tlt model.

OK, morganh, I will retrain the model on the pruned weights as described in the detectnet_v2_retrain_resnet18_kitti.txt file. But I have one concern: the last time I trained my model it gave detection results without any retraining, and now it does not. Why? I do not understand this. If you find any other clue, please let me know.
I am very confused.

Morganh, I also wanted to ask one thing. While training was running, I saw mAP ~88.0 at epoch 501, mAP ~72.0 at epoch 601, and mAP ~82.0 at epoch 3500. Why? Also, can we use the weights from model.step-41040.tlt (epoch 501, ~step 41040) in place of experiment_dir_unpruned/weights/resnet18_detector.tlt, then prune that and test?

Hi pritam,
You can quickly do tlt-infer against the model you already generated.
experiment_dir_unpruned/weights/resnet18_detector.tlt

It is an unpruned tlt model and, as you mentioned, it reaches 82% mAP.
During training, it is common for mAP to fluctuate.
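
If you want to pick an intermediate checkpoint, it helps to work out which epoch a given step belongs to. The sketch below uses only numbers from this thread (final step 399000 at epoch 3500) and assumes the step counter started at 0 and every epoch had the same number of steps, which may not hold if training was resumed with a different batch size or GPU count. The function names are hypothetical.

```python
def steps_per_epoch(final_step, num_epochs):
    """Steps in one epoch, assuming a constant number of steps per epoch."""
    if final_step % num_epochs != 0:
        raise ValueError("final_step is not a whole multiple of num_epochs")
    return final_step // num_epochs


def epoch_of_step(step, per_epoch):
    """Return (completed epochs, leftover steps) for a checkpoint step."""
    return divmod(step, per_epoch)


per_epoch = steps_per_epoch(399000, 3500)   # from the last checkpoint in the log
print(per_epoch)                            # 114
print(epoch_of_step(41040, per_epoch))      # (360, 0)
print(500 * per_epoch)                      # 57000
```

Under those assumptions, model.step-41040.tlt would be the checkpoint saved after epoch 360, and the checkpoint closest to epoch 501 would be step 57000. It is worth confirming in the training log which checkpoint actually corresponds to the validation result you want to keep.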

Hi morganh,
Actually, I am getting good results using tlt-infer for some images, but then I get an error like this:

File "/usr/local/bin/tlt-infer", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_infer.py", line 35, in main
  File "./detectnet_v2/scripts/inference.py", line 222, in main
  File "./detectnet_v2/scripts/inference.py", line 180, in inference_wrapper_batch
  File "./detectnet_v2/inferencer/tlt_inferencer.py", line 123, in infer_batch
  File "./detectnet_v2/inferencer/base_inferencer.py", line 107, in input_preprocessing
ValueError: axes don't match array
77it [01:39,  1.29s/it]

I have seen your answer from https://devtalk.nvidia.com/default/topic/1067152/transfer-learning-toolkit/valueerror-axes-don-t-match-array/post/5412235/#5412235

With batch size 64, I get fewer outputs in the tlt_infer_testing folder:

# Running inference for detection on n images
!tlt-infer detectnet_v2 -i $USER_EXPERIMENT_DIR/data/training/image_2 \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                        -cp $SPECS_DIR/detectnet_v2_clusterfile_kitti.json \
                        -k $KEY \
                        --kitti_dump \
                        -lw 3 \
                        -g 0 \
                        -bs 64

With batch size 32, I get more output samples than with 64, and then I get the error mentioned above:

# Running inference for detection on n images
!tlt-infer detectnet_v2 -i $USER_EXPERIMENT_DIR/data/training/image_2 \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                        -cp $SPECS_DIR/detectnet_v2_clusterfile_kitti.json \
                        -k $KEY \
                        --kitti_dump \
                        -lw 3 \
                        -g 0 \
                        -bs 32

With batch size 16, I get more output samples than with 32, and then I get the error mentioned above:

# Running inference for detection on n images
!tlt-infer detectnet_v2 -i $USER_EXPERIMENT_DIR/data/training/image_2 \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                        -cp $SPECS_DIR/detectnet_v2_clusterfile_kitti.json \
                        -k $KEY \
                        --kitti_dump \
                        -lw 3 \
                        -g 0 \
                        -bs 16

All the images are in the same format.
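
"ValueError: axes don't match array" is the message NumPy raises when transpose is given a different number of axes than the array has, for example when a 2-D grayscale image is reordered as if it had three channels. Smaller batches getting further before the failure would be consistent with one particular image triggering it, although that is a guess and not confirmed here. Files with the same extension can still differ in channel count, so a sketch like the following can list JPEGs whose header does not report 3 channels. It reads the JPEG frame header directly using only the standard library; jpeg_shape and find_odd_images are hypothetical helper names.

```python
import os
import struct

# Start-of-frame markers whose segment holds the image dimensions and channel count.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_shape(data):
    """Return (width, height, channels) from JPEG bytes, or None if no frame header is found."""
    if data[:2] != b"\xff\xd8":
        return None  # missing start-of-image marker, so not a JPEG
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # lost track of the marker structure
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1  # padding byte before a marker
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 10 > len(data):
                return None  # truncated frame header
            _precision, height, width, channels = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, channels
        pos += 2 + length  # skip the marker and its segment
    return None


def find_odd_images(image_dir, expected_channels=3):
    """Map file name to shape for JPEGs that are unreadable or not expected_channels."""
    odd = {}
    for name in sorted(os.listdir(image_dir)):
        if not name.lower().endswith((".jpg", ".jpeg")):
            continue
        with open(os.path.join(image_dir, name), "rb") as handle:
            shape = jpeg_shape(handle.read())
        if shape is None or shape[2] != expected_channels:
            odd[name] = shape
    return odd
```

Running find_odd_images on the folder passed to -i returns an empty dict if every JPEG reports 3 channels. Any file it lists is a candidate for the image that breaks the batch.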