Accelerating PeopleNet with TLT for Jetson Nano

Hello,

I'm trying to retrain a PeopleNet model on custom data. I followed the guidelines in https://devblogs.nvidia.com/training-custom-pretrained-models-using-tlt/ and successfully trained my PeopleNet and tested it with DeepStream on a Jetson Nano. With the DeepStream PeopleNet sample

$ /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/deepstream_app_source1_peoplenet.txt

it runs at 10 fps, but none of my retrained versions reaches that frame rate.

I retrained the unpruned PeopleNet version with 34 layers and it ran at 2.5 fps. When I changed 34 layers to 18 (as proposed in the NGC PeopleNet sample), I got 4.5 fps.
Even when I took the pruned version, only converted it, and ran it with DeepStream, I got just 6 fps.

I know that pruning optimizes the network, but how low does the pruning threshold need to be to make it that fast? Or could it be due to the resolution I passed (1280x720)? I don't think so.

Could you explain how to get my retrained PeopleNet model to 10 fps on the Jetson Nano?

Firstly, have you boosted the clocks?

$ sudo -s nvpmodel -m 0

$ sudo jetson_clocks
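
You can also double-check that the boosted mode actually took effect (a quick sanity check with the standard JetPack tools):

$ sudo nvpmodel -q            # should report the MAXN (mode 0) profile
$ sudo jetson_clocks --show   # prints the current CPU/GPU/EMC clock settings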

Yes, I boosted the clocks, but the fps of my TLT PeopleNet still did not increase.

Hi @vladimir.zaigrajew
I am also trying to retrain the PeopleNet model, but I am getting an "illegal instruction (core dumped)" error. I already raised a question about it on the forum; the link is below:

I would also like to know how to deploy and test the pruned model on the Jetson Nano for inference.
Could you please help me with that? Thanks in advance :)

@rajaneconsys
For the "illegal instruction" error, please try another host PC if possible.

@vladimir.zaigrajew
PeopleNet is trained at 960 x 544 with a ResNet-34 backbone.
If possible, resize your images/labels to 960x544 and retry.
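For example, if your frames are all 1280x720, something roughly like this would do it (only a sketch: it assumes ImageMagick is available, KITTI-format labels with the bbox in columns 5-8, and placeholder paths; note that the aspect ratio is not preserved):

# resize every training image in place to exactly 960x544
mogrify -resize '960x544!' /path/to/training/images/*.png

# scale the KITTI bbox coordinates (columns 5-8) by 960/1280 and 544/720
for f in /path/to/training/labels/*.txt; do
  awk '{ $5 *= 960/1280; $6 *= 544/720; $7 *= 960/1280; $8 *= 544/720; print }' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done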
Also, pruning plays an important role. Please check the size of the released pruned PeopleNet model and try to prune your current model down to a similar size. Remember to retrain after pruning.
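Roughly, the prune-then-retrain flow looks like this (just a sketch; the paths, the -pth value, and the model name are placeholders, and the retrain spec should point pretrained_model_file at the pruned .tlt and, if I recall the docs correctly, set load_graph: true in model_config):

# prune the trained model (the threshold here is only an example)
tlt-prune -m /workspace/tlt-experiments/output/weights/model.tlt \
          -o /workspace/tlt-experiments/output/model_pruned.tlt \
          -pth 0.005 -eq union -k tlt_encode

# retrain the pruned model so it recovers accuracy, using a retrain spec
# that points at the pruned .tlt as the pretrained model
tlt-train detectnet_v2 -e /workspace/tlt-experiments/specs/retrain_spec.txt \
                       -r /workspace/tlt-experiments/output_retrain \
                       -k tlt_encode -n resnet34_peoplenet_pruned --gpus 1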

I trained PeopleNet on my dataset at 960x544 with ResNet-34, and on the Jetson I got even worse performance: 3.8-4.0 fps.
I pruned with -pth 0.005 (3.8 fps) and 0.5 (4.0 fps). My spec file is almost the same as in https://devblogs.nvidia.com/training-custom-pretrained-models-using-tlt/

random_seed: 42

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tlt-experiments/data_resolution/tfrecords/kitti_trainval/kitti_trainval*"
    image_directory_path: "/workspace/tlt-experiments/data_resolution/training"
  }
  image_extension: "png"
  target_class_mapping {
    key: "emplo"
    value: "emplo"
  }
  validation_fold: 0
}


model_config {
  pretrained_model_file: "/workspace/tlt-experiments/people_milk/pretrained_peoplenet/tlt_peoplenet_vunpruned_v1.0/resnet34_peoplenet.tlt"
  num_layers: 34
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size_per_gpu: 12
  num_epochs: 40
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 0.0005
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 9.9e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}

cost_function_config {
  target_classes {
    name: "emplo"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}

bbox_rasterizer_config {
  target_class_config {
    key: "emplo"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

postprocessing_config {
  target_class_config {
    key: "emplo"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.265
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
}


evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 10
  minimum_detection_ground_truth_overlap {
    key: "emplo"
    value: 0.5
  }
  evaluation_box_config {
    key: "emplo"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}

I found out that the pruned version from NGC runs at 10.5 fps, so something must be wrong with my training or pruning.

What is the size of your trained pruned tlt model? Please run "ll -sh xxx"

And also could you please paste the full log when you run tlt-prune?

The fps always stays the same.

How about the fps if you prune more?
Also, could you please share how you tested the fps?

For your reference, I ran the experiment below. It shows:

  1. the fps of the unpruned tlt model: about 3.9 fps
  2. the fps of the pruned (pth 0.005) tlt model: about 10 fps

root@1240072:/workspace/tlt-experiments# cat tlt-export_fp16_unpruned.sh
tlt-export detectnet_v2 -m resnet34_peoplenet.tlt \
  -k tlt_encode \
  -o resnet34_peoplenet.etlt \
  --data_type fp16

root@1240072:/workspace/tlt-experiments# sh tlt-export_fp16_unpruned.sh
Using TensorFlow backend.
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
2020-06-15 16:11:10,726 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
2020-06-15 16:11:15,902 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.

root@1240072:/workspace/tlt-experiments# cat prune_0.005.sh
tlt-prune -m resnet34_peoplenet.tlt -o resnet34_peoplenet_0.005.tlt -pth 0.005 -eq union -k tlt_encode

root@1240072:/workspace/tlt-experiments# sh prune_0.005.sh
Using TensorFlow backend.
2020-06-15 16:18:48,488 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2020-06-15 16:18:50,906 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2020-06-15 16:20:55,067 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 0.143708735585

root@1240072:/workspace/tlt-experiments# cat tlt-export_fp16_0.005.sh
tlt-export detectnet_v2 -m resnet34_peoplenet_0.005.tlt \
  -k tlt_encode \
  -o resnet34_peoplenet_0.005.etlt \
  --data_type fp16

root@1240072:/workspace/tlt-experiments# sh tlt-export_fp16_0.005.sh
Using TensorFlow backend.
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
2020-06-15 16:24:57,321 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
2020-06-15 16:25:02,810 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.

In nano,

$ cat generate_fp16_engine_unpruned.sh
tlt-converter resnet34_peoplenet.etlt \
  -k tlt_encode \
  -o output_bbox/BiasAdd,output_cov/Sigmoid \
  -d 3,544,960 \
  -i nchw \
  -m 8 -t fp16 \
  -e resnet34_peoplenet.engine \
  -b 64 \
  -w 1000000000

$ sh generate_fp16_engine_unpruned.sh

$ /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait
[06/16/2020-01:34:41] [I] === Model Options ===
[06/16/2020-01:34:41] [I] Format: *
[06/16/2020-01:34:41] [I] Model:
[06/16/2020-01:34:41] [I] Output:
[06/16/2020-01:34:41] [I] === Build Options ===
[06/16/2020-01:34:41] [I] Max batch: 1
[06/16/2020-01:34:41] [I] Workspace: 16 MB
[06/16/2020-01:34:41] [I] minTiming: 1
[06/16/2020-01:34:41] [I] avgTiming: 8
[06/16/2020-01:34:41] [I] Precision: FP32+FP16
[06/16/2020-01:34:41] [I] Calibration:
[06/16/2020-01:34:41] [I] Safe mode: Disabled
[06/16/2020-01:34:41] [I] Save engine:
[06/16/2020-01:34:41] [I] Load engine: resnet34_peoplenet.engine
[06/16/2020-01:34:41] [I] Builder Cache: Enabled
[06/16/2020-01:34:41] [I] NVTX verbosity: 0
[06/16/2020-01:34:41] [I] Inputs format: fp32:CHW
[06/16/2020-01:34:41] [I] Outputs format: fp32:CHW
[06/16/2020-01:34:41] [I] Input build shapes: model
[06/16/2020-01:34:41] [I] Input calibration shapes: model
[06/16/2020-01:34:41] [I] === System Options ===
[06/16/2020-01:34:41] [I] Device: 0
[06/16/2020-01:34:41] [I] DLACore:
[06/16/2020-01:34:41] [I] Plugins:
[06/16/2020-01:34:41] [I] === Inference Options ===
[06/16/2020-01:34:41] [I] Batch: 1
[06/16/2020-01:34:41] [I] Input inference shapes: model
[06/16/2020-01:34:41] [I] Iterations: 10
[06/16/2020-01:34:41] [I] Duration: 3s (+ 200ms warm up)
[06/16/2020-01:34:41] [I] Sleep time: 0ms
[06/16/2020-01:34:41] [I] Streams: 1
[06/16/2020-01:34:41] [I] ExposeDMA: Disabled
[06/16/2020-01:34:41] [I] Spin-wait: Enabled
[06/16/2020-01:34:41] [I] Multithreading: Disabled
[06/16/2020-01:34:41] [I] CUDA Graph: Disabled
[06/16/2020-01:34:41] [I] Skip inference: Disabled
[06/16/2020-01:34:41] [I] Inputs:
[06/16/2020-01:34:41] [I] === Reporting Options ===
[06/16/2020-01:34:41] [I] Verbose: Disabled
[06/16/2020-01:34:41] [I] Averages: 10 inferences
[06/16/2020-01:34:41] [I] Percentile: 99
[06/16/2020-01:34:41] [I] Dump output: Disabled
[06/16/2020-01:34:41] [I] Profile: Disabled
[06/16/2020-01:34:41] [I] Export timing to JSON file:
[06/16/2020-01:34:41] [I] Export output to JSON file:
[06/16/2020-01:34:41] [I] Export profile to JSON file:
[06/16/2020-01:34:41] [I]
[06/16/2020-01:34:46] [I] Starting inference threads
[06/16/2020-01:34:50] [I] Warmup completed 1 queries over 200 ms
[06/16/2020-01:34:50] [I] Timing trace has 14 queries over 3.58136 s
[06/16/2020-01:34:50] [I] Trace averages of 10 runs:
[06/16/2020-01:34:50] [I] Average on 10 runs - GPU latency: 255.143 ms - Host latency: 255.79 ms (end to end 255.796 ms)
[06/16/2020-01:34:50] [I] Host latency
[06/16/2020-01:34:50] [I] min: 255.455 ms (end to end 255.461 ms)
[06/16/2020-01:34:50] [I] max: 255.986 ms (end to end 255.992 ms)
[06/16/2020-01:34:50] [I] mean: 255.805 ms (end to end 255.811 ms)
[06/16/2020-01:34:50] [I] median: 255.826 ms (end to end 255.833 ms)
[06/16/2020-01:34:50] [I] percentile: 255.986 ms at 99% (end to end 255.992 ms at 99%)
[06/16/2020-01:34:50] [I] throughput: 3.90913 qps
[06/16/2020-01:34:50] [I] walltime: 3.58136 s
[06/16/2020-01:34:50] [I] GPU Compute
[06/16/2020-01:34:50] [I] min: 254.812 ms
[06/16/2020-01:34:50] [I] max: 255.337 ms
[06/16/2020-01:34:50] [I] mean: 255.159 ms
[06/16/2020-01:34:50] [I] median: 255.178 ms
[06/16/2020-01:34:50] [I] percentile: 255.337 ms at 99%
[06/16/2020-01:34:50] [I] total compute time: 3.57223 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait

==> the fps is about 3.9
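
For a single-stream, batch-1 run like this, the fps is just the inverse of the mean latency:

$ awk 'BEGIN { print 1000 / 255.159 }'   # ~255 ms mean GPU latency  ->  ~3.92 fps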

$ cat generate_fp16_engine_0.005.sh
tlt-converter-7.1 resnet34_peoplenet_0.005.etlt \
  -k tlt_encode \
  -o output_bbox/BiasAdd,output_cov/Sigmoid \
  -d 3,544,960 \
  -i nchw \
  -m 8 -t fp16 \
  -e resnet34_peoplenet_0.005.engine \
  -b 64 \
  -w 1000000000

$ sh generate_fp16_engine_0.005.sh

$ /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait
[06/16/2020-01:57:53] [I] === Model Options ===
[06/16/2020-01:57:53] [I] Format: *
[06/16/2020-01:57:53] [I] Model:
[06/16/2020-01:57:53] [I] Output:
[06/16/2020-01:57:53] [I] === Build Options ===
[06/16/2020-01:57:53] [I] Max batch: 1
[06/16/2020-01:57:53] [I] Workspace: 16 MB
[06/16/2020-01:57:53] [I] minTiming: 1
[06/16/2020-01:57:53] [I] avgTiming: 8
[06/16/2020-01:57:53] [I] Precision: FP32+FP16
[06/16/2020-01:57:53] [I] Calibration:
[06/16/2020-01:57:53] [I] Safe mode: Disabled
[06/16/2020-01:57:53] [I] Save engine:
[06/16/2020-01:57:53] [I] Load engine: resnet34_peoplenet_0.005.engine
[06/16/2020-01:57:53] [I] Builder Cache: Enabled
[06/16/2020-01:57:53] [I] NVTX verbosity: 0
[06/16/2020-01:57:53] [I] Inputs format: fp32:CHW
[06/16/2020-01:57:53] [I] Outputs format: fp32:CHW
[06/16/2020-01:57:53] [I] Input build shapes: model
[06/16/2020-01:57:53] [I] Input calibration shapes: model
[06/16/2020-01:57:53] [I] === System Options ===
[06/16/2020-01:57:53] [I] Device: 0
[06/16/2020-01:57:53] [I] DLACore:
[06/16/2020-01:57:53] [I] Plugins:
[06/16/2020-01:57:53] [I] === Inference Options ===
[06/16/2020-01:57:53] [I] Batch: 1
[06/16/2020-01:57:53] [I] Input inference shapes: model
[06/16/2020-01:57:53] [I] Iterations: 10
[06/16/2020-01:57:53] [I] Duration: 3s (+ 200ms warm up)
[06/16/2020-01:57:53] [I] Sleep time: 0ms
[06/16/2020-01:57:53] [I] Streams: 1
[06/16/2020-01:57:53] [I] ExposeDMA: Disabled
[06/16/2020-01:57:53] [I] Spin-wait: Enabled
[06/16/2020-01:57:53] [I] Multithreading: Disabled
[06/16/2020-01:57:53] [I] CUDA Graph: Disabled
[06/16/2020-01:57:53] [I] Skip inference: Disabled
[06/16/2020-01:57:53] [I] Inputs:
[06/16/2020-01:57:53] [I] === Reporting Options ===
[06/16/2020-01:57:53] [I] Verbose: Disabled
[06/16/2020-01:57:53] [I] Averages: 10 inferences
[06/16/2020-01:57:53] [I] Percentile: 99
[06/16/2020-01:57:53] [I] Dump output: Disabled
[06/16/2020-01:57:53] [I] Profile: Disabled
[06/16/2020-01:57:53] [I] Export timing to JSON file:
[06/16/2020-01:57:53] [I] Export output to JSON file:
[06/16/2020-01:57:53] [I] Export profile to JSON file:
[06/16/2020-01:57:53] [I]
[06/16/2020-01:57:57] [I] Starting inference threads
[06/16/2020-01:58:01] [I] Warmup completed 2 queries over 200 ms
[06/16/2020-01:58:01] [I] Timing trace has 32 queries over 3.22652 s
[06/16/2020-01:58:01] [I] Trace averages of 10 runs:
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.122 ms - Host latency: 100.763 ms (end to end 100.769 ms)
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.161 ms - Host latency: 100.803 ms (end to end 100.809 ms)
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.253 ms - Host latency: 100.896 ms (end to end 100.902 ms)
[06/16/2020-01:58:01] [I] Host latency
[06/16/2020-01:58:01] [I] min: 100.557 ms (end to end 100.562 ms)
[06/16/2020-01:58:01] [I] max: 102.077 ms (end to end 102.084 ms)
[06/16/2020-01:58:01] [I] mean: 100.822 ms (end to end 100.828 ms)
[06/16/2020-01:58:01] [I] median: 100.73 ms (end to end 100.737 ms)
[06/16/2020-01:58:01] [I] percentile: 102.077 ms at 99% (end to end 102.084 ms at 99%)
[06/16/2020-01:58:01] [I] throughput: 9.91782 qps
[06/16/2020-01:58:01] [I] walltime: 3.22652 s
[06/16/2020-01:58:01] [I] GPU Compute
[06/16/2020-01:58:01] [I] min: 99.9324 ms
[06/16/2020-01:58:01] [I] max: 101.429 ms
[06/16/2020-01:58:01] [I] mean: 100.181 ms
[06/16/2020-01:58:01] [I] median: 100.091 ms
[06/16/2020-01:58:01] [I] percentile: 101.429 ms at 99%
[06/16/2020-01:58:01] [I] total compute time: 3.20579 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait

==> the fps is about 10

Still the same

# export int8 version
!tlt-export detectnet_v2 -m $USER_EXPERIMENT_DIR/habbof_dir_retrain/weights/model.tlt \
        -o $USER_EXPERIMENT_DIR/export/habbof_peoplenet_0005_8.etlt \
        -e $SPECS_DIR/peoplenet_retrain_kitti.txt -k tlt_encode \
        --cal_image_dir $TEST_DIR --data_type int8 --batch_size 1 --batches 10 \
        --cal_cache_file $USER_EXPERIMENT_DIR/export/habbof_peoplenet_0005.bin \
        --cal_data_file $USER_EXPERIMENT_DIR/export/caldef.tensorfile

# export fp16 version
!tlt-export detectnet_v2 -m $USER_EXPERIMENT_DIR/habbof_dir_retrain/weights/model.tlt \
        -o $USER_EXPERIMENT_DIR/export/habbof_peoplenet_0005_16.etlt \
        -e $SPECS_DIR/peoplenet_retrain_kitti.txt -k tlt_encode --data_type fp16

On the Nano I convert the FP16 model:

./tlt-converter habbof_peoplenet_0005_16.etlt \
  -k tlt_encode \
  -o output_bbox/BiasAdd,output_cov/Sigmoid \
  -d 3,544,960 \
  -i nchw \
  -m 8 -t fp16 \
  -e habbof_00051.engine \
  -b 64 \
  -w 1000000000

and I check the fps the same way you did:

 &&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=habbof_00051.engine --fp16 --batch=1 --useSpinWait
[06/19/2020-16:01:46] [I] === Model Options ===
[06/19/2020-16:01:46] [I] Format: *
[06/19/2020-16:01:46] [I] Model: 
[06/19/2020-16:01:46] [I] Output:
[06/19/2020-16:01:46] [I] === Build Options ===
[06/19/2020-16:01:46] [I] Max batch: 1
[06/19/2020-16:01:46] [I] Workspace: 16 MB
[06/19/2020-16:01:46] [I] minTiming: 1
[06/19/2020-16:01:46] [I] avgTiming: 8
[06/19/2020-16:01:46] [I] Precision: FP32+FP16
[06/19/2020-16:01:46] [I] Calibration: 
[06/19/2020-16:01:46] [I] Safe mode: Disabled
[06/19/2020-16:01:46] [I] Save engine: 
[06/19/2020-16:01:46] [I] Load engine: habbof_00051.engine
[06/19/2020-16:01:46] [I] Builder Cache: Enabled
[06/19/2020-16:01:46] [I] NVTX verbosity: 0
[06/19/2020-16:01:46] [I] Inputs format: fp32:CHW
[06/19/2020-16:01:46] [I] Outputs format: fp32:CHW
[06/19/2020-16:01:46] [I] Input build shapes: model
[06/19/2020-16:01:46] [I] Input calibration shapes: model
[06/19/2020-16:01:46] [I] === System Options ===
[06/19/2020-16:01:46] [I] Device: 0
[06/19/2020-16:01:46] [I] DLACore: 
[06/19/2020-16:01:46] [I] Plugins:
[06/19/2020-16:01:46] [I] === Inference Options ===
[06/19/2020-16:01:46] [I] Batch: 1
[06/19/2020-16:01:46] [I] Input inference shapes: model
[06/19/2020-16:01:46] [I] Iterations: 10
[06/19/2020-16:01:46] [I] Duration: 3s (+ 200ms warm up)
[06/19/2020-16:01:46] [I] Sleep time: 0ms
[06/19/2020-16:01:46] [I] Streams: 1
[06/19/2020-16:01:46] [I] ExposeDMA: Disabled
[06/19/2020-16:01:46] [I] Spin-wait: Enabled
[06/19/2020-16:01:46] [I] Multithreading: Disabled
[06/19/2020-16:01:46] [I] CUDA Graph: Disabled
[06/19/2020-16:01:46] [I] Skip inference: Disabled
[06/19/2020-16:01:46] [I] Inputs:
[06/19/2020-16:01:46] [I] === Reporting Options ===
[06/19/2020-16:01:46] [I] Verbose: Disabled
[06/19/2020-16:01:46] [I] Averages: 10 inferences
[06/19/2020-16:01:46] [I] Percentile: 99
[06/19/2020-16:01:46] [I] Dump output: Disabled
[06/19/2020-16:01:46] [I] Profile: Disabled
[06/19/2020-16:01:46] [I] Export timing to JSON file: 
[06/19/2020-16:01:46] [I] Export output to JSON file: 
[06/19/2020-16:01:46] [I] Export profile to JSON file: 
[06/19/2020-16:01:46] [I] 
[06/19/2020-16:01:51] [I] Starting inference threads
[06/19/2020-16:01:54] [I] Warmup completed 2 queries over 200 ms
[06/19/2020-16:01:54] [I] Timing trace has 24 queries over 3.3202 s
[06/19/2020-16:01:54] [I] Trace averages of 10 runs:
[06/19/2020-16:01:54] [I] Average on 10 runs - GPU latency: 137.637 ms - Host latency: 138.281 ms (end to end 138.286 ms)
[06/19/2020-16:01:54] [I] Average on 10 runs - GPU latency: 137.653 ms - Host latency: 138.299 ms (end to end 138.304 ms)
[06/19/2020-16:01:54] [I] Host latency
[06/19/2020-16:01:54] [I] min: 137.078 ms (end to end 137.085 ms)
[06/19/2020-16:01:54] [I] max: 138.757 ms (end to end 138.762 ms)
[06/19/2020-16:01:54] [I] mean: 138.335 ms (end to end 138.341 ms)
[06/19/2020-16:01:54] [I] median: 138.552 ms (end to end 138.557 ms)
[06/19/2020-16:01:54] [I] percentile: 138.757 ms at 99% (end to end 138.762 ms at 99%)
[06/19/2020-16:01:54] [I] throughput: 7.22848 qps
[06/19/2020-16:01:54] [I] walltime: 3.3202 s
[06/19/2020-16:01:54] [I] GPU Compute
[06/19/2020-16:01:54] [I] min: 136.436 ms
[06/19/2020-16:01:54] [I] max: 138.111 ms
[06/19/2020-16:01:54] [I] mean: 137.692 ms
[06/19/2020-16:01:54] [I] median: 137.908 ms
[06/19/2020-16:01:54] [I] percentile: 138.111 ms at 99%
[06/19/2020-16:01:54] [I] total compute time: 3.30461 s

I also tried INT8, and thresholds of 0.5 and 0.00005, but it is still the same old 7.3 fps.

The Nano does not support INT8 precision, so there is no need to test an INT8 TensorRT engine.

One question: if you take the official unpruned tlt model, prune it as I mentioned, and then test it with my steps, can you reach 10 fps? I want to check whether you can reproduce my result on your Nano with the same steps.

If I reproduce each step exactly, I get 10 fps, but if I start from my own trained model instead of the default unpruned model, I get only 7.2 fps. I think something is wrong with my train.txt; could you advise me what it is?

random_seed: 42

dataset_config {
    data_sources: {
      tfrecords_path: "/workspace/tlt-experiments/data_resolution/tfrecords/kitti_trainval/kitti_trainval*"
       image_directory_path: "/workspace/tlt-experiments/data_resolution/training"
    }
    image_extension: "png"
      target_class_mapping {
          key: "emplo"
          value: "emplo"
      }
    validation_fold: 0
}

model_config {
  pretrained_model_file: "/workspace/tlt-experiments/people_milk/pretrained_peoplenet/tlt_peoplenet_vunpruned_v1.0/resnet34_peoplenet.tlt"
  num_layers: 34
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}

cost_function_config {
  target_classes {
    name: "emplo"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}

training_config {
  batch_size_per_gpu: 12
  num_epochs: 20
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 0.0005
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 9.9e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 5
}

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}

postprocessing_config{
 target_class_config{
   key: "emplo"
   value: {
     clustering_config {
       coverage_threshold: 0.005
       dbscan_eps: 0.265
       dbscan_min_samples: 0.05
       minimum_bounding_box_height: 4
     }
   }
 }
}

bbox_rasterizer_config {
  target_class_config {
    key: "emplo"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}


evaluation_config {
 validation_period_during_training: 10
 first_validation_epoch: 1
 minimum_detection_ground_truth_overlap {
   key: "emplo"
   value: 0.5
 }
 evaluation_box_config {
   key: "emplo"
   value {
     minimum_height: 20
     maximum_height: 9999
     minimum_width: 4
     maximum_width: 9999
   }
 }
}

What is the size of your pruned model?
I suggest you prune more.

I tried 0.00005 and also 0.5, and it is still the same 7 fps. Now I will try 0.000005 and see, but for now:

Below are all the pruned model sizes.
peoplenet_resnet34_pruned0000005.tlt should be read as peoplenet_resnet34_pruned with threshold 0.000005.

 25M my_peoplenet_0.005.tlt
 68M peoplenet_resnet34_pruned0000005.tlt
 25M peoplenet_resnet34_pruned000005.tlt
 25M peoplenet_resnet34_pruned0005.tlt
3.1M peoplenet_resnet34_pruned05.tlt

How about the fps for the 3.1M tlt model?

There has been no update from you for a while, so we assume this is no longer an issue.
We are therefore closing this topic. If you need further support, please open a new one.
Thanks

Hi vladimir.zaigrajew,

Is this still an issue that needs support? Are there any results you can share? Thanks