Accelerating PeopleNet with TLT for Jetson Nano

Hello,

I'm trying to retrain a PeopleNet model on custom data. I followed the guidelines in https://devblogs.nvidia.com/training-custom-pretrained-models-using-tlt/ and successfully trained my PeopleNet and tested it with DeepStream on a Jetson Nano. With the DeepStream PeopleNet sample

$ /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/deepstream_app_source1_peoplenet.txt

it runs at 10 fps, but none of my retrained versions reaches that frame rate.

I retrained the unpruned PeopleNet version with 34 layers and it ran at 2.5 fps. When I changed 34 layers to 18 (as proposed in the NGC PeopleNet sample), I got 4.5 fps.
Even when I took the pruned version, only converted it, and ran it with DeepStream, I got just 6 fps.

I know that pruning optimizes the network, but how low does the pruning threshold need to be to make it that fast? Or could it be due to the resolution I passed (1280x720)? I don't think so.

Could you explain how to get my retrained PeopleNet model to 10 fps on the Jetson Nano?

Firstly, have you boosted the clocks?

$ sudo -s nvpmodel -m 0

$ sudo jetson_clocks
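
You can also double-check that the boosted mode actually took effect (a quick sanity check with the standard JetPack tools):

$ sudo nvpmodel -q            # should report the MAXN (mode 0) profile
$ sudo jetson_clocks --show   # prints the current CPU/GPU/EMC clock settings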

Yes, I boosted the clocks, but the fps of my TLT PeopleNet still did not increase.

Hi @vladimir.zaigrajew
I am also trying to retrain the PeopleNet model, but I am getting an "illegal instruction (core dumped)" error. I already raised a question about it on the forum; the link is below:

I would also like to know how to deploy and test the pruned model on the Jetson Nano for inference.
Could you please help me with that? Thanks in advance :)

@rajaneconsys
For the "illegal instruction" error, please try another host PC if possible.

@vladimir.zaigrajew
PeopleNet is trained at 960 x 544 with a ResNet-34 backbone.
If possible, resize your images/labels to 960x544 and retry.
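For example, if your frames are all 1280x720, something roughly like this would do it (only a sketch: it assumes ImageMagick is available, KITTI-format labels with the bbox in columns 5-8, and placeholder paths; note that the aspect ratio is not preserved):

# resize every training image in place to exactly 960x544
mogrify -resize '960x544!' /path/to/training/images/*.png

# scale the KITTI bbox coordinates (columns 5-8) by 960/1280 and 544/720
for f in /path/to/training/labels/*.txt; do
  awk '{ $5 *= 960/1280; $6 *= 544/720; $7 *= 960/1280; $8 *= 544/720; print }' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done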
Also, pruning plays an important role. Please check the size of the released pruned PeopleNet model and try to prune your current model down to a similar size. Remember to retrain after pruning.
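Roughly, the prune-then-retrain flow looks like this (just a sketch; the paths, the -pth value, and the model name are placeholders, and the retrain spec should point pretrained_model_file at the pruned .tlt and, if I recall the docs correctly, set load_graph: true in model_config):

# prune the trained model (the threshold here is only an example)
tlt-prune -m /workspace/tlt-experiments/output/weights/model.tlt \
          -o /workspace/tlt-experiments/output/model_pruned.tlt \
          -pth 0.005 -eq union -k tlt_encode

# retrain the pruned model so it recovers accuracy, using a retrain spec
# that points at the pruned .tlt as the pretrained model
tlt-train detectnet_v2 -e /workspace/tlt-experiments/specs/retrain_spec.txt \
                       -r /workspace/tlt-experiments/output_retrain \
                       -k tlt_encode -n resnet34_peoplenet_pruned --gpus 1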

I trained PeopleNet on my dataset at 960x544 with ResNet-34, and on the Jetson I got even worse performance: 3.8-4.0 fps.
I pruned with -pth 0.005 (3.8 fps) and 0.5 (4.0 fps). My spec file is almost the same as in https://devblogs.nvidia.com/training-custom-pretrained-models-using-tlt/

random_seed: 42

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tlt-experiments/data_resolution/tfrecords/kitti_trainval/kitti_trainval*"
    image_directory_path: "/workspace/tlt-experiments/data_resolution/training"
  }
  image_extension: "png"
  target_class_mapping {
    key: "emplo"
    value: "emplo"
  }
  validation_fold: 0
}


model_config {
  pretrained_model_file: "/workspace/tlt-experiments/people_milk/pretrained_peoplenet/tlt_peoplenet_vunpruned_v1.0/resnet34_peoplenet.tlt"
  num_layers: 34
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size_per_gpu: 12
  num_epochs: 40
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 0.0005
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 9.9e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}

cost_function_config {
  target_classes {
    name: "emplo"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}

bbox_rasterizer_config {
  target_class_config {
    key: "emplo"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

postprocessing_config {
  target_class_config {
    key: "emplo"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.265
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
}


evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 10
  minimum_detection_ground_truth_overlap {
    key: "emplo"
    value: 0.5
  }
  evaluation_box_config {
    key: "emplo"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}

I found out that the pruned version from NGC runs at 10.5 fps, so something must be wrong with my training or pruning.

What is the size of your trained pruned tlt model? Please run "ll -sh xxx"

And also could you please paste the full log when you run tlt-prune?

The fps always stays the same.

How about the fps if you prune more?
Also, could you please share how you tested the fps?

For your reference, I ran the experiment below. It shows:

  1. the fps of the unpruned tlt model: about 3.9 fps
  2. the fps of the pruned (pth 0.005) tlt model: about 10 fps

root@1240072:/workspace/tlt-experiments# cat tlt-export_fp16_unpruned.sh
tlt-export detectnet_v2 -m resnet34_peoplenet.tlt \
  -k tlt_encode \
  -o resnet34_peoplenet.etlt \
  --data_type fp16

root@1240072:/workspace/tlt-experiments# sh tlt-export_fp16_unpruned.sh
Using TensorFlow backend.
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
2020-06-15 16:11:10,726 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
2020-06-15 16:11:15,902 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.

root@1240072:/workspace/tlt-experiments# cat prune_0.005.sh
tlt-prune -m resnet34_peoplenet.tlt -o resnet34_peoplenet_0.005.tlt -pth 0.005 -eq union -k tlt_encode

root@1240072:/workspace/tlt-experiments# sh prune_0.005.sh
Using TensorFlow backend.
2020-06-15 16:18:48,488 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2020-06-15 16:18:50,906 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2020-06-15 16:20:55,067 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 0.143708735585

root@1240072:/workspace/tlt-experiments# cat tlt-export_fp16_0.005.sh
tlt-export detectnet_v2 -m resnet34_peoplenet_0.005.tlt \
  -k tlt_encode \
  -o resnet34_peoplenet_0.005.etlt \
  --data_type fp16

root@1240072:/workspace/tlt-experiments# sh tlt-export_fp16_0.005.sh
Using TensorFlow backend.
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
2020-06-15 16:24:57,321 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
2020-06-15 16:25:02,810 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.

In nano,

$ cat generate_fp16_engine_unpruned.sh
tlt-converter resnet34_peoplenet.etlt \
  -k tlt_encode \
  -o output_bbox/BiasAdd,output_cov/Sigmoid \
  -d 3,544,960 \
  -i nchw \
  -m 8 -t fp16 \
  -e resnet34_peoplenet.engine \
  -b 64 \
  -w 1000000000

$ sh generate_fp16_engine_unpruned.sh

$ /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait
[06/16/2020-01:34:41] [I] === Model Options ===
[06/16/2020-01:34:41] [I] Format: *
[06/16/2020-01:34:41] [I] Model:
[06/16/2020-01:34:41] [I] Output:
[06/16/2020-01:34:41] [I] === Build Options ===
[06/16/2020-01:34:41] [I] Max batch: 1
[06/16/2020-01:34:41] [I] Workspace: 16 MB
[06/16/2020-01:34:41] [I] minTiming: 1
[06/16/2020-01:34:41] [I] avgTiming: 8
[06/16/2020-01:34:41] [I] Precision: FP32+FP16
[06/16/2020-01:34:41] [I] Calibration:
[06/16/2020-01:34:41] [I] Safe mode: Disabled
[06/16/2020-01:34:41] [I] Save engine:
[06/16/2020-01:34:41] [I] Load engine: resnet34_peoplenet.engine
[06/16/2020-01:34:41] [I] Builder Cache: Enabled
[06/16/2020-01:34:41] [I] NVTX verbosity: 0
[06/16/2020-01:34:41] [I] Inputs format: fp32:CHW
[06/16/2020-01:34:41] [I] Outputs format: fp32:CHW
[06/16/2020-01:34:41] [I] Input build shapes: model
[06/16/2020-01:34:41] [I] Input calibration shapes: model
[06/16/2020-01:34:41] [I] === System Options ===
[06/16/2020-01:34:41] [I] Device: 0
[06/16/2020-01:34:41] [I] DLACore:
[06/16/2020-01:34:41] [I] Plugins:
[06/16/2020-01:34:41] [I] === Inference Options ===
[06/16/2020-01:34:41] [I] Batch: 1
[06/16/2020-01:34:41] [I] Input inference shapes: model
[06/16/2020-01:34:41] [I] Iterations: 10
[06/16/2020-01:34:41] [I] Duration: 3s (+ 200ms warm up)
[06/16/2020-01:34:41] [I] Sleep time: 0ms
[06/16/2020-01:34:41] [I] Streams: 1
[06/16/2020-01:34:41] [I] ExposeDMA: Disabled
[06/16/2020-01:34:41] [I] Spin-wait: Enabled
[06/16/2020-01:34:41] [I] Multithreading: Disabled
[06/16/2020-01:34:41] [I] CUDA Graph: Disabled
[06/16/2020-01:34:41] [I] Skip inference: Disabled
[06/16/2020-01:34:41] [I] Inputs:
[06/16/2020-01:34:41] [I] === Reporting Options ===
[06/16/2020-01:34:41] [I] Verbose: Disabled
[06/16/2020-01:34:41] [I] Averages: 10 inferences
[06/16/2020-01:34:41] [I] Percentile: 99
[06/16/2020-01:34:41] [I] Dump output: Disabled
[06/16/2020-01:34:41] [I] Profile: Disabled
[06/16/2020-01:34:41] [I] Export timing to JSON file:
[06/16/2020-01:34:41] [I] Export output to JSON file:
[06/16/2020-01:34:41] [I] Export profile to JSON file:
[06/16/2020-01:34:41] [I]
[06/16/2020-01:34:46] [I] Starting inference threads
[06/16/2020-01:34:50] [I] Warmup completed 1 queries over 200 ms
[06/16/2020-01:34:50] [I] Timing trace has 14 queries over 3.58136 s
[06/16/2020-01:34:50] [I] Trace averages of 10 runs:
[06/16/2020-01:34:50] [I] Average on 10 runs - GPU latency: 255.143 ms - Host latency: 255.79 ms (end to end 255.796 ms)
[06/16/2020-01:34:50] [I] Host latency
[06/16/2020-01:34:50] [I] min: 255.455 ms (end to end 255.461 ms)
[06/16/2020-01:34:50] [I] max: 255.986 ms (end to end 255.992 ms)
[06/16/2020-01:34:50] [I] mean: 255.805 ms (end to end 255.811 ms)
[06/16/2020-01:34:50] [I] median: 255.826 ms (end to end 255.833 ms)
[06/16/2020-01:34:50] [I] percentile: 255.986 ms at 99% (end to end 255.992 ms at 99%)
[06/16/2020-01:34:50] [I] throughput: 3.90913 qps
[06/16/2020-01:34:50] [I] walltime: 3.58136 s
[06/16/2020-01:34:50] [I] GPU Compute
[06/16/2020-01:34:50] [I] min: 254.812 ms
[06/16/2020-01:34:50] [I] max: 255.337 ms
[06/16/2020-01:34:50] [I] mean: 255.159 ms
[06/16/2020-01:34:50] [I] median: 255.178 ms
[06/16/2020-01:34:50] [I] percentile: 255.337 ms at 99%
[06/16/2020-01:34:50] [I] total compute time: 3.57223 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait

==> the fps is about 3.9
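
For a single-stream, batch-1 run like this, the fps is just the inverse of the mean latency:

$ awk 'BEGIN { print 1000 / 255.159 }'   # ~255 ms mean GPU latency  ->  ~3.92 fps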

$ cat generate_fp16_engine_0.005.sh
tlt-converter-7.1 resnet34_peoplenet_0.005.etlt \
  -k tlt_encode \
  -o output_bbox/BiasAdd,output_cov/Sigmoid \
  -d 3,544,960 \
  -i nchw \
  -m 8 -t fp16 \
  -e resnet34_peoplenet_0.005.engine \
  -b 64 \
  -w 1000000000

$ sh generate_fp16_engine_0.005.sh

$ /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait
[06/16/2020-01:57:53] [I] === Model Options ===
[06/16/2020-01:57:53] [I] Format: *
[06/16/2020-01:57:53] [I] Model:
[06/16/2020-01:57:53] [I] Output:
[06/16/2020-01:57:53] [I] === Build Options ===
[06/16/2020-01:57:53] [I] Max batch: 1
[06/16/2020-01:57:53] [I] Workspace: 16 MB
[06/16/2020-01:57:53] [I] minTiming: 1
[06/16/2020-01:57:53] [I] avgTiming: 8
[06/16/2020-01:57:53] [I] Precision: FP32+FP16
[06/16/2020-01:57:53] [I] Calibration:
[06/16/2020-01:57:53] [I] Safe mode: Disabled
[06/16/2020-01:57:53] [I] Save engine:
[06/16/2020-01:57:53] [I] Load engine: resnet34_peoplenet_0.005.engine
[06/16/2020-01:57:53] [I] Builder Cache: Enabled
[06/16/2020-01:57:53] [I] NVTX verbosity: 0
[06/16/2020-01:57:53] [I] Inputs format: fp32:CHW
[06/16/2020-01:57:53] [I] Outputs format: fp32:CHW
[06/16/2020-01:57:53] [I] Input build shapes: model
[06/16/2020-01:57:53] [I] Input calibration shapes: model
[06/16/2020-01:57:53] [I] === System Options ===
[06/16/2020-01:57:53] [I] Device: 0
[06/16/2020-01:57:53] [I] DLACore:
[06/16/2020-01:57:53] [I] Plugins:
[06/16/2020-01:57:53] [I] === Inference Options ===
[06/16/2020-01:57:53] [I] Batch: 1
[06/16/2020-01:57:53] [I] Input inference shapes: model
[06/16/2020-01:57:53] [I] Iterations: 10
[06/16/2020-01:57:53] [I] Duration: 3s (+ 200ms warm up)
[06/16/2020-01:57:53] [I] Sleep time: 0ms
[06/16/2020-01:57:53] [I] Streams: 1
[06/16/2020-01:57:53] [I] ExposeDMA: Disabled
[06/16/2020-01:57:53] [I] Spin-wait: Enabled
[06/16/2020-01:57:53] [I] Multithreading: Disabled
[06/16/2020-01:57:53] [I] CUDA Graph: Disabled
[06/16/2020-01:57:53] [I] Skip inference: Disabled
[06/16/2020-01:57:53] [I] Inputs:
[06/16/2020-01:57:53] [I] === Reporting Options ===
[06/16/2020-01:57:53] [I] Verbose: Disabled
[06/16/2020-01:57:53] [I] Averages: 10 inferences
[06/16/2020-01:57:53] [I] Percentile: 99
[06/16/2020-01:57:53] [I] Dump output: Disabled
[06/16/2020-01:57:53] [I] Profile: Disabled
[06/16/2020-01:57:53] [I] Export timing to JSON file:
[06/16/2020-01:57:53] [I] Export output to JSON file:
[06/16/2020-01:57:53] [I] Export profile to JSON file:
[06/16/2020-01:57:53] [I]
[06/16/2020-01:57:57] [I] Starting inference threads
[06/16/2020-01:58:01] [I] Warmup completed 2 queries over 200 ms
[06/16/2020-01:58:01] [I] Timing trace has 32 queries over 3.22652 s
[06/16/2020-01:58:01] [I] Trace averages of 10 runs:
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.122 ms - Host latency: 100.763 ms (end to end 100.769 ms)
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.161 ms - Host latency: 100.803 ms (end to end 100.809 ms)
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.253 ms - Host latency: 100.896 ms (end to end 100.902 ms)
[06/16/2020-01:58:01] [I] Host latency
[06/16/2020-01:58:01] [I] min: 100.557 ms (end to end 100.562 ms)
[06/16/2020-01:58:01] [I] max: 102.077 ms (end to end 102.084 ms)
[06/16/2020-01:58:01] [I] mean: 100.822 ms (end to end 100.828 ms)
[06/16/2020-01:58:01] [I] median: 100.73 ms (end to end 100.737 ms)
[06/16/2020-01:58:01] [I] percentile: 102.077 ms at 99% (end to end 102.084 ms at 99%)
[06/16/2020-01:58:01] [I] throughput: 9.91782 qps
[06/16/2020-01:58:01] [I] walltime: 3.22652 s
[06/16/2020-01:58:01] [I] GPU Compute
[06/16/2020-01:58:01] [I] min: 99.9324 ms
[06/16/2020-01:58:01] [I] max: 101.429 ms
[06/16/2020-01:58:01] [I] mean: 100.181 ms
[06/16/2020-01:58:01] [I] median: 100.091 ms
[06/16/2020-01:58:01] [I] percentile: 101.429 ms at 99%
[06/16/2020-01:58:01] [I] total compute time: 3.20579 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait

==> the fps is about 10

Still the same

# export int8 version
!tlt-export detectnet_v2 -m $USER_EXPERIMENT_DIR/habbof_dir_retrain/weights/model.tlt \
        -o $USER_EXPERIMENT_DIR/export/habbof_peoplenet_0005_8.etlt \
        -e $SPECS_DIR/peoplenet_retrain_kitti.txt -k tlt_encode \
        --cal_image_dir $TEST_DIR --data_type int8 --batch_size 1 --batches 10 \
        --cal_cache_file $USER_EXPERIMENT_DIR/export/habbof_peoplenet_0005.bin \
        --cal_data_file $USER_EXPERIMENT_DIR/export/caldef.tensorfile

# export fp16 version
!tlt-export detectnet_v2 -m $USER_EXPERIMENT_DIR/habbof_dir_retrain/weights/model.tlt \
        -o $USER_EXPERIMENT_DIR/export/habbof_peoplenet_0005_16.etlt \
        -e $SPECS_DIR/peoplenet_retrain_kitti.txt -k tlt_encode --data_type fp16

On the Nano I convert the FP16 model:

./tlt-converter habbof_peoplenet_0005_16.etlt \
  -k tlt_encode \
  -o output_bbox/BiasAdd,output_cov/Sigmoid \
  -d 3,544,960 \
  -i nchw \
  -m 8 -t fp16 \
  -e habbof_00051.engine \
  -b 64 \
  -w 1000000000

and I check the fps the same way you did:

 &&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=habbof_00051.engine --fp16 --batch=1 --useSpinWait
[06/19/2020-16:01:46] [I] === Model Options ===
[06/19/2020-16:01:46] [I] Format: *
[06/19/2020-16:01:46] [I] Model: 
[06/19/2020-16:01:46] [I] Output:
[06/19/2020-16:01:46] [I] === Build Options ===
[06/19/2020-16:01:46] [I] Max batch: 1
[06/19/2020-16:01:46] [I] Workspace: 16 MB
[06/19/2020-16:01:46] [I] minTiming: 1
[06/19/2020-16:01:46] [I] avgTiming: 8
[06/19/2020-16:01:46] [I] Precision: FP32+FP16
[06/19/2020-16:01:46] [I] Calibration: 
[06/19/2020-16:01:46] [I] Safe mode: Disabled
[06/19/2020-16:01:46] [I] Save engine: 
[06/19/2020-16:01:46] [I] Load engine: habbof_00051.engine
[06/19/2020-16:01:46] [I] Builder Cache: Enabled
[06/19/2020-16:01:46] [I] NVTX verbosity: 0
[06/19/2020-16:01:46] [I] Inputs format: fp32:CHW
[06/19/2020-16:01:46] [I] Outputs format: fp32:CHW
[06/19/2020-16:01:46] [I] Input build shapes: model
[06/19/2020-16:01:46] [I] Input calibration shapes: model
[06/19/2020-16:01:46] [I] === System Options ===
[06/19/2020-16:01:46] [I] Device: 0
[06/19/2020-16:01:46] [I] DLACore: 
[06/19/2020-16:01:46] [I] Plugins:
[06/19/2020-16:01:46] [I] === Inference Options ===
[06/19/2020-16:01:46] [I] Batch: 1
[06/19/2020-16:01:46] [I] Input inference shapes: model
[06/19/2020-16:01:46] [I] Iterations: 10
[06/19/2020-16:01:46] [I] Duration: 3s (+ 200ms warm up)
[06/19/2020-16:01:46] [I] Sleep time: 0ms
[06/19/2020-16:01:46] [I] Streams: 1
[06/19/2020-16:01:46] [I] ExposeDMA: Disabled
[06/19/2020-16:01:46] [I] Spin-wait: Enabled
[06/19/2020-16:01:46] [I] Multithreading: Disabled
[06/19/2020-16:01:46] [I] CUDA Graph: Disabled
[06/19/2020-16:01:46] [I] Skip inference: Disabled
[06/19/2020-16:01:46] [I] Inputs:
[06/19/2020-16:01:46] [I] === Reporting Options ===
[06/19/2020-16:01:46] [I] Verbose: Disabled
[06/19/2020-16:01:46] [I] Averages: 10 inferences
[06/19/2020-16:01:46] [I] Percentile: 99
[06/19/2020-16:01:46] [I] Dump output: Disabled
[06/19/2020-16:01:46] [I] Profile: Disabled
[06/19/2020-16:01:46] [I] Export timing to JSON file: 
[06/19/2020-16:01:46] [I] Export output to JSON file: 
[06/19/2020-16:01:46] [I] Export profile to JSON file: 
[06/19/2020-16:01:46] [I] 
[06/19/2020-16:01:51] [I] Starting inference threads
[06/19/2020-16:01:54] [I] Warmup completed 2 queries over 200 ms
[06/19/2020-16:01:54] [I] Timing trace has 24 queries over 3.3202 s
[06/19/2020-16:01:54] [I] Trace averages of 10 runs:
[06/19/2020-16:01:54] [I] Average on 10 runs - GPU latency: 137.637 ms - Host latency: 138.281 ms (end to end 138.286 ms)
[06/19/2020-16:01:54] [I] Average on 10 runs - GPU latency: 137.653 ms - Host latency: 138.299 ms (end to end 138.304 ms)
[06/19/2020-16:01:54] [I] Host latency
[06/19/2020-16:01:54] [I] min: 137.078 ms (end to end 137.085 ms)
[06/19/2020-16:01:54] [I] max: 138.757 ms (end to end 138.762 ms)
[06/19/2020-16:01:54] [I] mean: 138.335 ms (end to end 138.341 ms)
[06/19/2020-16:01:54] [I] median: 138.552 ms (end to end 138.557 ms)
[06/19/2020-16:01:54] [I] percentile: 138.757 ms at 99% (end to end 138.762 ms at 99%)
[06/19/2020-16:01:54] [I] throughput: 7.22848 qps
[06/19/2020-16:01:54] [I] walltime: 3.3202 s
[06/19/2020-16:01:54] [I] GPU Compute
[06/19/2020-16:01:54] [I] min: 136.436 ms
[06/19/2020-16:01:54] [I] max: 138.111 ms
[06/19/2020-16:01:54] [I] mean: 137.692 ms
[06/19/2020-16:01:54] [I] median: 137.908 ms
[06/19/2020-16:01:54] [I] percentile: 138.111 ms at 99%
[06/19/2020-16:01:54] [I] total compute time: 3.30461 s

I also tried INT8, and thresholds of 0.5 and 0.00005, but it is still the same old 7.3 fps.

The Nano does not support INT8 precision, so there is no need to test an INT8 TensorRT engine.

One question: if you take the official unpruned tlt model, prune it as I mentioned, and then test it with my steps, can you reach 10 fps? I want to check whether you can reproduce my result on your Nano with the same steps.

If I reproduce each step exactly, I get 10 fps, but if I start from my own trained model instead of the default unpruned model, I get only 7.2 fps. I think something is wrong with my train.txt; could you advise me what it is?

random_seed: 42

dataset_config {
    data_sources: {
      tfrecords_path: "/workspace/tlt-experiments/data_resolution/tfrecords/kitti_trainval/kitti_trainval*"
       image_directory_path: "/workspace/tlt-experiments/data_resolution/training"
    }
    image_extension: "png"
      target_class_mapping {
          key: "emplo"
          value: "emplo"
      }
    validation_fold: 0
}

model_config {
  pretrained_model_file: "/workspace/tlt-experiments/people_milk/pretrained_peoplenet/tlt_peoplenet_vunpruned_v1.0/resnet34_peoplenet.tlt"
  num_layers: 34
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}

cost_function_config {
  target_classes {
    name: "emplo"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}

training_config {
  batch_size_per_gpu: 12
  num_epochs: 20
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 0.0005
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 9.9e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 5
}

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}

postprocessing_config{
 target_class_config{
   key: "emplo"
   value: {
     clustering_config {
       coverage_threshold: 0.005
       dbscan_eps: 0.265
       dbscan_min_samples: 0.05
       minimum_bounding_box_height: 4
     }
   }
 }
}

bbox_rasterizer_config {
  target_class_config {
    key: "emplo"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}


evaluation_config {
 validation_period_during_training: 10
 first_validation_epoch: 1
 minimum_detection_ground_truth_overlap {
   key: "emplo"
   value: 0.5
 }
 evaluation_box_config {
   key: "emplo"
   value {
     minimum_height: 20
     maximum_height: 9999
     minimum_width: 4
     maximum_width: 9999
   }
 }
}

What is the size of your pruned model?
I suggest you prune more.

I tried 0.00005 and also 0.5, and it is still the same 7 fps. Now I will try 0.000005 and see, but for now:

Below are all the pruned model sizes.
peoplenet_resnet34_pruned0000005.tlt should be read as peoplenet_resnet34_pruned with threshold 0.000005.

 25M my_peoplenet_0.005.tlt
 68M peoplenet_resnet34_pruned0000005.tlt
 25M peoplenet_resnet34_pruned000005.tlt
 25M peoplenet_resnet34_pruned0005.tlt
3.1M peoplenet_resnet34_pruned05.tlt

How about the fps for the 3.1M tlt model?

There has been no update from you for a while, so we assume this is no longer an issue.
We are therefore closing this topic. If you need further support, please open a new one.
Thanks

Hi vladimir.zaigrajew,

Is this still an issue that needs support? Are there any results you can share? Thanks