Accelerating PeopleNet with TLT on Jetson Nano

For your reference, I ran the experiment below. It shows:

  1. the FPS of the unpruned TLT model: about 3.9 fps
  2. the FPS of the TLT model pruned at pth=0.005: about 10 fps

root@1240072:/workspace/tlt-experiments# cat tlt-export_fp16_unpruned.sh
tlt-export detectnet_v2 -m resnet34_peoplenet.tlt \
-k tlt_encode \
-o resnet34_peoplenet.etlt \
--data_type fp16

root@1240072:/workspace/tlt-experiments# sh tlt-export_fp16_unpruned.sh
Using TensorFlow backend.
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
2020-06-15 16:11:10,726 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
2020-06-15 16:11:15,902 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.

root@1240072:/workspace/tlt-experiments# cat prune_0.005.sh
tlt-prune -m resnet34_peoplenet.tlt -o resnet34_peoplenet_0.005.tlt -pth 0.005 -eq union -k tlt_encode

root@1240072:/workspace/tlt-experiments# sh prune_0.005.sh
Using TensorFlow backend.
2020-06-15 16:18:48,488 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2020-06-15 16:18:50,906 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2020-06-15 16:20:55,067 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 0.143708735585
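The reported ratio is the pruned model's size relative to the original. A quick back-of-the-envelope interpretation (plain Python, the number taken from the log above):

```python
# Interpret tlt-prune's "Pruning ratio (pruned model / original model)".
pruning_ratio = 0.143708735585  # value reported in the log above

retained = pruning_ratio * 100        # % of weights kept
removed = (1 - pruning_ratio) * 100   # % of weights pruned away
print(f"retained {retained:.1f}% of weights, removed {removed:.1f}%")
```

In other words, at pth=0.005 the pruned network keeps roughly 14% of the original weights.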

root@1240072:/workspace/tlt-experiments# cat tlt-export_fp16_0.005.sh
tlt-export detectnet_v2 -m resnet34_peoplenet_0.005.tlt \
-k tlt_encode \
-o resnet34_peoplenet_0.005.etlt \
--data_type fp16

root@1240072:/workspace/tlt-experiments# sh tlt-export_fp16_0.005.sh
Using TensorFlow backend.
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
2020-06-15 16:24:57,321 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
2020-06-15 16:25:02,810 [INFO] modulus.export._uff: Modulus patch identity layer in padding inputs.
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.

On the Nano:

$ cat generate_fp16_engine_unpruned.sh
tlt-converter resnet34_peoplenet.etlt \
-k tlt_encode \
-o output_bbox/BiasAdd,output_cov/Sigmoid \
-d 3,544,960 \
-i nchw \
-m 8 -t fp16 \
-e resnet34_peoplenet.engine \
-b 64 \
-w 1000000000

$ sh generate_fp16_engine_unpruned.sh

$ /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait
[06/16/2020-01:34:41] [I] === Model Options ===
[06/16/2020-01:34:41] [I] Format: *
[06/16/2020-01:34:41] [I] Model:
[06/16/2020-01:34:41] [I] Output:
[06/16/2020-01:34:41] [I] === Build Options ===
[06/16/2020-01:34:41] [I] Max batch: 1
[06/16/2020-01:34:41] [I] Workspace: 16 MB
[06/16/2020-01:34:41] [I] minTiming: 1
[06/16/2020-01:34:41] [I] avgTiming: 8
[06/16/2020-01:34:41] [I] Precision: FP32+FP16
[06/16/2020-01:34:41] [I] Calibration:
[06/16/2020-01:34:41] [I] Safe mode: Disabled
[06/16/2020-01:34:41] [I] Save engine:
[06/16/2020-01:34:41] [I] Load engine: resnet34_peoplenet.engine
[06/16/2020-01:34:41] [I] Builder Cache: Enabled
[06/16/2020-01:34:41] [I] NVTX verbosity: 0
[06/16/2020-01:34:41] [I] Inputs format: fp32:CHW
[06/16/2020-01:34:41] [I] Outputs format: fp32:CHW
[06/16/2020-01:34:41] [I] Input build shapes: model
[06/16/2020-01:34:41] [I] Input calibration shapes: model
[06/16/2020-01:34:41] [I] === System Options ===
[06/16/2020-01:34:41] [I] Device: 0
[06/16/2020-01:34:41] [I] DLACore:
[06/16/2020-01:34:41] [I] Plugins:
[06/16/2020-01:34:41] [I] === Inference Options ===
[06/16/2020-01:34:41] [I] Batch: 1
[06/16/2020-01:34:41] [I] Input inference shapes: model
[06/16/2020-01:34:41] [I] Iterations: 10
[06/16/2020-01:34:41] [I] Duration: 3s (+ 200ms warm up)
[06/16/2020-01:34:41] [I] Sleep time: 0ms
[06/16/2020-01:34:41] [I] Streams: 1
[06/16/2020-01:34:41] [I] ExposeDMA: Disabled
[06/16/2020-01:34:41] [I] Spin-wait: Enabled
[06/16/2020-01:34:41] [I] Multithreading: Disabled
[06/16/2020-01:34:41] [I] CUDA Graph: Disabled
[06/16/2020-01:34:41] [I] Skip inference: Disabled
[06/16/2020-01:34:41] [I] Inputs:
[06/16/2020-01:34:41] [I] === Reporting Options ===
[06/16/2020-01:34:41] [I] Verbose: Disabled
[06/16/2020-01:34:41] [I] Averages: 10 inferences
[06/16/2020-01:34:41] [I] Percentile: 99
[06/16/2020-01:34:41] [I] Dump output: Disabled
[06/16/2020-01:34:41] [I] Profile: Disabled
[06/16/2020-01:34:41] [I] Export timing to JSON file:
[06/16/2020-01:34:41] [I] Export output to JSON file:
[06/16/2020-01:34:41] [I] Export profile to JSON file:
[06/16/2020-01:34:41] [I]
[06/16/2020-01:34:46] [I] Starting inference threads
[06/16/2020-01:34:50] [I] Warmup completed 1 queries over 200 ms
[06/16/2020-01:34:50] [I] Timing trace has 14 queries over 3.58136 s
[06/16/2020-01:34:50] [I] Trace averages of 10 runs:
[06/16/2020-01:34:50] [I] Average on 10 runs - GPU latency: 255.143 ms - Host latency: 255.79 ms (end to end 255.796 ms)
[06/16/2020-01:34:50] [I] Host latency
[06/16/2020-01:34:50] [I] min: 255.455 ms (end to end 255.461 ms)
[06/16/2020-01:34:50] [I] max: 255.986 ms (end to end 255.992 ms)
[06/16/2020-01:34:50] [I] mean: 255.805 ms (end to end 255.811 ms)
[06/16/2020-01:34:50] [I] median: 255.826 ms (end to end 255.833 ms)
[06/16/2020-01:34:50] [I] percentile: 255.986 ms at 99% (end to end 255.992 ms at 99%)
[06/16/2020-01:34:50] [I] throughput: 3.90913 qps
[06/16/2020-01:34:50] [I] walltime: 3.58136 s
[06/16/2020-01:34:50] [I] GPU Compute
[06/16/2020-01:34:50] [I] min: 254.812 ms
[06/16/2020-01:34:50] [I] max: 255.337 ms
[06/16/2020-01:34:50] [I] mean: 255.159 ms
[06/16/2020-01:34:50] [I] median: 255.178 ms
[06/16/2020-01:34:50] [I] percentile: 255.337 ms at 99%
[06/16/2020-01:34:50] [I] total compute time: 3.57223 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet.engine --fp16 --batch=1 --useSpinWait

==> the unpruned engine runs at about 3.9 fps
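For batch=1, the qps that trtexec reports is the frame rate, and it matches the reciprocal of the mean end-to-end latency. A quick sanity check in plain Python (numbers copied from the log above):

```python
# For batch=1, throughput (qps) ~= 1000 / mean end-to-end latency (ms).
mean_latency_ms = 255.811       # mean host latency from the trtexec log
fps = 1000.0 / mean_latency_ms
print(f"{fps:.2f} fps")         # consistent with the reported 3.90913 qps
```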

$ cat generate_fp16_engine_0.005.sh
tlt-converter-7.1 resnet34_peoplenet_0.005.etlt \
-k tlt_encode \
-o output_bbox/BiasAdd,output_cov/Sigmoid \
-d 3,544,960 \
-i nchw \
-m 8 -t fp16 \
-e resnet34_peoplenet_0.005.engine \
-b 64 \
-w 1000000000

$ sh generate_fp16_engine_0.005.sh

$ /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait
[06/16/2020-01:57:53] [I] === Model Options ===
[06/16/2020-01:57:53] [I] Format: *
[06/16/2020-01:57:53] [I] Model:
[06/16/2020-01:57:53] [I] Output:
[06/16/2020-01:57:53] [I] === Build Options ===
[06/16/2020-01:57:53] [I] Max batch: 1
[06/16/2020-01:57:53] [I] Workspace: 16 MB
[06/16/2020-01:57:53] [I] minTiming: 1
[06/16/2020-01:57:53] [I] avgTiming: 8
[06/16/2020-01:57:53] [I] Precision: FP32+FP16
[06/16/2020-01:57:53] [I] Calibration:
[06/16/2020-01:57:53] [I] Safe mode: Disabled
[06/16/2020-01:57:53] [I] Save engine:
[06/16/2020-01:57:53] [I] Load engine: resnet34_peoplenet_0.005.engine
[06/16/2020-01:57:53] [I] Builder Cache: Enabled
[06/16/2020-01:57:53] [I] NVTX verbosity: 0
[06/16/2020-01:57:53] [I] Inputs format: fp32:CHW
[06/16/2020-01:57:53] [I] Outputs format: fp32:CHW
[06/16/2020-01:57:53] [I] Input build shapes: model
[06/16/2020-01:57:53] [I] Input calibration shapes: model
[06/16/2020-01:57:53] [I] === System Options ===
[06/16/2020-01:57:53] [I] Device: 0
[06/16/2020-01:57:53] [I] DLACore:
[06/16/2020-01:57:53] [I] Plugins:
[06/16/2020-01:57:53] [I] === Inference Options ===
[06/16/2020-01:57:53] [I] Batch: 1
[06/16/2020-01:57:53] [I] Input inference shapes: model
[06/16/2020-01:57:53] [I] Iterations: 10
[06/16/2020-01:57:53] [I] Duration: 3s (+ 200ms warm up)
[06/16/2020-01:57:53] [I] Sleep time: 0ms
[06/16/2020-01:57:53] [I] Streams: 1
[06/16/2020-01:57:53] [I] ExposeDMA: Disabled
[06/16/2020-01:57:53] [I] Spin-wait: Enabled
[06/16/2020-01:57:53] [I] Multithreading: Disabled
[06/16/2020-01:57:53] [I] CUDA Graph: Disabled
[06/16/2020-01:57:53] [I] Skip inference: Disabled
[06/16/2020-01:57:53] [I] Inputs:
[06/16/2020-01:57:53] [I] === Reporting Options ===
[06/16/2020-01:57:53] [I] Verbose: Disabled
[06/16/2020-01:57:53] [I] Averages: 10 inferences
[06/16/2020-01:57:53] [I] Percentile: 99
[06/16/2020-01:57:53] [I] Dump output: Disabled
[06/16/2020-01:57:53] [I] Profile: Disabled
[06/16/2020-01:57:53] [I] Export timing to JSON file:
[06/16/2020-01:57:53] [I] Export output to JSON file:
[06/16/2020-01:57:53] [I] Export profile to JSON file:
[06/16/2020-01:57:53] [I]
[06/16/2020-01:57:57] [I] Starting inference threads
[06/16/2020-01:58:01] [I] Warmup completed 2 queries over 200 ms
[06/16/2020-01:58:01] [I] Timing trace has 32 queries over 3.22652 s
[06/16/2020-01:58:01] [I] Trace averages of 10 runs:
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.122 ms - Host latency: 100.763 ms (end to end 100.769 ms)
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.161 ms - Host latency: 100.803 ms (end to end 100.809 ms)
[06/16/2020-01:58:01] [I] Average on 10 runs - GPU latency: 100.253 ms - Host latency: 100.896 ms (end to end 100.902 ms)
[06/16/2020-01:58:01] [I] Host latency
[06/16/2020-01:58:01] [I] min: 100.557 ms (end to end 100.562 ms)
[06/16/2020-01:58:01] [I] max: 102.077 ms (end to end 102.084 ms)
[06/16/2020-01:58:01] [I] mean: 100.822 ms (end to end 100.828 ms)
[06/16/2020-01:58:01] [I] median: 100.73 ms (end to end 100.737 ms)
[06/16/2020-01:58:01] [I] percentile: 102.077 ms at 99% (end to end 102.084 ms at 99%)
[06/16/2020-01:58:01] [I] throughput: 9.91782 qps
[06/16/2020-01:58:01] [I] walltime: 3.22652 s
[06/16/2020-01:58:01] [I] GPU Compute
[06/16/2020-01:58:01] [I] min: 99.9324 ms
[06/16/2020-01:58:01] [I] max: 101.429 ms
[06/16/2020-01:58:01] [I] mean: 100.181 ms
[06/16/2020-01:58:01] [I] median: 100.091 ms
[06/16/2020-01:58:01] [I] percentile: 101.429 ms at 99%
[06/16/2020-01:58:01] [I] total compute time: 3.20579 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet34_peoplenet_0.005.engine --fp16 --batch=1 --useSpinWait

==> the pruned engine runs at about 10 fps
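Putting the two runs side by side, pruning at pth=0.005 gives roughly a 2.5x speedup on the Nano (throughput values copied from the two trtexec logs above):

```python
unpruned_fps = 3.90913  # throughput of the unpruned FP16 engine
pruned_fps = 9.91782    # throughput of the pth=0.005 pruned FP16 engine

speedup = pruned_fps / unpruned_fps
print(f"speedup: {speedup:.2f}x")
```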