Nvidia/retinanet-examples Network running VERY slow on Jetson Xavier

Hello,

This is a follow-on post from a different thread, but I thought it made more sense to start a new thread.

I’ve been using the repo at https://github.com/NVIDIA/retinanet-examples to develop a RetinaNet-based network to run in DeepStream on a Jetson AGX Xavier.

In short, after going through the pth->onnx->plan conversion process, my result is that the network runs in deepstream-app, but VERY slowly (<1 fps). So I don’t know whether the problem is my model, the DeepStream config files, or whether the RetinaNet model is simply too heavy for the Jetson Xavier.

If you’re interested, here’s a link to a zip file containing the .pth file from training, the .onnx file converted from it, and the TRT (.plan) file from the onnx->tensorrt conversion.

https://drive.google.com/file/d/1x_sE7eb564NCqIujcmiao1-2IQUFDdZw/view?usp=sharing

Here is the process I followed:

1 - (on Linux host, inside the retinanet-examples docker container) train the network using the code and process from the NVIDIA/retinanet-examples GitHub repo:

retinanet train face.pth --fine-tune retinanet_rn50fpn.pth --backbone ResNet50FPN  --classes 1 --iters 10000 --val-iters 1000 --lr 0.0005 --images /workspace  --annotations train.json --val-annotations test.json

2 - (on Linux host, inside the retinanet-examples docker container) convert the resulting .pth file to ONNX using:

retinanet export face.pth face.onnx

3 - (on Jetson) export the ONNX model to a TRT engine - THIS TAKES OVER 26 minutes!!!

./export face4.onnx face4.plan

4 - (on Jetson) following the instructions in the NVIDIA/retinanet-examples README, edit the DeepStream config files, build the output-parsing plugin, and run deepstream-app.

Here is the deepstream config file (ds_config_1vid.txt):

# Copyright (c) 2018 NVIDIA Corporation.  All rights reserved.
#
# NVIDIA Corporation and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto.  Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA Corporation is strictly prohibited.

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=1

[tiled-display]
enable=0
rows=1
columns=1
width=1280
height=720
gpu-id=0

[source0]
enable=1
type=2
num-sources=1
uri=file:/xavier_ssd/sample_1080p_h264.mp4
gpu-id=0

[streammux]
gpu-id=0
batch-size=1
#batched-push-timeout=-1
## Set muxer output width and height
#width=1280
#height=720
width=640
height=480
#cuda-memory-type=1
enable-padding=1

[sink0]
enable=1
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265 3=mpeg4
## only SW mpeg4 is supported right now.
codec=1
sync=0
bitrate=80000000
output-file=/xavier_ssd/output.mp4
source-id=0

[sink1]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=1
source-id=0
gpu-id=0
#cuda-memory-type=1


[osd]
enable=1
gpu-id=0
border-width=2
text-size=12
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Arial
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0

[primary-gie]
enable=1
gpu-id=0
batch-size=1
gie-unique-id=1
interval=0
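# Note: interval=N tells nvinfer to skip N frames between inference runs;
# raising it above 0 trades detection frequency for pipeline fps.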
labelfile-path=labels_coco.txt
#model-engine-file=/xavier_ssd/face.plan
config-file=infer_config_batch1.txt

Here’s the inference engine config file (infer_config_batch1.txt):

# Copyright (c) 2018 NVIDIA Corporation.  All rights reserved.
# NVIDIA Corporation and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto.  Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA Corporation is strictly prohibited.

# Following properties are mandatory when engine files are not specified:
#   int8-calib-file(Only in INT8)
#   Caffemodel mandatory properties: model-file, proto-file, output-blob-names
#   UFF: uff-file, input-dims, uff-input-blob-name, output-blob-names
#   ONNX: onnx-file
#
# Mandatory properties for detectors:
#   parse-func, num-detected-classes,
#   custom-lib-path (when parse-func=0 i.e. custom),
#   parse-bbox-func-name (when parse-func=0)
#
# Optional properties for detectors:
#   enable-dbscan(Default=false), interval(Primary mode only, Default=0)
#
# Mandatory properties for classifiers:
#   classifier-threshold, is-classifier
#
# Optional properties for classifiers:
#   classifier-async-mode(Secondary mode only, Default=false)
#
# Optional properties in secondary mode:
#   operate-on-gie-id(Default=0), operate-on-class-ids(Defaults to all classes),
#   input-object-min-width, input-object-min-height, input-object-max-width,
#   input-object-max-height
#
# Following properties are always recommended:
#   batch-size(Default=1)
#
# Other optional properties:
#   net-scale-factor(Default=1), network-mode(Default=0 i.e FP32),
#   model-color-format(Default=0 i.e. RGB) model-engine-file, labelfile-path,
#   mean-file, gie-unique-id(Default=0), offsets, gie-mode (Default=1 i.e. primary),
#   custom-lib-path, network-mode(Default=0 i.e FP32)
#
# The values in the config file are overridden by values set through GObject
# properties.

[property]
gpu-id=0
net-scale-factor=0.017352074
offsets=123.675;116.28;103.53
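# Note: nvinfer preprocesses as y = net-scale-factor * (x - offsets); the
# values above appear to be the standard ImageNet statistics (mean*255 for
# offsets, and roughly 1/(std*255) for the scale factor).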
model-engine-file=/xavier_ssd/face.plan
labelfile-path=labels_coco.txt
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=1
interval=0
gie-unique-id=1
parse-func=0
is-classifier=0
output-blob-names=boxes;scores;classes
parse-bbox-func-name=NvDsInferParseRetinaNet
custom-lib-path=build/libnvdsparsebbox_retinanet.so
#enable-dbscan=1


[class-attrs-all]
threshold=0.5
group-threshold=0
## Set eps=0.7 and minBoxes for enable-dbscan=1
#eps=0.2
##minBoxes=3
#roi-top-offset=0
#roi-bottom-offset=0
detected-min-w=4
detected-min-h=4
#detected-max-w=0
#detected-max-h=0

## Per class configuration
#[class-attrs-2]
#threshold=0.6
#eps=0.5
#group-threshold=3
#roi-top-offset=20
#roi-bottom-offset=10
#detected-min-w=40
#detected-min-h=40
#detected-max-w=400
#detected-max-h=800

Here’s the command to run deepstream:

LD_PRELOAD=build/libnvdsparsebbox_retinanet.so deepstream-app -c ds_config_1vid.txt

Here’s the output (at least the first part of it):

$ LD_PRELOAD=build/libnvdsparsebbox_retinanet.so deepstream-app -c ds_config_1vid.txt 
Unknown key 'parse-func' for group [property]
Opening in BLOCKING MODE 
Creating LL OSD context new

Runtime commands:
	h: Print this help
	q: Quit

	p: Pause
	r: Resume


**PERF: FPS 0 (Avg)	
**PERF: 0.00 (0.00)	
** INFO: <bus_callback:189>: Pipeline ready

Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
NvMMLiteBlockCreate : Block : BlockType = 261 
** INFO: <bus_callback:175>: Pipeline running

Creating LL OSD context new
NvMMLiteOpen : Block : BlockType = 4 
===== NVMEDIA: NVENC =====
NvMMLiteBlockCreate : Block : BlockType = 4 
H264: Profile = 66, Level = 0 
**PERF: 0.00 (0.00)	
**PERF: 5.15 (5.15)	
**PERF: 5.23 (5.18)	
**PERF: 5.21 (5.19)	
**PERF: 5.18 (5.19)	
**PERF: 5.21 (5.19)	
**PERF: 5.22 (5.20)	
**PERF: 5.18 (5.20)	
**PERF: 5.23 (5.20)	
**PERF: 5.19 (5.20)	
**PERF: 5.18 (5.20)	
**PERF: 5.22 (5.20)	
**PERF: 5.20 (5.20)	
**PERF: 5.20 (5.20)	
**PERF: 5.23 (5.20)	
**PERF: 5.23 (5.20)	
**PERF: 5.09 (5.20)	
**PERF: 4.59 (5.17)	
**PERF: 5.20 (5.17)	

**PERF: FPS 0 (Avg)	
**PERF: 5.19 (5.17)	
**PERF: 5.24 (5.17)	
**PERF: 5.21 (5.17)	
**PERF: 5.11 (5.17)	
**PERF: 4.91 (5.16)

Bottom Line Question: Why so slow?

Surely this RetinaNet network will run faster than 1 fps on the Xavier!

Thanks for any help you can provide.

I tested the face4.onnx you shared using trtexec on my Xavier. The per-frame latency is about 344 ms for fp32, 93 ms for fp16, and 52 ms for int8, i.e. roughly 1000/344 ≈ 2.9 fps, 1000/93 ≈ 10.7 fps, and 1000/52 ≈ 19.2 fps. I would suggest you use the fp16 or even the int8 model for inference; you can see the test results in the logs below.
BTW, where do you see the fps is 1? Could you share more details about that?

bcao@bcao-desktop:~$ /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx 
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx
[I] onnx: face4.onnx
----------------------------------------------------------------
Input filename:   face4.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    pytorch
Producer version: 1.2
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[I] Average over 10 runs is 346.869 ms (host walltime is 347.003 ms, 99% percentile time is 349.79).
[I] Average over 10 runs is 347.362 ms (host walltime is 347.468 ms, 99% percentile time is 360.275).
[I] Average over 10 runs is 350.055 ms (host walltime is 350.184 ms, 99% percentile time is 372.547).
[I] Average over 10 runs is 347.814 ms (host walltime is 347.931 ms, 99% percentile time is 370.035).
[I] Average over 10 runs is 346.898 ms (host walltime is 346.996 ms, 99% percentile time is 360.717).
[I] Average over 10 runs is 344.869 ms (host walltime is 344.967 ms, 99% percentile time is 346.129).
[I] Average over 10 runs is 345.243 ms (host walltime is 345.336 ms, 99% percentile time is 346.731).
[I] Average over 10 runs is 344.528 ms (host walltime is 344.612 ms, 99% percentile time is 345.678).
[I] Average over 10 runs is 344.022 ms (host walltime is 344.107 ms, 99% percentile time is 344.469).
[I] Average over 10 runs is 344.453 ms (host walltime is 344.544 ms, 99% percentile time is 344.899).
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx
bcao@bcao-desktop:~$ /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16
[I] onnx: face4.onnx
[I] fp16
----------------------------------------------------------------
Input filename:   face4.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    pytorch
Producer version: 1.2
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[I] Average over 10 runs is 93.7664 ms (host walltime is 93.8747 ms, 99% percentile time is 94.2441).
[I] Average over 10 runs is 93.6466 ms (host walltime is 93.7247 ms, 99% percentile time is 93.7568).
[I] Average over 10 runs is 93.7428 ms (host walltime is 93.8356 ms, 99% percentile time is 93.929).
[I] Average over 10 runs is 93.8257 ms (host walltime is 93.9256 ms, 99% percentile time is 93.9769).
[I] Average over 10 runs is 93.8853 ms (host walltime is 93.9771 ms, 99% percentile time is 94.1207).
[I] Average over 10 runs is 93.8356 ms (host walltime is 93.927 ms, 99% percentile time is 93.9305).
[I] Average over 10 runs is 93.8523 ms (host walltime is 93.9399 ms, 99% percentile time is 93.9562).
[I] Average over 10 runs is 93.8054 ms (host walltime is 93.8941 ms, 99% percentile time is 93.927).
[I] Average over 10 runs is 93.8223 ms (host walltime is 93.9027 ms, 99% percentile time is 93.9575).
[I] Average over 10 runs is 93.8284 ms (host walltime is 93.9021 ms, 99% percentile time is 93.9763).
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16
bcao@bcao-desktop:~$ /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --int8
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --int8
[I] onnx: face4.onnx
[I] int8
----------------------------------------------------------------
Input filename:   face4.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    pytorch
Producer version: 1.2
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[I] Average over 10 runs is 52.8027 ms (host walltime is 52.9207 ms, 99% percentile time is 52.9483).
[I] Average over 10 runs is 52.7385 ms (host walltime is 52.8257 ms, 99% percentile time is 52.9076).
[I] Average over 10 runs is 52.7454 ms (host walltime is 52.8259 ms, 99% percentile time is 52.8582).
[I] Average over 10 runs is 52.6919 ms (host walltime is 52.7741 ms, 99% percentile time is 52.7704).
[I] Average over 10 runs is 52.7077 ms (host walltime is 52.7981 ms, 99% percentile time is 52.8256).
[I] Average over 10 runs is 52.7182 ms (host walltime is 52.8011 ms, 99% percentile time is 52.8459).
[I] Average over 10 runs is 52.7461 ms (host walltime is 52.8267 ms, 99% percentile time is 52.8957).
[I] Average over 10 runs is 52.6918 ms (host walltime is 52.7808 ms, 99% percentile time is 52.8028).
[I] Average over 10 runs is 52.7374 ms (host walltime is 52.8145 ms, 99% percentile time is 52.8012).
[I] Average over 10 runs is 52.7022 ms (host walltime is 52.7841 ms, 99% percentile time is 52.7861).
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --int8

There must be something different about my Xavier or its installation. Here’s what I get when I run the same test; I see similar numbers whether I use fp32, fp16, or int8.

$ /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16
[00/21/2020-10:14:43] [I] === Model Options ===
[00/21/2020-10:14:43] [I] Format: ONNX
[00/21/2020-10:14:43] [I] Model: face4.onnx
[00/21/2020-10:14:43] [I] Output:
[00/21/2020-10:14:43] [I] === Build Options ===
[00/21/2020-10:14:43] [I] Max batch: 1
[00/21/2020-10:14:43] [I] Workspace: 16 MB
[00/21/2020-10:14:43] [I] minTiming: 1
[00/21/2020-10:14:43] [I] avgTiming: 8
[00/21/2020-10:14:43] [I] Precision: FP16
[00/21/2020-10:14:43] [I] Calibration: 
[00/21/2020-10:14:43] [I] Safe mode: Disabled
[00/21/2020-10:14:43] [I] Save engine: 
[00/21/2020-10:14:43] [I] Load engine: 
[00/21/2020-10:14:43] [I] Inputs format: fp32:CHW
[00/21/2020-10:14:43] [I] Outputs format: fp32:CHW
[00/21/2020-10:14:43] [I] Input build shapes: model
[00/21/2020-10:14:43] [I] === System Options ===
[00/21/2020-10:14:43] [I] Device: 0
[00/21/2020-10:14:43] [I] DLACore: 
[00/21/2020-10:14:43] [I] Plugins:
[00/21/2020-10:14:43] [I] === Inference Options ===
[00/21/2020-10:14:43] [I] Batch: 1
[00/21/2020-10:14:43] [I] Iterations: 10 (200 ms warm up)
[00/21/2020-10:14:43] [I] Duration: 10s
[00/21/2020-10:14:43] [I] Sleep time: 0ms
[00/21/2020-10:14:43] [I] Streams: 1
[00/21/2020-10:14:43] [I] Spin-wait: Disabled
[00/21/2020-10:14:43] [I] Multithreading: Enabled
[00/21/2020-10:14:43] [I] CUDA Graph: Disabled
[00/21/2020-10:14:43] [I] Skip inference: Disabled
[00/21/2020-10:14:43] [I] Input inference shapes: model
[00/21/2020-10:14:43] [I] === Reporting Options ===
[00/21/2020-10:14:43] [I] Verbose: Disabled
[00/21/2020-10:14:43] [I] Averages: 10 inferences
[00/21/2020-10:14:43] [I] Percentile: 99
[00/21/2020-10:14:43] [I] Dump output: Disabled
[00/21/2020-10:14:43] [I] Profile: Disabled
[00/21/2020-10:14:43] [I] Export timing to JSON file: 
[00/21/2020-10:14:43] [I] Export profile to JSON file: 
[00/21/2020-10:14:43] [I] 
----------------------------------------------------------------
Input filename:   face4.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    pytorch
Producer version: 1.2
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[00/21/2020-10:14:45] [I] [TRT] 
[00/21/2020-10:14:45] [I] [TRT] --------------- Layers running on DLA: 
[00/21/2020-10:14:45] [I] [TRT] 
[00/21/2020-10:14:45] [I] [TRT] --------------- Layers running on GPU: 
[00/21/2020-10:14:45] [I] [TRT] (Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation], (Unnamed Layer* 3) [Pooling], (Unnamed Layer* 4) [Convolution] + (Unnamed Layer* 6) [Activation], (Unnamed Layer* 7) [Convolution] + (Unnamed Layer* 9) [Activation], (Unnamed Layer* 12) [Convolution], (Unnamed Layer* 10) [Convolution] + (Unnamed Layer* 14) [ElementWise] + (Unnamed Layer* 15) [Activation], (Unnamed Layer* 16) [Convolution] + (Unnamed Layer* 18) [Activation], (Unnamed Layer* 19) [Convolution] + (Unnamed Layer* 21) [Activation], (Unnamed Layer* 22) [Convolution] + (Unnamed Layer* 24) [ElementWise] + (Unnamed Layer* 25) [Activation], (Unnamed Layer* 26) [Convolution] + (Unnamed Layer* 28) [Activation], (Unnamed Layer* 29) [Convolution] + (Unnamed Layer* 31) [Activation], (Unnamed Layer* 32) [Convolution] + (Unnamed Layer* 34) [ElementWise] + (Unnamed Layer* 35) [Activation], (Unnamed Layer* 36) [Convolution] + (Unnamed Layer* 38) [Activation], (Unnamed Layer* 39) [Convolution] + (Unnamed Layer* 41) [Activation], (Unnamed Layer* 44) [Convolution], (Unnamed Layer* 42) [Convolution] + (Unnamed Layer* 46) [ElementWise] + (Unnamed Layer* 47) [Activation], (Unnamed Layer* 48) [Convolution] + (Unnamed Layer* 50) [Activation], (Unnamed Layer* 51) [Convolution] + (Unnamed Layer* 53) [Activation], (Unnamed Layer* 54) [Convolution] + (Unnamed Layer* 56) [ElementWise] + (Unnamed Layer* 57) [Activation], (Unnamed Layer* 58) [Convolution] + (Unnamed Layer* 60) [Activation], (Unnamed Layer* 61) [Convolution] + (Unnamed Layer* 63) [Activation], (Unnamed Layer* 64) [Convolution] + (Unnamed Layer* 66) [ElementWise] + (Unnamed Layer* 67) [Activation], (Unnamed Layer* 68) [Convolution] + (Unnamed Layer* 70) [Activation], (Unnamed Layer* 71) [Convolution] + (Unnamed Layer* 73) [Activation], (Unnamed Layer* 74) [Convolution] + (Unnamed Layer* 76) [ElementWise] + (Unnamed Layer* 77) [Activation], (Unnamed Layer* 78) [Convolution] + (Unnamed Layer* 80) [Activation], (Unnamed Layer* 81) [Convolution] + (Unnamed Layer* 83) [Activation], (Unnamed Layer* 86) [Convolution], (Unnamed Layer* 84) [Convolution] + (Unnamed Layer* 88) [ElementWise] + (Unnamed Layer* 89) [Activation], (Unnamed Layer* 90) [Convolution] + (Unnamed Layer* 92) [Activation], (Unnamed Layer* 93) [Convolution] + (Unnamed Layer* 95) [Activation], (Unnamed Layer* 96) [Convolution] + (Unnamed Layer* 98) [ElementWise] + (Unnamed Layer* 99) [Activation], (Unnamed Layer* 100) [Convolution] + (Unnamed Layer* 102) [Activation], (Unnamed Layer* 103) [Convolution] + (Unnamed Layer* 105) [Activation], (Unnamed Layer* 106) [Convolution] + (Unnamed Layer* 108) [ElementWise] + (Unnamed Layer* 109) [Activation], (Unnamed Layer* 110) [Convolution] + (Unnamed Layer* 112) [Activation], (Unnamed Layer* 113) [Convolution] + (Unnamed Layer* 115) [Activation], (Unnamed Layer* 116) [Convolution] + (Unnamed Layer* 118) [ElementWise] + (Unnamed Layer* 119) [Activation], (Unnamed Layer* 120) [Convolution] + (Unnamed Layer* 122) [Activation], (Unnamed Layer* 123) [Convolution] + (Unnamed Layer* 125) [Activation], (Unnamed Layer* 126) [Convolution] + (Unnamed Layer* 128) [ElementWise] + (Unnamed Layer* 129) [Activation], (Unnamed Layer* 130) [Convolution] + (Unnamed Layer* 132) [Activation], (Unnamed Layer* 133) [Convolution] + (Unnamed Layer* 135) [Activation], (Unnamed Layer* 136) [Convolution] + (Unnamed Layer* 138) [ElementWise] + (Unnamed Layer* 139) [Activation], (Unnamed Layer* 140) [Convolution] + (Unnamed Layer* 142) [Activation], (Unnamed Layer* 143) 
[Convolution] + (Unnamed Layer* 145) [Activation], (Unnamed Layer* 148) [Convolution], (Unnamed Layer* 146) [Convolution] + (Unnamed Layer* 150) [ElementWise] + (Unnamed Layer* 151) [Activation], (Unnamed Layer* 152) [Convolution] + (Unnamed Layer* 154) [Activation], (Unnamed Layer* 155) [Convolution] + (Unnamed Layer* 157) [Activation], (Unnamed Layer* 158) [Convolution] + (Unnamed Layer* 160) [ElementWise] + (Unnamed Layer* 161) [Activation], (Unnamed Layer* 162) [Convolution] + (Unnamed Layer* 164) [Activation], (Unnamed Layer* 165) [Convolution] + (Unnamed Layer* 167) [Activation], (Unnamed Layer* 168) [Convolution] + (Unnamed Layer* 170) [ElementWise] + (Unnamed Layer* 171) [Activation], (Unnamed Layer* 172) [Convolution], (Unnamed Layer* 174) [Resize], (Unnamed Layer* 173) [Convolution] + (Unnamed Layer* 175) [ElementWise], (Unnamed Layer* 177) [Resize], (Unnamed Layer* 176) [Convolution] + (Unnamed Layer* 178) [ElementWise], (Unnamed Layer* 179) [Convolution], (Unnamed Layer* 180) [Activation], (Unnamed Layer* 181) [Convolution], (Unnamed Layer* 182) [Convolution], (Unnamed Layer* 183) [Convolution], (Unnamed Layer* 184) [Convolution], (Unnamed Layer* 185) [Convolution] + (Unnamed Layer* 186) [Activation] || (Unnamed Layer* 230) [Convolution] + (Unnamed Layer* 231) [Activation], (Unnamed Layer* 187) [Convolution] + (Unnamed Layer* 188) [Activation], (Unnamed Layer* 189) [Convolution] + (Unnamed Layer* 190) [Activation], (Unnamed Layer* 191) [Convolution] + (Unnamed Layer* 192) [Activation], (Unnamed Layer* 193) [Convolution], (Unnamed Layer* 194) [Convolution] + (Unnamed Layer* 195) [Activation] || (Unnamed Layer* 239) [Convolution] + (Unnamed Layer* 240) [Activation], (Unnamed Layer* 196) [Convolution] + (Unnamed Layer* 197) [Activation], (Unnamed Layer* 198) [Convolution] + (Unnamed Layer* 199) [Activation], (Unnamed Layer* 200) [Convolution] + (Unnamed Layer* 201) [Activation], (Unnamed Layer* 202) [Convolution], (Unnamed Layer* 203) [Convolution] + (Unnamed Layer* 204) [Activation] || (Unnamed Layer* 248) [Convolution] + (Unnamed Layer* 249) [Activation], (Unnamed Layer* 205) [Convolution] + (Unnamed Layer* 206) [Activation], (Unnamed Layer* 207) [Convolution] + (Unnamed Layer* 208) [Activation], (Unnamed Layer* 209) [Convolution] + (Unnamed Layer* 210) [Activation], (Unnamed Layer* 211) [Convolution], (Unnamed Layer* 212) [Convolution] + (Unnamed Layer* 213) [Activation] || (Unnamed Layer* 257) [Convolution] + (Unnamed Layer* 258) [Activation], (Unnamed Layer* 214) [Convolution] + (Unnamed Layer* 215) [Activation], (Unnamed Layer* 216) [Convolution] + (Unnamed Layer* 217) [Activation], (Unnamed Layer* 218) [Convolution] + (Unnamed Layer* 219) [Activation], (Unnamed Layer* 220) [Convolution], (Unnamed Layer* 221) [Convolution] + (Unnamed Layer* 222) [Activation] || (Unnamed Layer* 266) [Convolution] + (Unnamed Layer* 267) [Activation], (Unnamed Layer* 223) [Convolution] + (Unnamed Layer* 224) [Activation], (Unnamed Layer* 225) [Convolution] + (Unnamed Layer* 226) [Activation], (Unnamed Layer* 227) [Convolution] + (Unnamed Layer* 228) [Activation], (Unnamed Layer* 229) [Convolution], (Unnamed Layer* 232) [Convolution] + (Unnamed Layer* 233) [Activation], (Unnamed Layer* 234) [Convolution] + (Unnamed Layer* 235) [Activation], (Unnamed Layer* 236) [Convolution] + (Unnamed Layer* 237) [Activation], (Unnamed Layer* 238) [Convolution], (Unnamed Layer* 241) [Convolution] + (Unnamed Layer* 242) [Activation], (Unnamed Layer* 243) [Convolution] + (Unnamed Layer* 244) [Activation], 
(Unnamed Layer* 245) [Convolution] + (Unnamed Layer* 246) [Activation], (Unnamed Layer* 247) [Convolution], (Unnamed Layer* 250) [Convolution] + (Unnamed Layer* 251) [Activation], (Unnamed Layer* 252) [Convolution] + (Unnamed Layer* 253) [Activation], (Unnamed Layer* 254) [Convolution] + (Unnamed Layer* 255) [Activation], (Unnamed Layer* 256) [Convolution], (Unnamed Layer* 259) [Convolution] + (Unnamed Layer* 260) [Activation], (Unnamed Layer* 261) [Convolution] + (Unnamed Layer* 262) [Activation], (Unnamed Layer* 263) [Convolution] + (Unnamed Layer* 264) [Activation], (Unnamed Layer* 265) [Convolution], (Unnamed Layer* 268) [Convolution] + (Unnamed Layer* 269) [Activation], (Unnamed Layer* 270) [Convolution] + (Unnamed Layer* 271) [Activation], (Unnamed Layer* 272) [Convolution] + (Unnamed Layer* 273) [Activation], (Unnamed Layer* 274) [Convolution], (Unnamed Layer* 275) [Activation], (Unnamed Layer* 276) [Activation], (Unnamed Layer* 277) [Activation], (Unnamed Layer* 278) [Activation], (Unnamed Layer* 279) [Activation], 
[00/21/2020-10:14:51] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.

After looking at this further, it seems that I have at least two problems:

  1. My onnx model was created with ONNX IR version 0.0.4, but my TensorRT (from JetPack 4.3) has a parser built against 0.0.3; hence the "newer ir_version" warning in the trtexec output above and below.

  2. By default (on my system), there is not enough workspace memory to run some tactics. It appears that I can get rid of that warning ("Some tactics do not have sufficient workspace memory to run", above) by adding --workspace=256 to the trtexec runtime options:

/usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16 --saveEngine=/xavier_ssd/face4_fp16.plan --workspace=256

The output now (note that I ended up bumping the workspace further, to 640 MB) is:

$ /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16 --saveEngine=/xavier_ssd/face4_fp16.plan --workspace=640
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16 --saveEngine=/xavier_ssd/face4_fp16.plan --workspace=640
[00/21/2020-12:53:15] [I] === Model Options ===
[00/21/2020-12:53:15] [I] Format: ONNX
[00/21/2020-12:53:15] [I] Model: face4.onnx
[00/21/2020-12:53:15] [I] Output:
[00/21/2020-12:53:15] [I] === Build Options ===
[00/21/2020-12:53:15] [I] Max batch: 1
[00/21/2020-12:53:15] [I] Workspace: 640 MB
[00/21/2020-12:53:15] [I] minTiming: 1
[00/21/2020-12:53:15] [I] avgTiming: 8
[00/21/2020-12:53:15] [I] Precision: FP16
[00/21/2020-12:53:15] [I] Calibration: 
[00/21/2020-12:53:15] [I] Safe mode: Disabled
[00/21/2020-12:53:15] [I] Save engine: /xavier_ssd/face4_fp16.plan
[00/21/2020-12:53:15] [I] Load engine: 
[00/21/2020-12:53:15] [I] Inputs format: fp32:CHW
[00/21/2020-12:53:15] [I] Outputs format: fp32:CHW
[00/21/2020-12:53:15] [I] Input build shapes: model
[00/21/2020-12:53:15] [I] === System Options ===
[00/21/2020-12:53:15] [I] Device: 0
[00/21/2020-12:53:15] [I] DLACore: 
[00/21/2020-12:53:15] [I] Plugins:
[00/21/2020-12:53:15] [I] === Inference Options ===
[00/21/2020-12:53:15] [I] Batch: 1
[00/21/2020-12:53:15] [I] Iterations: 10 (200 ms warm up)
[00/21/2020-12:53:15] [I] Duration: 10s
[00/21/2020-12:53:15] [I] Sleep time: 0ms
[00/21/2020-12:53:15] [I] Streams: 1
[00/21/2020-12:53:15] [I] Spin-wait: Disabled
[00/21/2020-12:53:15] [I] Multithreading: Enabled
[00/21/2020-12:53:15] [I] CUDA Graph: Disabled
[00/21/2020-12:53:15] [I] Skip inference: Disabled
[00/21/2020-12:53:15] [I] Input inference shapes: model
[00/21/2020-12:53:15] [I] === Reporting Options ===
[00/21/2020-12:53:15] [I] Verbose: Disabled
[00/21/2020-12:53:15] [I] Averages: 10 inferences
[00/21/2020-12:53:15] [I] Percentile: 99
[00/21/2020-12:53:15] [I] Dump output: Disabled
[00/21/2020-12:53:15] [I] Profile: Disabled
[00/21/2020-12:53:15] [I] Export timing to JSON file: 
[00/21/2020-12:53:15] [I] Export profile to JSON file: 
[00/21/2020-12:53:15] [I] 
----------------------------------------------------------------
Input filename:   face4.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    pytorch
Producer version: 1.2
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[00/21/2020-12:53:17] [I] [TRT] 
[00/21/2020-12:53:17] [I] [TRT] --------------- Layers running on DLA: 
[00/21/2020-12:53:17] [I] [TRT] 
[00/21/2020-12:53:17] [I] [TRT] --------------- Layers running on GPU: 
[... same GPU layer placement list as in the previous trtexec run ...]
[00/21/2020-13:13:16] [I] [TRT] Detected 1 inputs and 10 output network tensors.
[00/21/2020-13:13:26] [I] Average over 10 runs is 188.97 ms (host walltime is 189.147 ms, 99% percentile time is 227.625).
[00/21/2020-13:13:28] [I] Average over 10 runs is 185.004 ms (host walltime is 185.164 ms, 99% percentile time is 187.268).
[00/21/2020-13:13:30] [I] Average over 10 runs is 184.936 ms (host walltime is 185.094 ms, 99% percentile time is 190.731).
[00/21/2020-13:13:32] [I] Average over 10 runs is 184.605 ms (host walltime is 184.974 ms, 99% percentile time is 187.53).
[00/21/2020-13:13:34] [I] Average over 10 runs is 184.881 ms (host walltime is 185.031 ms, 99% percentile time is 190.574).
[00/21/2020-13:13:36] [I] Average over 10 runs is 185.219 ms (host walltime is 185.358 ms, 99% percentile time is 190.553).
[00/21/2020-13:13:37] [I] Average over 10 runs is 184.483 ms (host walltime is 184.625 ms, 99% percentile time is 187.044).
[00/21/2020-13:13:39] [I] Average over 10 runs is 185.186 ms (host walltime is 185.34 ms, 99% percentile time is 190.219).
[00/21/2020-13:13:41] [I] Average over 10 runs is 184.913 ms (host walltime is 185.199 ms, 99% percentile time is 187.441).
[00/21/2020-13:13:43] [I] Average over 10 runs is 184.882 ms (host walltime is 185.037 ms, 99% percentile time is 187.621).
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=face4.onnx --fp16 --saveEngine=/xavier_ssd/face4_fp16.plan --workspace=640
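
For what it’s worth, the same workspace knob exists in the TensorRT builder API, in case a programmatic engine build (like the repo’s ./export tool) needs the same bump. A minimal sketch with the TensorRT 6 Python bindings (the 256 MB value just mirrors --workspace=256; adjust as needed):

import tensorrt as trt

# Give the builder more scratch space so more tactics can be timed.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
builder.max_workspace_size = 256 << 20  # 256 MB, same as trtexec --workspace=256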

How can I upgrade the onnx parser to 0.0.4? Please tell me I don’t have to re-flash my Jetson.
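
One workaround I’ve seen suggested (untested here, and presumably only safe if the model doesn’t actually use any IR-0.0.4-only features) is to rewrite the ir_version stamp in the ONNX file so the older parser accepts it without the warning. A minimal sketch, assuming the onnx Python package is installed:

import onnx

# Hypothetical workaround: re-stamp the model as ONNX IR version 3 (reported
# as "0.0.3" by the parser). This only changes the version field in the model
# proto; it does not convert any operators.
model = onnx.load("face4.onnx")
model.ir_version = 3
onnx.save(model, "face4_ir3.onnx")

That said, the warning itself may be harmless, since the parser evidently still reads the model.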

Regardless, I hope that addressing these two issues will get us better performance running this model in DeepStream.

Have you set the power mode to MAXN and boosted the system clocks by using:

sudo nvpmodel -m 0
sudo jetson_clocks

Yes. Doesn’t seem to help.

OK, we will check it internally.

It seems setting the power mode failed.
Can you show the output of $ sudo nvpmodel -q ?
Is it like this?

NV Fan Mode:quiet
NV Power Mode: MAXN
0

Below is my test on my Xavier platform / JetPack 4.3 / TensorRT 6.0.1.10-1+cuda10.0:

$ trtexec --onnx=./face4.onnx

[00/30/2020-11:23:28] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[00/30/2020-11:30:17] [I] [TRT] Detected 1 inputs and 10 output network tensors.
[00/30/2020-11:30:24] [I] Average over 10 runs is 682.92 ms (host walltime is 683.099 ms, 99% percentile time is 686.469).
[00/30/2020-11:30:31] [I] Average over 10 runs is 682.432 ms (host walltime is 682.598 ms, 99% percentile time is 687.037).
[00/30/2020-11:30:38] [I] Average over 10 runs is 682.47 ms (host walltime is 682.624 ms, 99% percentile time is 686.418).
[00/30/2020-11:30:45] [I] Average over 10 runs is 682.36 ms (host walltime is 682.561 ms, 99% percentile time is 684.577).
[00/30/2020-11:30:52] [I] Average over 10 runs is 682.333 ms (host walltime is 682.475 ms, 99% percentile time is 686.181).
[00/30/2020-11:30:59] [I] Average over 10 runs is 682.366 ms (host walltime is 682.52 ms, 99% percentile time is 685.488).
[00/30/2020-11:31:05] [I] Average over 10 runs is 682.899 ms (host walltime is 683.046 ms, 99% percentile time is 686.398).
[00/30/2020-11:31:12] [I] Average over 10 runs is 682.358 ms (host walltime is 682.514 ms, 99% percentile time is 684.83).
[00/30/2020-11:31:19] [I] Average over 10 runs is 682.654 ms (host walltime is 682.789 ms, 99% percentile time is 685.62).
[00/30/2020-11:31:26] [I] Average over 10 runs is 682.444 ms (host walltime is 682.584 ms, 99% percentile time is 686.447).

$ trtexec --workspace=256 --fp16 --onnx=./face4.onnx

[00/30/2020-13:07:52] [I] [TRT] Detected 1 inputs and 10 output network tensors.
[00/30/2020-13:07:56] [I] Average over 10 runs is 189.967 ms (host walltime is 190.179 ms, 99% percentile time is 237.346).
[00/30/2020-13:07:57] [I] Average over 10 runs is 184.394 ms (host walltime is 184.542 ms, 99% percentile time is 186.663).
[00/30/2020-13:07:59] [I] Average over 10 runs is 184.784 ms (host walltime is 185.287 ms, 99% percentile time is 186.648).
[00/30/2020-13:08:01] [I] Average over 10 runs is 185.324 ms (host walltime is 185.503 ms, 99% percentile time is 188.97).
[00/30/2020-13:08:03] [I] Average over 10 runs is 184.51 ms (host walltime is 184.663 ms, 99% percentile time is 186.461).
[00/30/2020-13:08:05] [I] Average over 10 runs is 184.64 ms (host walltime is 184.794 ms, 99% percentile time is 186.77).
[00/30/2020-13:08:07] [I] Average over 10 runs is 184.532 ms (host walltime is 184.669 ms, 99% percentile time is 186.53).
[00/30/2020-13:08:08] [I] Average over 10 runs is 184.597 ms (host walltime is 184.741 ms, 99% percentile time is 187.155).
[00/30/2020-13:08:10] [I] Average over 10 runs is 184.649 ms (host walltime is 184.794 ms, 99% percentile time is 186.865).
[00/30/2020-13:08:12] [I] Average over 10 runs is 184.633 ms (host walltime is 184.771 ms, 99% percentile time is 186.765).

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ trtexec --workspace=256 --fp16 --onnx=./face4.onnx

[00/30/2020-13:22:02] [I] [TRT] Detected 1 inputs and 10 output network tensors.
[00/30/2020-13:22:04] [I] Average over 10 runs is 94.8276 ms (host walltime is 94.9591 ms, 99% percentile time is 96.4387).
[00/30/2020-13:22:05] [I] Average over 10 runs is 94.8825 ms (host walltime is 94.9819 ms, 99% percentile time is 96.3727).
[00/30/2020-13:22:06] [I] Average over 10 runs is 94.9377 ms (host walltime is 95.0429 ms, 99% percentile time is 96.4258).
[00/30/2020-13:22:07] [I] Average over 10 runs is 94.7457 ms (host walltime is 94.8559 ms, 99% percentile time is 96.3317).
[00/30/2020-13:22:08] [I] Average over 10 runs is 94.7348 ms (host walltime is 94.8515 ms, 99% percentile time is 96.2863).
[00/30/2020-13:22:09] [I] Average over 10 runs is 94.8972 ms (host walltime is 94.9951 ms, 99% percentile time is 96.5192).
[00/30/2020-13:22:10] [I] Average over 10 runs is 94.9239 ms (host walltime is 95.0268 ms, 99% percentile time is 96.4173).
[00/30/2020-13:22:11] [I] Average over 10 runs is 94.9434 ms (host walltime is 95.0389 ms, 99% percentile time is 96.3986).
[00/30/2020-13:22:11] [I] Average over 10 runs is 94.7136 ms (host walltime is 94.8192 ms, 99% percentile time is 96.3646).
[00/30/2020-13:22:12] [I] Average over 10 runs is 94.724 ms (host walltime is 94.8299 ms, 99% percentile time is 96.1695).

I get exactly the same as you:

NV Fan Mode:quiet
NV Power Mode: MAXN
0

I’ve since found that when I convert to ONNX, not all of the layers are converted; in particular, the output layers are not part of the resulting onnx file. Thus, when I read the onnx file into DeepStream (converting it to TRT), the custom output-parsing function causes a segmentation fault because it can’t find the output layers. My guess is that trtexec doesn’t hit this because it doesn’t care about output nodes.
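
A quick way to confirm this (a sketch, assuming the onnx Python package is available) is to load the exported file and print the graph outputs it declares; the DeepStream parser expects the boxes, scores, and classes tensors named by output-blob-names in the infer config:

import onnx

# Print the inputs and outputs declared in the exported ONNX graph. If
# "boxes", "scores" and "classes" are not listed as outputs, the custom
# bbox-parsing function has nothing to bind to.
model = onnx.load("face4.onnx")
print("inputs: ", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])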

I posted this issue on the github issues forum for Nvidia/Retinanet-examples:
https://github.com/NVIDIA/retinanet-examples/issues/118

I don’t think I convinced anyone that my problem was real. It was closed without resolution.

I guess I may try YOLOv3 next.