That was my mistake: the final feature sizes of the heads did not match between the two models. After correcting it, I got similar timings.
However, I have another timing issue. I compared the timings of object detection models on a TX1 with JetPack 3.3 and JetPack 4.4, and JetPack 3.3 gives better timings; JetPack 4.4 is about 10% slower. What do you advise so that JetPack 4.4 can match the JetPack 3.3 timings?
Here are the timings I got with trtexec:
WITH JP 3.3:
/usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc --output=mbox_conf_flatten --output=mbox_prior
deploy: deploy_noPP.prototxt.trt.part_0
model: deploy_noPP.caffemodel
output: mbox_loc
output: mbox_conf_flatten
output: mbox_prior
Input "data": 3x320x640
Input "prior.0": 2x19200x1
Input "prior.1": 2x4800x1
Input "prior.2": 2x1200x1
Input "prior.3": 2x360x1
Input "prior.4": 2x72x1
Output "mbox_loc": 25632x1x1
Output "mbox_conf_flatten": 12816x1x1
Output "mbox_prior": 2x25632x1
name=data, bindingIndex=0, buffers.size()=9
name=prior.0, bindingIndex=1, buffers.size()=9
name=prior.1, bindingIndex=2, buffers.size()=9
name=prior.2, bindingIndex=3, buffers.size()=9
name=prior.3, bindingIndex=4, buffers.size()=9
name=prior.4, bindingIndex=5, buffers.size()=9
name=mbox_loc, bindingIndex=6, buffers.size()=9
name=mbox_conf_flatten, bindingIndex=7, buffers.size()=9
name=mbox_prior, bindingIndex=8, buffers.size()=9
Average over 10 runs is 28.5459 ms (percentile time is 28.753).
Average over 10 runs is 28.8295 ms (percentile time is 31.462).
Average over 10 runs is 28.5278 ms (percentile time is 28.5971).
Average over 10 runs is 28.586 ms (percentile time is 28.6647).
Average over 10 runs is 28.5782 ms (percentile time is 28.6974).
Average over 10 runs is 28.5941 ms (percentile time is 28.7016).
Average over 10 runs is 28.5956 ms (percentile time is 28.6871).
Average over 10 runs is 28.5814 ms (percentile time is 28.6945).
Average over 10 runs is 28.5708 ms (percentile time is 28.6436).
Average over 10 runs is 28.5697 ms (percentile time is 28.6441).
WITH JP 4.4:
/usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc,mbox_conf_flatten,mbox_prior
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc,mbox_conf_flatten,mbox_prior
[09/25/2020-14:08:44] [I] === Model Options ===
[09/25/2020-14:08:44] [I] Format: Caffe
[09/25/2020-14:08:44] [I] Model: deploy_noPP.caffemodel
[09/25/2020-14:08:44] [I] Prototxt: deploy_noPP.prototxt.trt.part_0
[09/25/2020-14:08:44] [I] Output: mbox_loc mbox_conf_flatten mbox_prior
[09/25/2020-14:08:44] [I] === Build Options ===
[09/25/2020-14:08:44] [I] Max batch: 1
[09/25/2020-14:08:44] [I] Workspace: 16 MB
[09/25/2020-14:08:44] [I] minTiming: 1
[09/25/2020-14:08:44] [I] avgTiming: 8
[09/25/2020-14:08:44] [I] Precision: FP32
[09/25/2020-14:08:44] [I] Calibration:
[09/25/2020-14:08:44] [I] Safe mode: Disabled
[09/25/2020-14:08:44] [I] Save engine:
[09/25/2020-14:08:44] [I] Load engine:
[09/25/2020-14:08:44] [I] Builder Cache: Enabled
[09/25/2020-14:08:44] [I] NVTX verbosity: 0
[09/25/2020-14:08:44] [I] Inputs format: fp32:CHW
[09/25/2020-14:08:44] [I] Outputs format: fp32:CHW
[09/25/2020-14:08:44] [I] Input build shapes: model
[09/25/2020-14:08:44] [I] Input calibration shapes: model
[09/25/2020-14:08:44] [I] === System Options ===
[09/25/2020-14:08:44] [I] Device: 0
[09/25/2020-14:08:44] [I] DLACore:
[09/25/2020-14:08:44] [I] Plugins:
[09/25/2020-14:08:44] [I] === Inference Options ===
[09/25/2020-14:08:44] [I] Batch: 1
[09/25/2020-14:08:44] [I] Input inference shapes: model
[09/25/2020-14:08:44] [I] Iterations: 10
[09/25/2020-14:08:44] [I] Duration: 3s (+ 200ms warm up)
[09/25/2020-14:08:44] [I] Sleep time: 0ms
[09/25/2020-14:08:44] [I] Streams: 1
[09/25/2020-14:08:44] [I] ExposeDMA: Disabled
[09/25/2020-14:08:44] [I] Spin-wait: Disabled
[09/25/2020-14:08:44] [I] Multithreading: Disabled
[09/25/2020-14:08:44] [I] CUDA Graph: Disabled
[09/25/2020-14:08:44] [I] Skip inference: Disabled
[09/25/2020-14:08:44] [I] Inputs:
[09/25/2020-14:08:44] [I] === Reporting Options ===
[09/25/2020-14:08:44] [I] Verbose: Disabled
[09/25/2020-14:08:44] [I] Averages: 10 inferences
[09/25/2020-14:08:44] [I] Percentile: 99
[09/25/2020-14:08:44] [I] Dump output: Disabled
[09/25/2020-14:08:44] [I] Profile: Disabled
[09/25/2020-14:08:44] [I] Export timing to JSON file:
[09/25/2020-14:08:44] [I] Export output to JSON file:
[09/25/2020-14:08:44] [I] Export profile to JSON file:
[09/25/2020-14:08:44] [I]
[09/25/2020-14:09:58] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[09/25/2020-14:10:53] [I] [TRT] Detected 6 inputs and 13 output network tensors.
[09/25/2020-14:10:53] [I] Starting inference threads
[09/25/2020-14:10:57] [I] Warmup completed 7 queries over 200 ms
[09/25/2020-14:10:57] [I] Timing trace has 95 queries over 3.06308 s
[09/25/2020-14:10:57] [I] Trace averages of 10 runs:
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.1345 ms - Host latency: 32.4467 ms (end to end 32.4558 ms, enqueue 8.37465 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.6689 ms - Host latency: 31.9755 ms (end to end 31.9849 ms, enqueue 8.03756 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.083 ms - Host latency: 32.3894 ms (end to end 32.399 ms, enqueue 8.20562 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.8626 ms - Host latency: 32.1704 ms (end to end 32.18 ms, enqueue 8.29167 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.1569 ms - Host latency: 32.4637 ms (end to end 32.4731 ms, enqueue 8.24586 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.0583 ms - Host latency: 32.3659 ms (end to end 32.3753 ms, enqueue 8.31488 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.8006 ms - Host latency: 32.1084 ms (end to end 32.1177 ms, enqueue 8.27681 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.6394 ms - Host latency: 31.9441 ms (end to end 31.9534 ms, enqueue 8.35918 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.0319 ms - Host latency: 32.3396 ms (end to end 32.3492 ms, enqueue 8.31528 ms)
[09/25/2020-14:10:57] [I] Host Latency
[09/25/2020-14:10:57] [I] min: 31.7332 ms (end to end 31.7427 ms)
[09/25/2020-14:10:57] [I] max: 36.2266 ms (end to end 36.2377 ms)
[09/25/2020-14:10:57] [I] mean: 32.233 ms (end to end 32.2423 ms)
[09/25/2020-14:10:57] [I] median: 31.9709 ms (end to end 31.9802 ms)
[09/25/2020-14:10:57] [I] percentile: 36.2266 ms at 99% (end to end 36.2377 ms at 99%)
[09/25/2020-14:10:57] [I] throughput: 31.0146 qps
[09/25/2020-14:10:57] [I] walltime: 3.06308 s
[09/25/2020-14:10:57] [I] Enqueue Time
[09/25/2020-14:10:57] [I] min: 6.81958 ms
[09/25/2020-14:10:57] [I] max: 9.20395 ms
[09/25/2020-14:10:57] [I] median: 8.27722 ms
[09/25/2020-14:10:57] [I] GPU Compute
[09/25/2020-14:10:57] [I] min: 31.4289 ms
[09/25/2020-14:10:57] [I] max: 35.9196 ms
[09/25/2020-14:10:57] [I] mean: 31.9256 ms
[09/25/2020-14:10:57] [I] median: 31.6644 ms
[09/25/2020-14:10:57] [I] percentile: 35.9196 ms at 99%
[09/25/2020-14:10:57] [I] total compute time: 3.03293 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc,mbox_conf_flatten,mbox_prior
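As a quick sanity check of the "about 10%" figure, here is a small shell computation using the mean GPU times from the two logs above (roughly 28.58 ms on JetPack 3.3 and 31.93 ms mean GPU compute on JetPack 4.4; both values are read off the logs, not re-measured):

```shell
# Approximate slowdown of JP 4.4 relative to JP 3.3, from the means above.
jp33=28.58
jp44=31.93
awk -v a="$jp33" -v b="$jp44" \
    'BEGIN { printf "JP 4.4 is %.1f%% slower\n", (b - a) / a * 100 }'
```

Before comparing again, I would also make sure both boards run at maximum clocks (e.g. `sudo jetson_clocks` on JetPack 4.x) and try a larger builder workspace (e.g. `trtexec --workspace=256`), since the JetPack 4.4 build log warns that some tactics did not have sufficient workspace memory; whether that recovers the gap on a TX1 is an open question.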