Inference time changes after training

Description

To satisfy the time budget of a project, I benchmarked randomly initialized object detection models and measured their inference timings on a TX1. But after training the model that satisfied the time budget, I cannot reproduce the same inference timings. Is this expected?
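
As a rough illustration, the timing-only baselines can be produced by exporting a randomly initialized network to ONNX (a minimal PyTorch sketch; the architecture and file name are placeholders, not the actual project model):

# Sketch: export a randomly initialized network to ONNX purely to
# measure inference time (the weights are untrained).
import torch
import torchvision

# resnet18 is a stand-in; any candidate architecture follows the same pattern.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 320, 640)  # batch x channels x height x width
torch.onnx.export(model, dummy, "random_init.onnx", opset_version=11)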

Environment

TX1 - JetPack 4.4

Hi @Bozkalayci1,
I think the JetPack team should be able to help you better here.

Thanks!

The ONNX files are run through trtexec, and the timings for the initial and trained models differ significantly.
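
For reference, the benchmark invocation is of this rough form (the file name is a placeholder):

/usr/src/tensorrt/bin/trtexec --onnx=trained_model.onnx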

TRAINED MODEL:
[09/17/2020-17:11:40] [I] GPU Compute
[09/17/2020-17:11:40] [I] min: 17.8423 ms
[09/17/2020-17:11:40] [I] max: 18.7887 ms
[09/17/2020-17:11:40] [I] mean: 18.1339 ms
[09/17/2020-17:11:40] [I] median: 18.115 ms
[09/17/2020-17:11:40] [I] percentile: 18.7114 ms at 99%

INITIAL MODEL:
[09/17/2020-17:15:24] [I] GPU Compute
[09/17/2020-17:15:24] [I] min: 10.2354 ms
[09/17/2020-17:15:24] [I] max: 10.7039 ms
[09/17/2020-17:15:24] [I] mean: 10.423 ms
[09/17/2020-17:15:24] [I] median: 10.425 ms
[09/17/2020-17:15:24] [I] percentile: 10.6045 ms at 99%

Hi @Bozkalayci1,
Could you please share your ONNX model?

Thanks!

It was my mistake: the final feature-map sizes of the detection heads did not match between the two models. After correcting this, I got similar timings.
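
A quick way to catch this kind of mismatch is to compare the output shapes declared in the two ONNX files (a small sketch using the onnx Python package; the file names are placeholders):

import onnx

# Print each graph output's declared shape so the head sizes of the
# two exports can be compared side by side.
def output_shapes(path):
    graph = onnx.load(path).graph
    return {o.name: [d.dim_value for d in o.type.tensor_type.shape.dim]
            for o in graph.output}

print(output_shapes("initial_model.onnx"))
print(output_shapes("trained_model.onnx"))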

However, I ran into another timing issue. I compared object detection model timings on the TX1 under JetPack 3.3 and JetPack 4.4, and JetPack 3.3 gives better results: JP 4.4 is about 10% slower. What do you advise so that JP 4.4 can sustain the JP 3.3 timings?
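
One variable worth pinning down before comparing is the clock configuration: dynamic frequency scaling alone can account for a difference of this size, so the clocks can be locked to maximum before benchmarking (assuming the standard Jetson tools; the script location differs between releases):

# JetPack 4.4: lock clocks to maximum before benchmarking
sudo jetson_clocks

# JetPack 3.3: the equivalent script ships in the home directory
sudo ~/jetson_clocks.sh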

Here are the timings I got with trtexec:

WITH JP 3.3:
/usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc --output=mbox_conf_flatten --output=mbox_prior
deploy: deploy_noPP.prototxt.trt.part_0
model: deploy_noPP.caffemodel
output: mbox_loc
output: mbox_conf_flatten
output: mbox_prior
Input "data": 3x320x640
Input "prior.0": 2x19200x1
Input "prior.1": 2x4800x1
Input "prior.2": 2x1200x1
Input "prior.3": 2x360x1
Input "prior.4": 2x72x1
Output "mbox_loc": 25632x1x1
Output "mbox_conf_flatten": 12816x1x1
Output "mbox_prior": 2x25632x1
name=data, bindingIndex=0, buffers.size()=9
name=prior.0, bindingIndex=1, buffers.size()=9
name=prior.1, bindingIndex=2, buffers.size()=9
name=prior.2, bindingIndex=3, buffers.size()=9
name=prior.3, bindingIndex=4, buffers.size()=9
name=prior.4, bindingIndex=5, buffers.size()=9
name=mbox_loc, bindingIndex=6, buffers.size()=9
name=mbox_conf_flatten, bindingIndex=7, buffers.size()=9
name=mbox_prior, bindingIndex=8, buffers.size()=9
Average over 10 runs is 28.5459 ms (percentile time is 28.753).
Average over 10 runs is 28.8295 ms (percentile time is 31.462).
Average over 10 runs is 28.5278 ms (percentile time is 28.5971).
Average over 10 runs is 28.586 ms (percentile time is 28.6647).
Average over 10 runs is 28.5782 ms (percentile time is 28.6974).
Average over 10 runs is 28.5941 ms (percentile time is 28.7016).
Average over 10 runs is 28.5956 ms (percentile time is 28.6871).
Average over 10 runs is 28.5814 ms (percentile time is 28.6945).
Average over 10 runs is 28.5708 ms (percentile time is 28.6436).
Average over 10 runs is 28.5697 ms (percentile time is 28.6441).

WITH JP 4.4:
/usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc,mbox_conf_flatten,mbox_prior
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc,mbox_conf_flatten,mbox_prior
[09/25/2020-14:08:44] [I] === Model Options ===
[09/25/2020-14:08:44] [I] Format: Caffe
[09/25/2020-14:08:44] [I] Model: deploy_noPP.caffemodel
[09/25/2020-14:08:44] [I] Prototxt: deploy_noPP.prototxt.trt.part_0
[09/25/2020-14:08:44] [I] Output: mbox_loc mbox_conf_flatten mbox_prior
[09/25/2020-14:08:44] [I] === Build Options ===
[09/25/2020-14:08:44] [I] Max batch: 1
[09/25/2020-14:08:44] [I] Workspace: 16 MB
[09/25/2020-14:08:44] [I] minTiming: 1
[09/25/2020-14:08:44] [I] avgTiming: 8
[09/25/2020-14:08:44] [I] Precision: FP32
[09/25/2020-14:08:44] [I] Calibration:
[09/25/2020-14:08:44] [I] Safe mode: Disabled
[09/25/2020-14:08:44] [I] Save engine:
[09/25/2020-14:08:44] [I] Load engine:
[09/25/2020-14:08:44] [I] Builder Cache: Enabled
[09/25/2020-14:08:44] [I] NVTX verbosity: 0
[09/25/2020-14:08:44] [I] Inputs format: fp32:CHW
[09/25/2020-14:08:44] [I] Outputs format: fp32:CHW
[09/25/2020-14:08:44] [I] Input build shapes: model
[09/25/2020-14:08:44] [I] Input calibration shapes: model
[09/25/2020-14:08:44] [I] === System Options ===
[09/25/2020-14:08:44] [I] Device: 0
[09/25/2020-14:08:44] [I] DLACore:
[09/25/2020-14:08:44] [I] Plugins:
[09/25/2020-14:08:44] [I] === Inference Options ===
[09/25/2020-14:08:44] [I] Batch: 1
[09/25/2020-14:08:44] [I] Input inference shapes: model
[09/25/2020-14:08:44] [I] Iterations: 10
[09/25/2020-14:08:44] [I] Duration: 3s (+ 200ms warm up)
[09/25/2020-14:08:44] [I] Sleep time: 0ms
[09/25/2020-14:08:44] [I] Streams: 1
[09/25/2020-14:08:44] [I] ExposeDMA: Disabled
[09/25/2020-14:08:44] [I] Spin-wait: Disabled
[09/25/2020-14:08:44] [I] Multithreading: Disabled
[09/25/2020-14:08:44] [I] CUDA Graph: Disabled
[09/25/2020-14:08:44] [I] Skip inference: Disabled
[09/25/2020-14:08:44] [I] Inputs:
[09/25/2020-14:08:44] [I] === Reporting Options ===
[09/25/2020-14:08:44] [I] Verbose: Disabled
[09/25/2020-14:08:44] [I] Averages: 10 inferences
[09/25/2020-14:08:44] [I] Percentile: 99
[09/25/2020-14:08:44] [I] Dump output: Disabled
[09/25/2020-14:08:44] [I] Profile: Disabled
[09/25/2020-14:08:44] [I] Export timing to JSON file:
[09/25/2020-14:08:44] [I] Export output to JSON file:
[09/25/2020-14:08:44] [I] Export profile to JSON file:
[09/25/2020-14:08:44] [I]
[09/25/2020-14:09:58] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[09/25/2020-14:10:53] [I] [TRT] Detected 6 inputs and 13 output network tensors.
[09/25/2020-14:10:53] [I] Starting inference threads
[09/25/2020-14:10:57] [I] Warmup completed 7 queries over 200 ms
[09/25/2020-14:10:57] [I] Timing trace has 95 queries over 3.06308 s
[09/25/2020-14:10:57] [I] Trace averages of 10 runs:
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.1345 ms - Host latency: 32.4467 ms (end to end 32.4558 ms, enqueue 8.37465 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.6689 ms - Host latency: 31.9755 ms (end to end 31.9849 ms, enqueue 8.03756 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.083 ms - Host latency: 32.3894 ms (end to end 32.399 ms, enqueue 8.20562 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.8626 ms - Host latency: 32.1704 ms (end to end 32.18 ms, enqueue 8.29167 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.1569 ms - Host latency: 32.4637 ms (end to end 32.4731 ms, enqueue 8.24586 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.0583 ms - Host latency: 32.3659 ms (end to end 32.3753 ms, enqueue 8.31488 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.8006 ms - Host latency: 32.1084 ms (end to end 32.1177 ms, enqueue 8.27681 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 31.6394 ms - Host latency: 31.9441 ms (end to end 31.9534 ms, enqueue 8.35918 ms)
[09/25/2020-14:10:57] [I] Average on 10 runs - GPU latency: 32.0319 ms - Host latency: 32.3396 ms (end to end 32.3492 ms, enqueue 8.31528 ms)
[09/25/2020-14:10:57] [I] Host Latency
[09/25/2020-14:10:57] [I] min: 31.7332 ms (end to end 31.7427 ms)
[09/25/2020-14:10:57] [I] max: 36.2266 ms (end to end 36.2377 ms)
[09/25/2020-14:10:57] [I] mean: 32.233 ms (end to end 32.2423 ms)
[09/25/2020-14:10:57] [I] median: 31.9709 ms (end to end 31.9802 ms)
[09/25/2020-14:10:57] [I] percentile: 36.2266 ms at 99% (end to end 36.2377 ms at 99%)
[09/25/2020-14:10:57] [I] throughput: 31.0146 qps
[09/25/2020-14:10:57] [I] walltime: 3.06308 s
[09/25/2020-14:10:57] [I] Enqueue Time
[09/25/2020-14:10:57] [I] min: 6.81958 ms
[09/25/2020-14:10:57] [I] max: 9.20395 ms
[09/25/2020-14:10:57] [I] median: 8.27722 ms
[09/25/2020-14:10:57] [I] GPU Compute
[09/25/2020-14:10:57] [I] min: 31.4289 ms
[09/25/2020-14:10:57] [I] max: 35.9196 ms
[09/25/2020-14:10:57] [I] mean: 31.9256 ms
[09/25/2020-14:10:57] [I] median: 31.6644 ms
[09/25/2020-14:10:57] [I] percentile: 35.9196 ms at 99%
[09/25/2020-14:10:57] [I] total compute time: 3.03293 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=deploy_noPP.prototxt.trt.part_0 --model=deploy_noPP.caffemodel --output=mbox_loc,mbox_conf_flatten,mbox_prior

Hi @Bozkalayci1,
In this case, I would suggest you reach out to the Jetson team, as they will be able to help you better here.

Thanks!