I am also getting a mean latency of 220.275 ms (end to end 220.341 ms) when I run:

/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN# /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
[10/07/2020-00:32:15] [I] === Model Options ===
[10/07/2020-00:32:15] [I] Format: *
[10/07/2020-00:32:15] [I] Model:
[10/07/2020-00:32:15] [I] Output:
[10/07/2020-00:32:15] [I] === Build Options ===
[10/07/2020-00:32:15] [I] Max batch: 1
[10/07/2020-00:32:15] [I] Workspace: 16 MB
[10/07/2020-00:32:15] [I] minTiming: 1
[10/07/2020-00:32:15] [I] avgTiming: 8
[10/07/2020-00:32:15] [I] Precision: FP32+INT8
[10/07/2020-00:32:15] [I] Calibration: Dynamic
[10/07/2020-00:32:15] [I] Safe mode: Disabled
[10/07/2020-00:32:15] [I] Save engine:
[10/07/2020-00:32:15] [I] Load engine: VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine
[10/07/2020-00:32:15] [I] Builder Cache: Enabled
[10/07/2020-00:32:15] [I] NVTX verbosity: 0
[10/07/2020-00:32:15] [I] Inputs format: fp32:CHW
[10/07/2020-00:32:15] [I] Outputs format: fp32:CHW
[10/07/2020-00:32:15] [I] Input build shapes: model
[10/07/2020-00:32:15] [I] Input calibration shapes: model
[10/07/2020-00:32:15] [I] === System Options ===
[10/07/2020-00:32:15] [I] Device: 0
[10/07/2020-00:32:15] [I] DLACore:
[10/07/2020-00:32:15] [I] Plugins:
[10/07/2020-00:32:15] [I] === Inference Options ===
[10/07/2020-00:32:15] [I] Batch: 1
[10/07/2020-00:32:15] [I] Input inference shapes: model
[10/07/2020-00:32:15] [I] Iterations: 10
[10/07/2020-00:32:15] [I] Duration: 3s (+ 200ms warm up)
[10/07/2020-00:32:15] [I] Sleep time: 0ms
[10/07/2020-00:32:15] [I] Streams: 1
[10/07/2020-00:32:15] [I] ExposeDMA: Disabled
[10/07/2020-00:32:15] [I] Spin-wait: Enabled
[10/07/2020-00:32:15] [I] Multithreading: Disabled
[10/07/2020-00:32:15] [I] CUDA Graph: Disabled
[10/07/2020-00:32:15] [I] Skip inference: Disabled
[10/07/2020-00:32:15] [I] Inputs:
[10/07/2020-00:32:15] [I] === Reporting Options ===
[10/07/2020-00:32:15] [I] Verbose: Disabled
[10/07/2020-00:32:15] [I] Averages: 10 inferences
[10/07/2020-00:32:15] [I] Percentile: 99
[10/07/2020-00:32:15] [I] Dump output: Disabled
[10/07/2020-00:32:15] [I] Profile: Disabled
[10/07/2020-00:32:15] [I] Export timing to JSON file:
[10/07/2020-00:32:15] [I] Export output to JSON file:
[10/07/2020-00:32:15] [I] Export profile to JSON file:
[10/07/2020-00:32:15] [I]
[10/07/2020-00:32:24] [I] Starting inference threads
[10/07/2020-00:32:28] [I] Warmup completed 1 queries over 200 ms
[10/07/2020-00:32:28] [I] Timing trace has 16 queries over 3.52547 s
[10/07/2020-00:32:28] [I] Trace averages of 10 runs:
[10/07/2020-00:32:28] [I] Average on 10 runs - GPU latency: 218.618 ms - Host latency: 218.739 ms (end to end 218.838 ms, enqueue 1.46477 ms)
[10/07/2020-00:32:28] [I] Host Latency
[10/07/2020-00:32:28] [I] min: 212.803 ms (end to end 212.809 ms)
[10/07/2020-00:32:28] [I] max: 236.222 ms (end to end 236.231 ms)
[10/07/2020-00:32:28] [I] mean: 220.275 ms (end to end 220.341 ms)
[10/07/2020-00:32:28] [I] median: 219.229 ms (end to end 219.358 ms)
[10/07/2020-00:32:28] [I] percentile: 236.222 ms at 99% (end to end 236.231 ms at 99%)
[10/07/2020-00:32:28] [I] throughput: 4.5384 qps
[10/07/2020-00:32:28] [I] walltime: 3.52547 s
[10/07/2020-00:32:28] [I] Enqueue Time
[10/07/2020-00:32:28] [I] min: 1.19866 ms
[10/07/2020-00:32:28] [I] max: 2.07568 ms
[10/07/2020-00:32:28] [I] median: 1.56665 ms
[10/07/2020-00:32:28] [I] GPU Compute
[10/07/2020-00:32:28] [I] min: 212.686 ms
[10/07/2020-00:32:28] [I] max: 236.108 ms
[10/07/2020-00:32:28] [I] mean: 220.156 ms
[10/07/2020-00:32:28] [I] median: 219.107 ms
[10/07/2020-00:32:28] [I] percentile: 236.108 ms at 99%
[10/07/2020-00:32:28] [I] total compute time: 3.52249 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
So is what I am getting the normal speed for this model?
Please note that the inference precision is decided at build time. In your trtexec test, the model is running in FP32 rather than INT8: the loaded engine was built in FP32 (as its _fp32 file name suggests), so the --int8 flag has no effect when combined with --loadEngine.
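To measure INT8 performance you need to build an INT8 engine first. As a rough sketch of the generic trtexec flow (the prototxt and calibration cache file names below are placeholders, and this Faster R-CNN model also needs its custom plugin library loaded via --plugins, so building it directly through trtexec may not work out of the box):

/usr/src/tensorrt/bin/trtexec \
    --deploy=faster_rcnn_test.prototxt \
    --model=VGG16_faster_rcnn_final.caffemodel \
    --output=bbox_pred,cls_prob,rois \
    --int8 --calib=cal_trt.bin \
    --batch=1 \
    --saveEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_int8.engine

For the DeepStream sample, the usual route is to let nvinfer rebuild the engine: in the objectDetector_FasterRCNN config, set network-mode=1 (INT8) and point int8-calib-file at a calibration cache, then remove the old _fp32.engine so it gets regenerated. Once you have an INT8 engine, rerun the timing against it:

/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_int8.engine --batch=1 --useSpinWait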