Description
Benchmarking my own model with trtexec, I see no latency improvement when running with --fp16 versus without it. I can't post the model here, but this behaviour is reproducible with the mnist.onnx model provided in the TensorRT samples.
Environment
JetPack 4.5, Jetson Nano
TensorRT Version: 7.1.3
GPU Type: Nano
CUDA Version: 10.2
Relevant Files
mnist.onnx (provided in the samples) and trtexec
Steps To Reproduce
Run the following from the directory that contains the trtexec binary:
./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --fp16 --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
&&&& RUNNING TensorRT.trtexec # ./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
[01/26/2021-15:51:50] [I] === Model Options ===
[01/26/2021-15:51:50] [I] Format: ONNX
[01/26/2021-15:51:50] [I] Model: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
[01/26/2021-15:51:50] [I] Output:
[01/26/2021-15:51:50] [I] === Build Options ===
[01/26/2021-15:51:50] [I] Max batch: 1
[01/26/2021-15:51:50] [I] Workspace: 1000 MB
[01/26/2021-15:51:50] [I] minTiming: 1
[01/26/2021-15:51:50] [I] avgTiming: 8
[01/26/2021-15:51:50] [I] Precision: FP32
[01/26/2021-15:51:50] [I] Calibration:
[01/26/2021-15:51:50] [I] Safe mode: Disabled
[01/26/2021-15:51:50] [I] Save engine:
[01/26/2021-15:51:50] [I] Load engine:
[01/26/2021-15:51:50] [I] Builder Cache: Enabled
[01/26/2021-15:51:50] [I] NVTX verbosity: 0
[01/26/2021-15:51:50] [I] Inputs format: fp32:CHW
[01/26/2021-15:51:50] [I] Outputs format: fp32:CHW
[01/26/2021-15:51:50] [I] Input build shapes: model
[01/26/2021-15:51:50] [I] Input calibration shapes: model
[01/26/2021-15:51:50] [I] === System Options ===
[01/26/2021-15:51:50] [I] Device: 0
[01/26/2021-15:51:50] [I] DLACore:
[01/26/2021-15:51:50] [I] Plugins:
[01/26/2021-15:51:50] [I] === Inference Options ===
[01/26/2021-15:51:50] [I] Batch: 1
[01/26/2021-15:51:50] [I] Input inference shapes: model
[01/26/2021-15:51:50] [I] Iterations: 100
[01/26/2021-15:51:50] [I] Duration: 10s (+ 200ms warm up)
[01/26/2021-15:51:50] [I] Sleep time: 0ms
[01/26/2021-15:51:50] [I] Streams: 1
[01/26/2021-15:51:50] [I] ExposeDMA: Disabled
[01/26/2021-15:51:50] [I] Spin-wait: Disabled
[01/26/2021-15:51:50] [I] Multithreading: Disabled
[01/26/2021-15:51:50] [I] CUDA Graph: Disabled
[01/26/2021-15:51:50] [I] Skip inference: Disabled
[01/26/2021-15:51:50] [I] Inputs:
[01/26/2021-15:51:50] [I] === Reporting Options ===
[01/26/2021-15:51:50] [I] Verbose: Disabled
[01/26/2021-15:51:50] [I] Averages: 100 inferences
[01/26/2021-15:51:50] [I] Percentile: 99
[01/26/2021-15:51:50] [I] Dump output: Disabled
[01/26/2021-15:51:50] [I] Profile: Disabled
[01/26/2021-15:51:50] [I] Export timing to JSON file:
[01/26/2021-15:51:50] [I] Export output to JSON file:
[01/26/2021-15:51:50] [I] Export profile to JSON file:
[01/26/2021-15:51:50] [I]
----------------------------------------------------------------
Input filename: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
ONNX IR version: 0.0.3
Opset version: 8
Producer name: CNTK
Producer version: 2.5.1
Domain: ai.cntk
Model version: 1
Doc string:
----------------------------------------------------------------
[01/26/2021-15:51:52] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/26/2021-15:51:58] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[01/26/2021-15:51:58] [I] Starting inference threads
[01/26/2021-15:52:08] [I] Warmup completed 397 queries over 200 ms
[01/26/2021-15:52:08] [I] Timing trace has 24229 queries over 10.001 s
[01/26/2021-15:52:08] [I] Trace averages of 100 runs:
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.269865 ms - Host latency: 0.368274 ms (end to end 0.419196 ms, enqueue 0.248219 ms)
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.234758 ms - Host latency: 0.323429 ms (end to end 0.372553 ms, enqueue 0.215886 ms)
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.234066 ms - Host latency: 0.321845 ms (end to end 0.370005 ms, enqueue 0.215288 ms)
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.233953 ms - Host latency: 0.321686 ms (end to end 0.370298 ms, enqueue 0.215295 ms)
...
[01/26/2021-15:52:08] [I] Host Latency
[01/26/2021-15:52:08] [I] min: 0.1073 ms (end to end 0.111084 ms)
[01/26/2021-15:52:08] [I] max: 1.55615 ms (end to end 1.60938 ms)
[01/26/2021-15:52:08] [I] mean: 0.322356 ms (end to end 0.370994 ms)
[01/26/2021-15:52:08] [I] median: 0.319336 ms (end to end 0.367676 ms)
[01/26/2021-15:52:08] [I] percentile: 0.363281 ms at 99% (end to end 0.414795 ms at 99%)
[01/26/2021-15:52:08] [I] throughput: 2422.67 qps
[01/26/2021-15:52:08] [I] walltime: 10.001 s
[01/26/2021-15:52:08] [I] Enqueue Time
[01/26/2021-15:52:08] [I] min: 0.189636 ms
[01/26/2021-15:52:08] [I] max: 1.43994 ms
[01/26/2021-15:52:08] [I] median: 0.213867 ms
[01/26/2021-15:52:08] [I] GPU Compute
[01/26/2021-15:52:08] [I] min: 0.104736 ms
[01/26/2021-15:52:08] [I] max: 1.4624 ms
[01/26/2021-15:52:08] [I] mean: 0.234678 ms
[01/26/2021-15:52:08] [I] median: 0.232422 ms
[01/26/2021-15:52:08] [I] percentile: 0.266113 ms at 99%
[01/26/2021-15:52:08] [I] total compute time: 5.68601 s
&&&& PASSED
This gives a GPU Compute median of 0.232422 ms.
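For anyone re-running this, the median can be pulled out of a saved copy of the console output with standard tools. This is a sketch: the log file below is recreated from the lines pasted above, and in a real run you would redirect the trtexec output into that file instead (the path is illustrative).

```shell
# Recreate the relevant summary lines from the run above; in practice,
# run `./trtexec ... 2>&1 | tee /tmp/trtexec_fp16.log` to capture them.
cat > /tmp/trtexec_fp16.log <<'EOF'
[01/26/2021-15:52:08] [I] GPU Compute
[01/26/2021-15:52:08] [I] min: 0.104736 ms
[01/26/2021-15:52:08] [I] max: 1.4624 ms
[01/26/2021-15:52:08] [I] mean: 0.234678 ms
[01/26/2021-15:52:08] [I] median: 0.232422 ms
EOF
# Grab the GPU Compute block and print just the median value.
grep -A4 'GPU Compute' /tmp/trtexec_fp16.log | awk '/median:/ {print $4}'
```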
Now running without --fp16:
./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
And I get the following output:
&&&& RUNNING TensorRT.trtexec # ./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
[01/26/2021-16:02:28] [I] === Model Options ===
[01/26/2021-16:02:28] [I] Format: ONNX
[01/26/2021-16:02:28] [I] Model: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
[01/26/2021-16:02:28] [I] Output:
[01/26/2021-16:02:28] [I] === Build Options ===
[01/26/2021-16:02:28] [I] Max batch: 1
[01/26/2021-16:02:28] [I] Workspace: 1000 MB
[01/26/2021-16:02:28] [I] minTiming: 1
[01/26/2021-16:02:28] [I] avgTiming: 8
[01/26/2021-16:02:28] [I] Precision: FP32
[01/26/2021-16:02:28] [I] Calibration:
[01/26/2021-16:02:28] [I] Safe mode: Disabled
[01/26/2021-16:02:28] [I] Save engine:
[01/26/2021-16:02:28] [I] Load engine:
[01/26/2021-16:02:28] [I] Builder Cache: Enabled
[01/26/2021-16:02:28] [I] NVTX verbosity: 0
[01/26/2021-16:02:28] [I] Inputs format: fp32:CHW
[01/26/2021-16:02:28] [I] Outputs format: fp32:CHW
[01/26/2021-16:02:28] [I] Input build shapes: model
[01/26/2021-16:02:28] [I] Input calibration shapes: model
[01/26/2021-16:02:28] [I] === System Options ===
[01/26/2021-16:02:28] [I] Device: 0
[01/26/2021-16:02:28] [I] DLACore:
[01/26/2021-16:02:28] [I] Plugins:
[01/26/2021-16:02:28] [I] === Inference Options ===
[01/26/2021-16:02:28] [I] Batch: 1
[01/26/2021-16:02:28] [I] Input inference shapes: model
[01/26/2021-16:02:28] [I] Iterations: 100
[01/26/2021-16:02:28] [I] Duration: 10s (+ 200ms warm up)
[01/26/2021-16:02:28] [I] Sleep time: 0ms
[01/26/2021-16:02:28] [I] Streams: 1
[01/26/2021-16:02:28] [I] ExposeDMA: Disabled
[01/26/2021-16:02:28] [I] Spin-wait: Disabled
[01/26/2021-16:02:28] [I] Multithreading: Disabled
[01/26/2021-16:02:28] [I] CUDA Graph: Disabled
[01/26/2021-16:02:28] [I] Skip inference: Disabled
[01/26/2021-16:02:28] [I] Inputs:
[01/26/2021-16:02:28] [I] === Reporting Options ===
[01/26/2021-16:02:28] [I] Verbose: Disabled
[01/26/2021-16:02:28] [I] Averages: 100 inferences
[01/26/2021-16:02:28] [I] Percentile: 99
[01/26/2021-16:02:28] [I] Dump output: Disabled
[01/26/2021-16:02:28] [I] Profile: Disabled
[01/26/2021-16:02:28] [I] Export timing to JSON file:
[01/26/2021-16:02:28] [I] Export output to JSON file:
[01/26/2021-16:02:28] [I] Export profile to JSON file:
[01/26/2021-16:02:28] [I]
----------------------------------------------------------------
Input filename: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
ONNX IR version: 0.0.3
Opset version: 8
Producer name: CNTK
Producer version: 2.5.1
Domain: ai.cntk
Model version: 1
Doc string:
----------------------------------------------------------------
[01/26/2021-16:02:29] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/26/2021-16:02:35] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[01/26/2021-16:02:35] [I] Starting inference threads
[01/26/2021-16:02:45] [I] Warmup completed 406 queries over 200 ms
[01/26/2021-16:02:45] [I] Timing trace has 24236 queries over 10.0007 s
[01/26/2021-16:02:45] [I] Trace averages of 100 runs:
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.236333 ms - Host latency: 0.325263 ms (end to end 0.373533 ms, enqueue 0.217607 ms)
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.234335 ms - Host latency: 0.32195 ms (end to end 0.370292 ms, enqueue 0.215509 ms)
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.235222 ms - Host latency: 0.322858 ms (end to end 0.371908 ms, enqueue 0.216229 ms)
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.235175 ms - Host latency: 0.323139 ms (end to end 0.371768 ms, enqueue 0.216352 ms)
...
[01/26/2021-16:02:45] [I] Host Latency
[01/26/2021-16:02:45] [I] min: 0.298828 ms (end to end 0.338867 ms)
[01/26/2021-16:02:45] [I] max: 0.44873 ms (end to end 0.520752 ms)
[01/26/2021-16:02:45] [I] mean: 0.322743 ms (end to end 0.37141 ms)
[01/26/2021-16:02:45] [I] median: 0.31958 ms (end to end 0.368164 ms)
[01/26/2021-16:02:45] [I] percentile: 0.35791 ms at 99% (end to end 0.409668 ms at 99%)
[01/26/2021-16:02:45] [I] throughput: 2423.43 qps
[01/26/2021-16:02:45] [I] walltime: 10.0007 s
[01/26/2021-16:02:45] [I] Enqueue Time
[01/26/2021-16:02:45] [I] min: 0.191895 ms
[01/26/2021-16:02:45] [I] max: 0.335693 ms
[01/26/2021-16:02:45] [I] median: 0.213867 ms
[01/26/2021-16:02:45] [I] GPU Compute
[01/26/2021-16:02:45] [I] min: 0.210449 ms
[01/26/2021-16:02:45] [I] max: 0.355957 ms
[01/26/2021-16:02:45] [I] mean: 0.234587 ms
[01/26/2021-16:02:45] [I] median: 0.232666 ms
[01/26/2021-16:02:45] [I] percentile: 0.26416 ms at 99%
[01/26/2021-16:02:45] [I] total compute time: 5.68545 s
&&&& PASSED
This gives a GPU Compute median of 0.232666 ms.
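The two medians above can be compared directly. A minimal sketch, using the numbers reported by the two runs; a working FP16 path would normally give a ratio noticeably above 1.0, whereas here it is essentially 1.0:

```shell
# fp32 and fp16 are the GPU Compute medians from the two runs above.
awk 'BEGIN { fp32 = 0.232666; fp16 = 0.232422; printf "fp32/fp16 speedup: %.4fx\n", fp32 / fp16 }'
```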
There is no performance gain. Granted, with a model this small the difference may simply not be noticeable, but the custom model I tried shows the same issue: no FP16 speedup over FP32, with an average inference time of ~42 ms in both cases. This is not the case on the Xavier or the TX2.
I'll try to test with a bigger model, but do you have any insights into this?
Kind regards