Description
Benchmarking my own model with trtexec, I see no latency improvement when running with --fp16 versus without it. I can't post the model here, but this behaviour is reproducible with the mnist.onnx model provided in the TensorRT samples.
Environment
JetPack 4.5, Jetson Nano
TensorRT Version: 7.1.3
GPU Type: Nano
CUDA Version: 10.2
Relevant Files
mnist.onnx (provided in the samples) and trtexec
Steps To Reproduce
Run the following from the directory that contains the trtexec binary:
./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --fp16 --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
&&&& RUNNING TensorRT.trtexec # ./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
[01/26/2021-15:51:50] [I] === Model Options ===
[01/26/2021-15:51:50] [I] Format: ONNX
[01/26/2021-15:51:50] [I] Model: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
[01/26/2021-15:51:50] [I] Output:
[01/26/2021-15:51:50] [I] === Build Options ===
[01/26/2021-15:51:50] [I] Max batch: 1
[01/26/2021-15:51:50] [I] Workspace: 1000 MB
[01/26/2021-15:51:50] [I] minTiming: 1
[01/26/2021-15:51:50] [I] avgTiming: 8
[01/26/2021-15:51:50] [I] Precision: FP32
[01/26/2021-15:51:50] [I] Calibration:
[01/26/2021-15:51:50] [I] Safe mode: Disabled
[01/26/2021-15:51:50] [I] Save engine:
[01/26/2021-15:51:50] [I] Load engine:
[01/26/2021-15:51:50] [I] Builder Cache: Enabled
[01/26/2021-15:51:50] [I] NVTX verbosity: 0
[01/26/2021-15:51:50] [I] Inputs format: fp32:CHW
[01/26/2021-15:51:50] [I] Outputs format: fp32:CHW
[01/26/2021-15:51:50] [I] Input build shapes: model
[01/26/2021-15:51:50] [I] Input calibration shapes: model
[01/26/2021-15:51:50] [I] === System Options ===
[01/26/2021-15:51:50] [I] Device: 0
[01/26/2021-15:51:50] [I] DLACore:
[01/26/2021-15:51:50] [I] Plugins:
[01/26/2021-15:51:50] [I] === Inference Options ===
[01/26/2021-15:51:50] [I] Batch: 1
[01/26/2021-15:51:50] [I] Input inference shapes: model
[01/26/2021-15:51:50] [I] Iterations: 100
[01/26/2021-15:51:50] [I] Duration: 10s (+ 200ms warm up)
[01/26/2021-15:51:50] [I] Sleep time: 0ms
[01/26/2021-15:51:50] [I] Streams: 1
[01/26/2021-15:51:50] [I] ExposeDMA: Disabled
[01/26/2021-15:51:50] [I] Spin-wait: Disabled
[01/26/2021-15:51:50] [I] Multithreading: Disabled
[01/26/2021-15:51:50] [I] CUDA Graph: Disabled
[01/26/2021-15:51:50] [I] Skip inference: Disabled
[01/26/2021-15:51:50] [I] Inputs:
[01/26/2021-15:51:50] [I] === Reporting Options ===
[01/26/2021-15:51:50] [I] Verbose: Disabled
[01/26/2021-15:51:50] [I] Averages: 100 inferences
[01/26/2021-15:51:50] [I] Percentile: 99
[01/26/2021-15:51:50] [I] Dump output: Disabled
[01/26/2021-15:51:50] [I] Profile: Disabled
[01/26/2021-15:51:50] [I] Export timing to JSON file:
[01/26/2021-15:51:50] [I] Export output to JSON file:
[01/26/2021-15:51:50] [I] Export profile to JSON file:
[01/26/2021-15:51:50] [I]
----------------------------------------------------------------
Input filename: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
ONNX IR version: 0.0.3
Opset version: 8
Producer name: CNTK
Producer version: 2.5.1
Domain: ai.cntk
Model version: 1
Doc string:
----------------------------------------------------------------
[01/26/2021-15:51:52] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/26/2021-15:51:58] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[01/26/2021-15:51:58] [I] Starting inference threads
[01/26/2021-15:52:08] [I] Warmup completed 397 queries over 200 ms
[01/26/2021-15:52:08] [I] Timing trace has 24229 queries over 10.001 s
[01/26/2021-15:52:08] [I] Trace averages of 100 runs:
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.269865 ms - Host latency: 0.368274 ms (end to end 0.419196 ms, enqueue 0.248219 ms)
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.234758 ms - Host latency: 0.323429 ms (end to end 0.372553 ms, enqueue 0.215886 ms)
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.234066 ms - Host latency: 0.321845 ms (end to end 0.370005 ms, enqueue 0.215288 ms)
[01/26/2021-15:52:08] [I] Average on 100 runs - GPU latency: 0.233953 ms - Host latency: 0.321686 ms (end to end 0.370298 ms, enqueue 0.215295 ms)
...
[01/26/2021-15:52:08] [I] Host Latency
[01/26/2021-15:52:08] [I] min: 0.1073 ms (end to end 0.111084 ms)
[01/26/2021-15:52:08] [I] max: 1.55615 ms (end to end 1.60938 ms)
[01/26/2021-15:52:08] [I] mean: 0.322356 ms (end to end 0.370994 ms)
[01/26/2021-15:52:08] [I] median: 0.319336 ms (end to end 0.367676 ms)
[01/26/2021-15:52:08] [I] percentile: 0.363281 ms at 99% (end to end 0.414795 ms at 99%)
[01/26/2021-15:52:08] [I] throughput: 2422.67 qps
[01/26/2021-15:52:08] [I] walltime: 10.001 s
[01/26/2021-15:52:08] [I] Enqueue Time
[01/26/2021-15:52:08] [I] min: 0.189636 ms
[01/26/2021-15:52:08] [I] max: 1.43994 ms
[01/26/2021-15:52:08] [I] median: 0.213867 ms
[01/26/2021-15:52:08] [I] GPU Compute
[01/26/2021-15:52:08] [I] min: 0.104736 ms
[01/26/2021-15:52:08] [I] max: 1.4624 ms
[01/26/2021-15:52:08] [I] mean: 0.234678 ms
[01/26/2021-15:52:08] [I] median: 0.232422 ms
[01/26/2021-15:52:08] [I] percentile: 0.266113 ms at 99%
[01/26/2021-15:52:08] [I] total compute time: 5.68601 s
&&&& PASSED
This gives a GPU Compute median of 0.232422 ms.
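For anyone re-running this, the median can be pulled out of a saved copy of the console output with standard tools. This is a sketch: the log file below is recreated from the lines pasted above, and in a real run you would redirect the trtexec output into that file instead (the path is illustrative).

```shell
# Recreate the relevant summary lines from the run above; in practice,
# run `./trtexec ... 2>&1 | tee /tmp/trtexec_fp16.log` to capture them.
cat > /tmp/trtexec_fp16.log <<'EOF'
[01/26/2021-15:52:08] [I] GPU Compute
[01/26/2021-15:52:08] [I] min: 0.104736 ms
[01/26/2021-15:52:08] [I] max: 1.4624 ms
[01/26/2021-15:52:08] [I] mean: 0.234678 ms
[01/26/2021-15:52:08] [I] median: 0.232422 ms
EOF
# Grab the GPU Compute block and print just the median value.
grep -A4 'GPU Compute' /tmp/trtexec_fp16.log | awk '/median:/ {print $4}'
```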
Now running without --fp16:
./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
And I get the following output:
&&&& RUNNING TensorRT.trtexec # ./trtexec --onnx=/home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx --maxBatch=1 --batch=1 --workspace=1000 --iterations=100 --avgRuns=100 --duration=10
[01/26/2021-16:02:28] [I] === Model Options ===
[01/26/2021-16:02:28] [I] Format: ONNX
[01/26/2021-16:02:28] [I] Model: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
[01/26/2021-16:02:28] [I] Output:
[01/26/2021-16:02:28] [I] === Build Options ===
[01/26/2021-16:02:28] [I] Max batch: 1
[01/26/2021-16:02:28] [I] Workspace: 1000 MB
[01/26/2021-16:02:28] [I] minTiming: 1
[01/26/2021-16:02:28] [I] avgTiming: 8
[01/26/2021-16:02:28] [I] Precision: FP32
[01/26/2021-16:02:28] [I] Calibration:
[01/26/2021-16:02:28] [I] Safe mode: Disabled
[01/26/2021-16:02:28] [I] Save engine:
[01/26/2021-16:02:28] [I] Load engine:
[01/26/2021-16:02:28] [I] Builder Cache: Enabled
[01/26/2021-16:02:28] [I] NVTX verbosity: 0
[01/26/2021-16:02:28] [I] Inputs format: fp32:CHW
[01/26/2021-16:02:28] [I] Outputs format: fp32:CHW
[01/26/2021-16:02:28] [I] Input build shapes: model
[01/26/2021-16:02:28] [I] Input calibration shapes: model
[01/26/2021-16:02:28] [I] === System Options ===
[01/26/2021-16:02:28] [I] Device: 0
[01/26/2021-16:02:28] [I] DLACore:
[01/26/2021-16:02:28] [I] Plugins:
[01/26/2021-16:02:28] [I] === Inference Options ===
[01/26/2021-16:02:28] [I] Batch: 1
[01/26/2021-16:02:28] [I] Input inference shapes: model
[01/26/2021-16:02:28] [I] Iterations: 100
[01/26/2021-16:02:28] [I] Duration: 10s (+ 200ms warm up)
[01/26/2021-16:02:28] [I] Sleep time: 0ms
[01/26/2021-16:02:28] [I] Streams: 1
[01/26/2021-16:02:28] [I] ExposeDMA: Disabled
[01/26/2021-16:02:28] [I] Spin-wait: Disabled
[01/26/2021-16:02:28] [I] Multithreading: Disabled
[01/26/2021-16:02:28] [I] CUDA Graph: Disabled
[01/26/2021-16:02:28] [I] Skip inference: Disabled
[01/26/2021-16:02:28] [I] Inputs:
[01/26/2021-16:02:28] [I] === Reporting Options ===
[01/26/2021-16:02:28] [I] Verbose: Disabled
[01/26/2021-16:02:28] [I] Averages: 100 inferences
[01/26/2021-16:02:28] [I] Percentile: 99
[01/26/2021-16:02:28] [I] Dump output: Disabled
[01/26/2021-16:02:28] [I] Profile: Disabled
[01/26/2021-16:02:28] [I] Export timing to JSON file:
[01/26/2021-16:02:28] [I] Export output to JSON file:
[01/26/2021-16:02:28] [I] Export profile to JSON file:
[01/26/2021-16:02:28] [I]
----------------------------------------------------------------
Input filename: /home/YOURUSER/Documents/tensorrt/data/mnist/mnist.onnx
ONNX IR version: 0.0.3
Opset version: 8
Producer name: CNTK
Producer version: 2.5.1
Domain: ai.cntk
Model version: 1
Doc string:
----------------------------------------------------------------
[01/26/2021-16:02:29] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/26/2021-16:02:35] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[01/26/2021-16:02:35] [I] Starting inference threads
[01/26/2021-16:02:45] [I] Warmup completed 406 queries over 200 ms
[01/26/2021-16:02:45] [I] Timing trace has 24236 queries over 10.0007 s
[01/26/2021-16:02:45] [I] Trace averages of 100 runs:
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.236333 ms - Host latency: 0.325263 ms (end to end 0.373533 ms, enqueue 0.217607 ms)
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.234335 ms - Host latency: 0.32195 ms (end to end 0.370292 ms, enqueue 0.215509 ms)
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.235222 ms - Host latency: 0.322858 ms (end to end 0.371908 ms, enqueue 0.216229 ms)
[01/26/2021-16:02:45] [I] Average on 100 runs - GPU latency: 0.235175 ms - Host latency: 0.323139 ms (end to end 0.371768 ms, enqueue 0.216352 ms)
...
[01/26/2021-16:02:45] [I] Host Latency
[01/26/2021-16:02:45] [I] min: 0.298828 ms (end to end 0.338867 ms)
[01/26/2021-16:02:45] [I] max: 0.44873 ms (end to end 0.520752 ms)
[01/26/2021-16:02:45] [I] mean: 0.322743 ms (end to end 0.37141 ms)
[01/26/2021-16:02:45] [I] median: 0.31958 ms (end to end 0.368164 ms)
[01/26/2021-16:02:45] [I] percentile: 0.35791 ms at 99% (end to end 0.409668 ms at 99%)
[01/26/2021-16:02:45] [I] throughput: 2423.43 qps
[01/26/2021-16:02:45] [I] walltime: 10.0007 s
[01/26/2021-16:02:45] [I] Enqueue Time
[01/26/2021-16:02:45] [I] min: 0.191895 ms
[01/26/2021-16:02:45] [I] max: 0.335693 ms
[01/26/2021-16:02:45] [I] median: 0.213867 ms
[01/26/2021-16:02:45] [I] GPU Compute
[01/26/2021-16:02:45] [I] min: 0.210449 ms
[01/26/2021-16:02:45] [I] max: 0.355957 ms
[01/26/2021-16:02:45] [I] mean: 0.234587 ms
[01/26/2021-16:02:45] [I] median: 0.232666 ms
[01/26/2021-16:02:45] [I] percentile: 0.26416 ms at 99%
[01/26/2021-16:02:45] [I] total compute time: 5.68545 s
&&&& PASSED
This gives a GPU Compute median of 0.232666 ms.
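The two medians above can be compared directly. A minimal sketch, using the numbers reported by the two runs; a working FP16 path would normally give a ratio noticeably above 1.0, whereas here it is essentially 1.0:

```shell
# fp32 and fp16 are the GPU Compute medians from the two runs above.
awk 'BEGIN { fp32 = 0.232666; fp16 = 0.232422; printf "fp32/fp16 speedup: %.4fx\n", fp32 / fp16 }'
```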
There is no performance gain. Granted, with a model this small the difference may simply not be noticeable, but the custom model I tried shows the same issue: no FP16 speedup over FP32, with an average inference time of ~42 ms in both cases. This is not the case on the Xavier or the TX2.
I'll try to test with a bigger model, but do you have any insights into this?
Kind regards