Hi,
There are two kinds of pooling: exclusive pooling and inclusive pooling.
You can find the details in our document below:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#pooling-layer
By default, average pooling is performed on the overlap between the pooling window and the padded input. If the exclusive parameter is set to true, average pooling is performed only on the overlap between the pooling window and the unpadded input.
The default value of count_include_pad in ONNX is 0, i.e. exclusive pooling: onnx/Operators.md at main · onnx/onnx · GitHub
To make the layer work on DLA, count_include_pad needs to be set to 1 for inclusive pooling.
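To make the difference concrete, here is a small standalone NumPy illustration (not from the attachment) of how the two modes divide the same padded window:

```python
import numpy as np

# Exclusive vs. inclusive average pooling, computed by hand for the
# top-left window of a 2x2 pool with padding of 1 on every side.
x = np.arange(1, 5, dtype=np.float32).reshape(2, 2)  # [[1, 2], [3, 4]]
padded = np.pad(x, 1)  # zero-pad 1 element on each border -> 4x4

# The top-left 2x2 window covers three pad zeros and one real value, x[0,0]=1.
window = padded[0:2, 0:2]
inclusive = window.sum() / 4  # divide by the full window size (count_include_pad=1)
exclusive = window.sum() / 1  # divide by the count of unpadded elements only
print(inclusive, exclusive)   # inclusive is 0.25, exclusive is 1.0
```

Inclusive pooling always divides by the kernel size, so padded borders pull the average toward zero; exclusive pooling divides by how many real input elements the window actually covers.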
This can be done with ONNX GraphSurgeon.
Please install it from here first.
Then try the attached script: set_count_include_pad.py (302 Bytes)
$ python3 set_count_include_pad.py
$ /usr/src/tensorrt/bin/trtexec --onnx=updated_resize_Avg_pool.onnx --useDLACore=0
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=updated_resize_Avg_pool.onnx --useDLACore=0
[07/12/2022-12:48:13] [I] === Model Options ===
[07/12/2022-12:48:13] [I] Format: ONNX
[07/12/2022-12:48:13] [I] Model: updated_resize_Avg_pool.onnx
[07/12/2022-12:48:13] [I] Output:
[07/12/2022-12:48:13] [I] === Build Options ===
[07/12/2022-12:48:13] [I] Max batch: explicit batch
[07/12/2022-12:48:13] [I] Workspace: 16 MiB
[07/12/2022-12:48:13] [I] minTiming: 1
[07/12/2022-12:48:13] [I] avgTiming: 8
[07/12/2022-12:48:13] [I] Precision: FP32
[07/12/2022-12:48:13] [I] Calibration:
[07/12/2022-12:48:13] [I] Refit: Disabled
[07/12/2022-12:48:13] [I] Sparsity: Disabled
[07/12/2022-12:48:13] [I] Safe mode: Disabled
[07/12/2022-12:48:13] [I] DirectIO mode: Disabled
[07/12/2022-12:48:13] [I] Restricted mode: Disabled
[07/12/2022-12:48:13] [I] Save engine:
[07/12/2022-12:48:13] [I] Load engine:
[07/12/2022-12:48:13] [I] Profiling verbosity: 0
[07/12/2022-12:48:13] [I] Tactic sources: Using default tactic sources
[07/12/2022-12:48:13] [I] timingCacheMode: local
[07/12/2022-12:48:13] [I] timingCacheFile:
[07/12/2022-12:48:13] [I] Input(s)s format: fp32:CHW
[07/12/2022-12:48:13] [I] Output(s)s format: fp32:CHW
[07/12/2022-12:48:13] [I] Input build shapes: model
[07/12/2022-12:48:13] [I] Input calibration shapes: model
[07/12/2022-12:48:13] [I] === System Options ===
[07/12/2022-12:48:13] [I] Device: 0
[07/12/2022-12:48:13] [I] DLACore: 0
[07/12/2022-12:48:13] [I] Plugins:
[07/12/2022-12:48:13] [I] === Inference Options ===
[07/12/2022-12:48:13] [I] Batch: Explicit
[07/12/2022-12:48:13] [I] Input inference shapes: model
[07/12/2022-12:48:13] [I] Iterations: 10
[07/12/2022-12:48:13] [I] Duration: 3s (+ 200ms warm up)
[07/12/2022-12:48:13] [I] Sleep time: 0ms
[07/12/2022-12:48:13] [I] Idle time: 0ms
[07/12/2022-12:48:13] [I] Streams: 1
[07/12/2022-12:48:13] [I] ExposeDMA: Disabled
[07/12/2022-12:48:13] [I] Data transfers: Enabled
[07/12/2022-12:48:13] [I] Spin-wait: Disabled
[07/12/2022-12:48:13] [I] Multithreading: Disabled
[07/12/2022-12:48:13] [I] CUDA Graph: Disabled
[07/12/2022-12:48:13] [I] Separate profiling: Disabled
[07/12/2022-12:48:13] [I] Time Deserialize: Disabled
[07/12/2022-12:48:13] [I] Time Refit: Disabled
[07/12/2022-12:48:13] [I] Skip inference: Disabled
[07/12/2022-12:48:13] [I] Inputs:
[07/12/2022-12:48:13] [I] === Reporting Options ===
[07/12/2022-12:48:13] [I] Verbose: Disabled
[07/12/2022-12:48:13] [I] Averages: 10 inferences
[07/12/2022-12:48:13] [I] Percentile: 99
[07/12/2022-12:48:13] [I] Dump refittable layers:Disabled
[07/12/2022-12:48:13] [I] Dump output: Disabled
[07/12/2022-12:48:13] [I] Profile: Disabled
[07/12/2022-12:48:13] [I] Export timing to JSON file:
[07/12/2022-12:48:13] [I] Export output to JSON file:
[07/12/2022-12:48:13] [I] Export profile to JSON file:
[07/12/2022-12:48:13] [I]
[07/12/2022-12:48:13] [I] === Device Information ===
[07/12/2022-12:48:13] [I] Selected Device: Xavier
[07/12/2022-12:48:13] [I] Compute Capability: 7.2
[07/12/2022-12:48:13] [I] SMs: 8
[07/12/2022-12:48:13] [I] Compute Clock Rate: 1.377 GHz
[07/12/2022-12:48:13] [I] Device Global Memory: 31920 MiB
[07/12/2022-12:48:13] [I] Shared Memory per SM: 96 KiB
[07/12/2022-12:48:13] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/12/2022-12:48:13] [I] Memory Clock Rate: 1.377 GHz
[07/12/2022-12:48:13] [I]
[07/12/2022-12:48:13] [I] TensorRT version: 8.2.1
[07/12/2022-12:48:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +363, GPU +0, now: CPU 381, GPU 7291 (MiB)
[07/12/2022-12:48:14] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 381 MiB, GPU 7291 MiB
[07/12/2022-12:48:14] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 486 MiB, GPU 7396 MiB
[07/12/2022-12:48:14] [I] Start parsing network model
[07/12/2022-12:48:14] [I] [TRT] ----------------------------------------------------------------
[07/12/2022-12:48:14] [I] [TRT] Input filename: updated_resize_Avg_pool.onnx
[07/12/2022-12:48:14] [I] [TRT] ONNX IR version: 0.0.7
[07/12/2022-12:48:14] [I] [TRT] Opset version: 11
[07/12/2022-12:48:14] [I] [TRT] Producer name: onnx-example
[07/12/2022-12:48:14] [I] [TRT] Producer version:
[07/12/2022-12:48:14] [I] [TRT] Domain:
[07/12/2022-12:48:14] [I] [TRT] Model version: 0
[07/12/2022-12:48:14] [I] [TRT] Doc string:
[07/12/2022-12:48:14] [I] [TRT] ----------------------------------------------------------------
[07/12/2022-12:48:14] [I] Finish parsing network model
[07/12/2022-12:48:14] [I] [TRT] ---------- Layers Running on DLA ----------
[07/12/2022-12:48:14] [I] [TRT] [DlaLayer] {ForeignNode[node_of_resize_out...node_of_output]}
[07/12/2022-12:48:14] [I] [TRT] ---------- Layers Running on GPU ----------
[07/12/2022-12:48:15] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +227, now: CPU 713, GPU 7627 (MiB)
[07/12/2022-12:48:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +308, GPU +327, now: CPU 1021, GPU 7954 (MiB)
[07/12/2022-12:48:16] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/12/2022-12:48:17] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/12/2022-12:48:17] [I] [TRT] Total Host Persistent Memory: 848
[07/12/2022-12:48:17] [I] [TRT] Total Device Persistent Memory: 0
[07/12/2022-12:48:17] [I] [TRT] Total Scratch Memory: 0
[07/12/2022-12:48:17] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[07/12/2022-12:48:17] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.025154ms to assign 1 blocks to 1 nodes requiring 2048 bytes.
[07/12/2022-12:48:17] [I] [TRT] Total Activation Memory: 2048
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1024, GPU 7964 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1024, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1023, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] Loaded engine size: 0 MiB
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1024, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1024, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[07/12/2022-12:48:17] [I] Engine built in 3.80514 sec.
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 919, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +0, now: CPU 920, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[07/12/2022-12:48:17] [I] Using random values for input input
[07/12/2022-12:48:17] [I] Created input binding for input with dimensions 1x1x8x8
[07/12/2022-12:48:17] [I] Using random values for output output
[07/12/2022-12:48:17] [I] Created output binding for output with dimensions 1x1x16x16
[07/12/2022-12:48:17] [I] Starting inference
[07/12/2022-12:48:20] [I] Warmup completed 428 queries over 200 ms
[07/12/2022-12:48:20] [I] Timing trace has 7930 queries over 3.00089 s
[07/12/2022-12:48:20] [I]
[07/12/2022-12:48:20] [I] === Trace details ===
[07/12/2022-12:48:20] [I] Trace averages of 10 runs:
[07/12/2022-12:48:20] [I] Average on 10 runs - GPU latency: 0.308214 ms - Host latency: 0.343268 ms (end to end 0.362955 ms, enqueue 0.304158 ms)
...
[07/12/2022-12:48:20] [I] Average on 10 runs - GPU latency: 0.26167 ms - Host latency: 0.285376 ms (end to end 0.297095 ms, enqueue 0.25918 ms)
[07/12/2022-12:48:20] [I]
[07/12/2022-12:48:20] [I] === Performance summary ===
[07/12/2022-12:48:20] [I] Throughput: 2642.55 qps
[07/12/2022-12:48:20] [I] Latency: min = 0.244629 ms, max = 0.53418 ms, mean = 0.30179 ms, median = 0.296875 ms, percentile(99%) = 0.391663 ms
[07/12/2022-12:48:20] [I] End-to-End Host Latency: min = 0.255127 ms, max = 0.549652 ms, mean = 0.314915 ms, median = 0.309692 ms, percentile(99%) = 0.408936 ms
[07/12/2022-12:48:20] [I] Enqueue Time: min = 0.220459 ms, max = 0.498138 ms, mean = 0.270443 ms, median = 0.265625 ms, percentile(99%) = 0.355103 ms
[07/12/2022-12:48:20] [I] H2D Latency: min = 0.00878906 ms, max = 0.0775146 ms, mean = 0.0134599 ms, median = 0.0119629 ms, percentile(99%) = 0.0343018 ms
[07/12/2022-12:48:20] [I] GPU Compute Time: min = 0.223145 ms, max = 0.502747 ms, mean = 0.273577 ms, median = 0.268311 ms, percentile(99%) = 0.358429 ms
[07/12/2022-12:48:20] [I] D2H Latency: min = 0.0112305 ms, max = 0.118896 ms, mean = 0.0147551 ms, median = 0.0130615 ms, percentile(99%) = 0.0366821 ms
[07/12/2022-12:48:20] [I] Total Host Walltime: 3.00089 s
[07/12/2022-12:48:20] [I] Total GPU Compute Time: 2.16946 s
[07/12/2022-12:48:20] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/12/2022-12:48:20] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/12/2022-12:48:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/12/2022-12:48:20] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=updated_resize_Avg_pool.onnx --useDLACore=0
Thanks.