Xavier DLA does not support exclusive average pooling

Hi,
I was trying to run a model with resize and average pool on DLA, but it failed with the error below:

node_of_output: Xavier DLA does not support exclusive average pooling.

Can you please explain why average pooling is not supported on DLA, and what exclusive average pooling means?

Attaching the Google Drive link to the model.

Hi,

Confirmed that we can reproduce the same error with your model.

We are checking this with our internal team.
Will share more information with you later.

Thanks.

Okay, I will wait for the reply.
Thank you.

Hi,

There are two kinds of pooling: exclusive pooling and inclusive pooling.
You can find the details in our document below:

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#pooling-layer

By default, average pooling is performed on the overlap between the pooling window and the padded input. If the exclusive parameter is set to true, the average pooling is performed on the overlap area between the pooling window and the unpadded input.

The default value of count_include_pad in ONNX is 0, i.e. exclusive pooling (see onnx/Operators.md in the onnx/onnx repository on GitHub).
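To make the difference concrete, here is a toy illustration (the input values, kernel size, and padding are made up for this example, not taken from your model): with padding, each window may cover some padded zeros, and the two modes differ only in the divisor used for the average.

```python
# Toy average pooling over a zero-padded 2D input, showing exclusive vs.
# inclusive averaging. Shapes and values are illustrative assumptions.
def avg_pool(x, k, pad, stride, inclusive):
    h, w = len(x), len(x[0])
    out = []
    for i in range(-pad, h + pad - k + 1, stride):
        row = []
        for j in range(-pad, w + pad - k + 1, stride):
            total, inside = 0.0, 0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di, j + dj
                    if 0 <= r < h and 0 <= c < w:  # element is in the unpadded input
                        total += x[r][c]
                        inside += 1
            # inclusive: divide by the full window area (padded zeros count)
            # exclusive: divide only by the number of unpadded input elements
            row.append(total / (k * k if inclusive else inside))
        out.append(row)
    return out

x = [[1.0, 2.0], [3.0, 4.0]]
# Each 2x2 window here covers exactly one real element plus three padded zeros.
print(avg_pool(x, k=2, pad=1, stride=2, inclusive=False))  # [[1.0, 2.0], [3.0, 4.0]]
print(avg_pool(x, k=2, pad=1, stride=2, inclusive=True))   # [[0.25, 0.5], [0.75, 1.0]]
```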

To make the layer work on DLA, count_include_pad needs to be set to 1 for inclusive pooling.
This can be done with our ONNX GraphSurgeon.

Please install it from here first.
Then try the attached script: set_count_include_pad.py (302 Bytes)

$ python3 set_count_include_pad.py
$ /usr/src/tensorrt/bin/trtexec --onnx=updated_resize_Avg_pool.onnx --useDLACore=0
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=updated_resize_Avg_pool.onnx --useDLACore=0
[07/12/2022-12:48:13] [I] === Model Options ===
[07/12/2022-12:48:13] [I] Format: ONNX
[07/12/2022-12:48:13] [I] Model: updated_resize_Avg_pool.onnx
[07/12/2022-12:48:13] [I] Output:
[07/12/2022-12:48:13] [I] === Build Options ===
[07/12/2022-12:48:13] [I] Max batch: explicit batch
[07/12/2022-12:48:13] [I] Workspace: 16 MiB
[07/12/2022-12:48:13] [I] minTiming: 1
[07/12/2022-12:48:13] [I] avgTiming: 8
[07/12/2022-12:48:13] [I] Precision: FP32
[07/12/2022-12:48:13] [I] Calibration:
[07/12/2022-12:48:13] [I] Refit: Disabled
[07/12/2022-12:48:13] [I] Sparsity: Disabled
[07/12/2022-12:48:13] [I] Safe mode: Disabled
[07/12/2022-12:48:13] [I] DirectIO mode: Disabled
[07/12/2022-12:48:13] [I] Restricted mode: Disabled
[07/12/2022-12:48:13] [I] Save engine:
[07/12/2022-12:48:13] [I] Load engine:
[07/12/2022-12:48:13] [I] Profiling verbosity: 0
[07/12/2022-12:48:13] [I] Tactic sources: Using default tactic sources
[07/12/2022-12:48:13] [I] timingCacheMode: local
[07/12/2022-12:48:13] [I] timingCacheFile:
[07/12/2022-12:48:13] [I] Input(s)s format: fp32:CHW
[07/12/2022-12:48:13] [I] Output(s)s format: fp32:CHW
[07/12/2022-12:48:13] [I] Input build shapes: model
[07/12/2022-12:48:13] [I] Input calibration shapes: model
[07/12/2022-12:48:13] [I] === System Options ===
[07/12/2022-12:48:13] [I] Device: 0
[07/12/2022-12:48:13] [I] DLACore: 0
[07/12/2022-12:48:13] [I] Plugins:
[07/12/2022-12:48:13] [I] === Inference Options ===
[07/12/2022-12:48:13] [I] Batch: Explicit
[07/12/2022-12:48:13] [I] Input inference shapes: model
[07/12/2022-12:48:13] [I] Iterations: 10
[07/12/2022-12:48:13] [I] Duration: 3s (+ 200ms warm up)
[07/12/2022-12:48:13] [I] Sleep time: 0ms
[07/12/2022-12:48:13] [I] Idle time: 0ms
[07/12/2022-12:48:13] [I] Streams: 1
[07/12/2022-12:48:13] [I] ExposeDMA: Disabled
[07/12/2022-12:48:13] [I] Data transfers: Enabled
[07/12/2022-12:48:13] [I] Spin-wait: Disabled
[07/12/2022-12:48:13] [I] Multithreading: Disabled
[07/12/2022-12:48:13] [I] CUDA Graph: Disabled
[07/12/2022-12:48:13] [I] Separate profiling: Disabled
[07/12/2022-12:48:13] [I] Time Deserialize: Disabled
[07/12/2022-12:48:13] [I] Time Refit: Disabled
[07/12/2022-12:48:13] [I] Skip inference: Disabled
[07/12/2022-12:48:13] [I] Inputs:
[07/12/2022-12:48:13] [I] === Reporting Options ===
[07/12/2022-12:48:13] [I] Verbose: Disabled
[07/12/2022-12:48:13] [I] Averages: 10 inferences
[07/12/2022-12:48:13] [I] Percentile: 99
[07/12/2022-12:48:13] [I] Dump refittable layers:Disabled
[07/12/2022-12:48:13] [I] Dump output: Disabled
[07/12/2022-12:48:13] [I] Profile: Disabled
[07/12/2022-12:48:13] [I] Export timing to JSON file:
[07/12/2022-12:48:13] [I] Export output to JSON file:
[07/12/2022-12:48:13] [I] Export profile to JSON file:
[07/12/2022-12:48:13] [I]
[07/12/2022-12:48:13] [I] === Device Information ===
[07/12/2022-12:48:13] [I] Selected Device: Xavier
[07/12/2022-12:48:13] [I] Compute Capability: 7.2
[07/12/2022-12:48:13] [I] SMs: 8
[07/12/2022-12:48:13] [I] Compute Clock Rate: 1.377 GHz
[07/12/2022-12:48:13] [I] Device Global Memory: 31920 MiB
[07/12/2022-12:48:13] [I] Shared Memory per SM: 96 KiB
[07/12/2022-12:48:13] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/12/2022-12:48:13] [I] Memory Clock Rate: 1.377 GHz
[07/12/2022-12:48:13] [I]
[07/12/2022-12:48:13] [I] TensorRT version: 8.2.1
[07/12/2022-12:48:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +363, GPU +0, now: CPU 381, GPU 7291 (MiB)
[07/12/2022-12:48:14] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 381 MiB, GPU 7291 MiB
[07/12/2022-12:48:14] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 486 MiB, GPU 7396 MiB
[07/12/2022-12:48:14] [I] Start parsing network model
[07/12/2022-12:48:14] [I] [TRT] ----------------------------------------------------------------
[07/12/2022-12:48:14] [I] [TRT] Input filename:   updated_resize_Avg_pool.onnx
[07/12/2022-12:48:14] [I] [TRT] ONNX IR version:  0.0.7
[07/12/2022-12:48:14] [I] [TRT] Opset version:    11
[07/12/2022-12:48:14] [I] [TRT] Producer name:    onnx-example
[07/12/2022-12:48:14] [I] [TRT] Producer version:
[07/12/2022-12:48:14] [I] [TRT] Domain:
[07/12/2022-12:48:14] [I] [TRT] Model version:    0
[07/12/2022-12:48:14] [I] [TRT] Doc string:
[07/12/2022-12:48:14] [I] [TRT] ----------------------------------------------------------------
[07/12/2022-12:48:14] [I] Finish parsing network model
[07/12/2022-12:48:14] [I] [TRT] ---------- Layers Running on DLA ----------
[07/12/2022-12:48:14] [I] [TRT] [DlaLayer] {ForeignNode[node_of_resize_out...node_of_output]}
[07/12/2022-12:48:14] [I] [TRT] ---------- Layers Running on GPU ----------
[07/12/2022-12:48:15] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +227, now: CPU 713, GPU 7627 (MiB)
[07/12/2022-12:48:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +308, GPU +327, now: CPU 1021, GPU 7954 (MiB)
[07/12/2022-12:48:16] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/12/2022-12:48:17] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/12/2022-12:48:17] [I] [TRT] Total Host Persistent Memory: 848
[07/12/2022-12:48:17] [I] [TRT] Total Device Persistent Memory: 0
[07/12/2022-12:48:17] [I] [TRT] Total Scratch Memory: 0
[07/12/2022-12:48:17] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[07/12/2022-12:48:17] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.025154ms to assign 1 blocks to 1 nodes requiring 2048 bytes.
[07/12/2022-12:48:17] [I] [TRT] Total Activation Memory: 2048
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1024, GPU 7964 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1024, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1023, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] Loaded engine size: 0 MiB
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1024, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1024, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[07/12/2022-12:48:17] [I] Engine built in 3.80514 sec.
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 919, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +0, now: CPU 920, GPU 7972 (MiB)
[07/12/2022-12:48:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[07/12/2022-12:48:17] [I] Using random values for input input
[07/12/2022-12:48:17] [I] Created input binding for input with dimensions 1x1x8x8
[07/12/2022-12:48:17] [I] Using random values for output output
[07/12/2022-12:48:17] [I] Created output binding for output with dimensions 1x1x16x16
[07/12/2022-12:48:17] [I] Starting inference
[07/12/2022-12:48:20] [I] Warmup completed 428 queries over 200 ms
[07/12/2022-12:48:20] [I] Timing trace has 7930 queries over 3.00089 s
[07/12/2022-12:48:20] [I]
[07/12/2022-12:48:20] [I] === Trace details ===
[07/12/2022-12:48:20] [I] Trace averages of 10 runs:
[07/12/2022-12:48:20] [I] Average on 10 runs - GPU latency: 0.308214 ms - Host latency: 0.343268 ms (end to end 0.362955 ms, enqueue 0.304158 ms)
...
[07/12/2022-12:48:20] [I] Average on 10 runs - GPU latency: 0.26167 ms - Host latency: 0.285376 ms (end to end 0.297095 ms, enqueue 0.25918 ms)
[07/12/2022-12:48:20] [I]
[07/12/2022-12:48:20] [I] === Performance summary ===
[07/12/2022-12:48:20] [I] Throughput: 2642.55 qps
[07/12/2022-12:48:20] [I] Latency: min = 0.244629 ms, max = 0.53418 ms, mean = 0.30179 ms, median = 0.296875 ms, percentile(99%) = 0.391663 ms
[07/12/2022-12:48:20] [I] End-to-End Host Latency: min = 0.255127 ms, max = 0.549652 ms, mean = 0.314915 ms, median = 0.309692 ms, percentile(99%) = 0.408936 ms
[07/12/2022-12:48:20] [I] Enqueue Time: min = 0.220459 ms, max = 0.498138 ms, mean = 0.270443 ms, median = 0.265625 ms, percentile(99%) = 0.355103 ms
[07/12/2022-12:48:20] [I] H2D Latency: min = 0.00878906 ms, max = 0.0775146 ms, mean = 0.0134599 ms, median = 0.0119629 ms, percentile(99%) = 0.0343018 ms
[07/12/2022-12:48:20] [I] GPU Compute Time: min = 0.223145 ms, max = 0.502747 ms, mean = 0.273577 ms, median = 0.268311 ms, percentile(99%) = 0.358429 ms
[07/12/2022-12:48:20] [I] D2H Latency: min = 0.0112305 ms, max = 0.118896 ms, mean = 0.0147551 ms, median = 0.0130615 ms, percentile(99%) = 0.0366821 ms
[07/12/2022-12:48:20] [I] Total Host Walltime: 3.00089 s
[07/12/2022-12:48:20] [I] Total GPU Compute Time: 2.16946 s
[07/12/2022-12:48:20] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/12/2022-12:48:20] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/12/2022-12:48:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/12/2022-12:48:20] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=updated_resize_Avg_pool.onnx --useDLACore=0

Thanks.


Please check the comment above.
Thanks.

Okay, thanks for the solution, I will try that.
