Xavier NX does not support adaptive average pooling on DLA?

AnaisG · August 29, 2023, 7:08am

Hi,

I’m trying to compile a ResNet50 ONNX model to TRT on the DLA of a Xavier NX, but the Adaptive Average Pooling falls back to the GPU. I tried:

changing the graph of my ONNX model to set count_include_pad=1 for inclusive pooling
changing the JetPack version (4.5.1 to 4.6.4)

Can you tell me if Adaptive Average Pooling is supported on the DLA? If so, how should I proceed?

Thanks

spolisetty · August 29, 2023, 7:34am

Hi,

We hope the following document may help you.

If you need further assistance, we are moving this post to the Jetson Xavier NX forum to get better help.

Thank you.

AakankshaS · August 29, 2023, 7:37am

Hi,
Please check the below links, as they might answer your concerns.

Thanks!

AastaLLL · August 30, 2023, 6:42am

Hi,

Based on the document, here are the constraints of the DLA pooling layer:

Pooling layer

Only two spatial dimension operations are supported.
Both FP16 and INT8 are supported.
Operations supported: kMAX, kAVERAGE.
Dimensions of the window must be in the range [1, 8].
Dimensions of padding must be in the range [0, 7].
Dimensions of stride must be in the range [1, 16].
With INT8 mode, input and output tensor scales must be the same.
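
The ranges above can be turned into a quick pre-check before exporting a model. The sketch below is illustrative only: `check_dla_pooling` is a hypothetical helper name, not a TensorRT API, and it covers only the window, padding, and stride ranges listed above (not the precision or INT8 scale rules).

```python
def check_dla_pooling(kernel, stride, padding):
    """Return a list of violations of the DLA pooling ranges quoted above.

    Each argument is a (height, width) pair. An empty list means the
    window, padding, and stride all fall inside the documented ranges.
    """
    limits = {
        "window": (kernel, 1, 8),
        "padding": (padding, 0, 7),
        "stride": (stride, 1, 16),
    }
    problems = []
    for name, (values, low, high) in limits.items():
        for value in values:
            if not low <= value <= high:
                problems.append(f"{name} {value} outside [{low}, {high}]")
    return problems


# A 7x7 window (global pooling over a 7x7 feature map) fits the ranges.
print(check_dla_pooling(kernel=(7, 7), stride=(7, 7), padding=(0, 0)))
# A 14x14 window exceeds the documented maximum window size of 8.
print(check_dla_pooling(kernel=(14, 14), stride=(1, 1), padding=(0, 0)))
```

An adaptive pooling layer has no fixed window in the exported graph, so a check like this only becomes meaningful once the layer is expressed with explicit kernel, stride, and padding values.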

Thanks.

AnaisG · September 6, 2023, 8:39am

Sorry for the delay, and thanks to all of you for your answers.
I replaced my Adaptive Pooling with an Average Pooling. Now the pooling runs on the DLA.

However, my ResNet50 still doesn’t run fully on the DLA, and I can’t use the GPU. Since I changed the pooling layer, an Identity and a Shuffle layer have been added during the TRT conversion, but I can’t see these layers in my ONNX graph. In addition, I read in the documentation:

"For both the ElementWise equal layer and the subsequent IIdentityLayer mentioned above, explicitly set your device types to DLA and their precisions to INT8. Otherwise, these layers will run on the GPU. "

So I tried to convert my model using the following command:

/usr/src/tensorrt/bin/trtexec --onnx=resnet50_new_pool.onnx --useDLACore=0 --best --allowGPUFallback

to allow INT8, FP16 and FP32 precisions. But I still have GPU fallbacks:

[09/06/2023-10:28:31] [I] [TRT] ---------- Layers Running on DLA ----------
[09/06/2023-10:28:31] [I] [TRT] [DlaLayer] {ForeignNode[/conv1/Conv.../layer4/layer4.2/relu_2/Relu]}
[09/06/2023-10:28:31] [I] [TRT] [DlaLayer] {ForeignNode[/avgpool/AveragePool.../fc/Gemm]}
[09/06/2023-10:28:31] [I] [TRT] ---------- Layers Running on GPU ----------
[09/06/2023-10:28:31] [I] [TRT] [GpuLayer] (Unnamed Layer* 119) [Identity]
[09/06/2023-10:28:31] [I] [TRT] [GpuLayer] (Unnamed Layer* 124) [Shuffle]
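
When comparing several builds, it can help to extract this placement summary from the log automatically. The following is a minimal sketch, assuming the log keeps the `[DlaLayer]` / `[GpuLayer]` markers shown above; `parse_layer_placement` is a hypothetical helper, not part of trtexec.

```python
import re


def parse_layer_placement(log_text):
    """Split a trtexec build log into lists of DLA and GPU layer entries."""
    placement = {"DLA": [], "GPU": []}
    pattern = re.compile(r"\[(Dla|Gpu)Layer\]\s+(.*)$")
    for line in log_text.splitlines():
        match = pattern.search(line)
        if match:
            # "Dla" -> "DLA", "Gpu" -> "GPU"
            placement[match.group(1).upper()].append(match.group(2).strip())
    return placement


log = """\
[09/06/2023-10:28:31] [I] [TRT] ---------- Layers Running on DLA ----------
[09/06/2023-10:28:31] [I] [TRT] [DlaLayer] {ForeignNode[/conv1/Conv.../layer4/layer4.2/relu_2/Relu]}
[09/06/2023-10:28:31] [I] [TRT] [DlaLayer] {ForeignNode[/avgpool/AveragePool.../fc/Gemm]}
[09/06/2023-10:28:31] [I] [TRT] ---------- Layers Running on GPU ----------
[09/06/2023-10:28:31] [I] [TRT] [GpuLayer] (Unnamed Layer* 119) [Identity]
[09/06/2023-10:28:31] [I] [TRT] [GpuLayer] (Unnamed Layer* 124) [Shuffle]
"""

placement = parse_layer_placement(log)
print(len(placement["DLA"]), "DLA entries,", len(placement["GPU"]), "GPU entries")
for name in placement["GPU"]:
    print("GPU fallback:", name)
```

Two separate DLA entries in the summary indicate that the network was split into two DLA subgraphs rather than one.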

I also tried with ResNet34 and EfficientNet B0 but I still have the problem.

Do you have any idea that could help me?

Best regards.

AastaLLL · September 13, 2023, 8:19am

Hi,

The layers are added automatically to convert the data into a DLA-compatible format.
You can avoid this conversion by feeding the required format directly.

For example:

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 ...
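
To try several I/O format combinations without retyping the command, the invocation can be assembled programmatically. The sketch below is an illustration under assumptions: `build_trtexec_command` is a hypothetical helper, it uses only flags that appear in this thread, and it does not run anything by itself.

```python
import shlex

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"


def build_trtexec_command(onnx_path, input_format=None, output_format=None,
                          dla_core=0, allow_gpu_fallback=True, fp16=False):
    """Assemble a trtexec argument list for a DLA build."""
    args = [TRTEXEC, f"--onnx={onnx_path}", f"--useDLACore={dla_core}"]
    if input_format is not None:
        args.append(f"--inputIOFormats={input_format}")
    if output_format is not None:
        args.append(f"--outputIOFormats={output_format}")
    if fp16:
        args.append("--fp16")
    if allow_gpu_fallback:
        args.append("--allowGPUFallback")
    return args


cmd = build_trtexec_command(
    "resnet50_new_pool.onnx",
    input_format="fp16:hwc8",
    output_format="fp16:hwc8",
    fp16=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on the Jetson. Keeping the input and output formats as separate parameters makes it easy to vary one without the other.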

Thanks.

AnaisG · September 14, 2023, 6:12am

Hello,

Thank you very much for your answer. However, it does not work on my DLA. I tried several configurations, but I always get a segmentation fault.

For example, I tried :

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:chw16 --outputIOFormats=fp16:chw16 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp32:chw32 --outputIOFormats=fp32:chw32 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback

And I always get :

&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --inputIOFormats=fp32:chw32 --outputIOFormats=fp32:chw32 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback
[09/14/2023-07:42:06] [I] === Model Options ===
[09/14/2023-07:42:06] [I] Format: ONNX
[09/14/2023-07:42:06] [I] Model: resnet50_new_pool.onnx
[09/14/2023-07:42:06] [I] Output:
[09/14/2023-07:42:06] [I] === Build Options ===
[09/14/2023-07:42:06] [I] Max batch: explicit
[09/14/2023-07:42:06] [I] Workspace: 16 MiB
[09/14/2023-07:42:06] [I] minTiming: 1
[09/14/2023-07:42:06] [I] avgTiming: 8
[09/14/2023-07:42:06] [I] Precision: FP32
[09/14/2023-07:42:06] [I] Calibration: 
[09/14/2023-07:42:06] [I] Refit: Disabled
[09/14/2023-07:42:06] [I] Sparsity: Disabled
[09/14/2023-07:42:06] [I] Safe mode: Disabled
[09/14/2023-07:42:06] [I] Restricted mode: Disabled
[09/14/2023-07:42:06] [I] Save engine: 
[09/14/2023-07:42:06] [I] Load engine: 
[09/14/2023-07:42:06] [I] NVTX verbosity: 0
[09/14/2023-07:42:06] [I] Tactic sources: Using default tactic sources
[09/14/2023-07:42:06] [I] timingCacheMode: local
[09/14/2023-07:42:06] [I] timingCacheFile: 
[09/14/2023-07:42:06] [I] Input(s): fp32:+chw32
[09/14/2023-07:42:06] [I] Output(s): fp32:+chw32
[09/14/2023-07:42:06] [I] Input build shapes: model
[09/14/2023-07:42:06] [I] Input calibration shapes: model
[09/14/2023-07:42:06] [I] === System Options ===
[09/14/2023-07:42:06] [I] Device: 0
[09/14/2023-07:42:06] [I] DLACore: 0(With GPU fallback)
[09/14/2023-07:42:06] [I] Plugins:
[09/14/2023-07:42:06] [I] === Inference Options ===
[09/14/2023-07:42:06] [I] Batch: Explicit
[09/14/2023-07:42:06] [I] Input inference shapes: model
[09/14/2023-07:42:06] [I] Iterations: 10
[09/14/2023-07:42:06] [I] Duration: 3s (+ 200ms warm up)
[09/14/2023-07:42:06] [I] Sleep time: 0ms
[09/14/2023-07:42:06] [I] Streams: 1
[09/14/2023-07:42:06] [I] ExposeDMA: Disabled
[09/14/2023-07:42:06] [I] Data transfers: Enabled
[09/14/2023-07:42:06] [I] Spin-wait: Disabled
[09/14/2023-07:42:06] [I] Multithreading: Disabled
[09/14/2023-07:42:06] [I] CUDA Graph: Disabled
[09/14/2023-07:42:06] [I] Separate profiling: Disabled
[09/14/2023-07:42:06] [I] Time Deserialize: Disabled
[09/14/2023-07:42:06] [I] Time Refit: Disabled
[09/14/2023-07:42:06] [I] Skip inference: Disabled
[09/14/2023-07:42:06] [I] Inputs:
[09/14/2023-07:42:06] [I] === Reporting Options ===
[09/14/2023-07:42:06] [I] Verbose: Disabled
[09/14/2023-07:42:06] [I] Averages: 10 inferences
[09/14/2023-07:42:06] [I] Percentile: 99
[09/14/2023-07:42:06] [I] Dump refittable layers:Disabled
[09/14/2023-07:42:06] [I] Dump output: Disabled
[09/14/2023-07:42:06] [I] Profile: Disabled
[09/14/2023-07:42:06] [I] Export timing to JSON file: 
[09/14/2023-07:42:06] [I] Export output to JSON file: 
[09/14/2023-07:42:06] [I] Export profile to JSON file: 
[09/14/2023-07:42:06] [I] 
[09/14/2023-07:42:06] [I] === Device Information ===
[09/14/2023-07:42:06] [I] Selected Device: Xavier
[09/14/2023-07:42:06] [I] Compute Capability: 7.2
[09/14/2023-07:42:06] [I] SMs: 6
[09/14/2023-07:42:06] [I] Compute Clock Rate: 1.109 GHz
[09/14/2023-07:42:06] [I] Device Global Memory: 7765 MiB
[09/14/2023-07:42:06] [I] Shared Memory per SM: 96 KiB
[09/14/2023-07:42:06] [I] Memory Bus Width: 256 bits (ECC disabled)
[09/14/2023-07:42:06] [I] Memory Clock Rate: 1.109 GHz
[09/14/2023-07:42:06] [I] 
[09/14/2023-07:42:06] [I] TensorRT version: 8001
[09/14/2023-07:42:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 5556 (MiB)
[09/14/2023-07:42:08] [I] Start parsing network model
[09/14/2023-07:42:08] [I] [TRT] ----------------------------------------------------------------
[09/14/2023-07:42:08] [I] [TRT] Input filename:   resnet50_new_pool.onnx
[09/14/2023-07:42:08] [I] [TRT] ONNX IR version:  0.0.7
[09/14/2023-07:42:08] [I] [TRT] Opset version:    14
[09/14/2023-07:42:08] [I] [TRT] Producer name:    pytorch
[09/14/2023-07:42:08] [I] [TRT] Producer version: 2.0.0
[09/14/2023-07:42:08] [I] [TRT] Domain:           
[09/14/2023-07:42:08] [I] [TRT] Model version:    0
[09/14/2023-07:42:08] [I] [TRT] Doc string:       
[09/14/2023-07:42:08] [I] [TRT] ----------------------------------------------------------------
[09/14/2023-07:42:08] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/14/2023-07:42:08] [I] Finish parsing network model
[09/14/2023-07:42:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 470, GPU 5753 (MiB)
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 119) [Identity] is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer /Flatten is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 122) [Shuffle] is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 124) [Shuffle] is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 470 MiB, GPU 5753 MiB
[09/14/2023-07:42:08] [W] [TRT] output: formats with vectorized dimension require at least 3 dimensions, but dimensions are [1,1000]. Ignoring format CHW32 for type Float.
[09/14/2023-07:42:08] [E] Error[4]: [graphNodes.cpp::checkUserIOFormatsViableHelper::697] Error Code 4: Internal Error (output: no formats available.)
[09/14/2023-07:42:08] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

Maybe I didn’t choose the right input or output formats? Do you have any idea?

I’m using JetPack4.6.4 and tensorrt 8.0.1.6-1+cuda10.2 with cuda 10.2.460-1

Thanks,

AastaLLL · September 18, 2023, 7:54am

Hi,

You can find the supported DLA input format below:

Could you share with us the output when running with --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --fp16?
Thanks.

AnaisG · September 18, 2023, 12:04pm

Hello,

Thank you for your reply.

I have the same error :

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback --fp16
[09/18/2023-13:50:11] [I] === Model Options ===
[09/18/2023-13:50:11] [I] Format: ONNX
[09/18/2023-13:50:11] [I] Model: resnet50_new_pool.onnx
[09/18/2023-13:50:11] [I] Output:
[09/18/2023-13:50:11] [I] === Build Options ===
[09/18/2023-13:50:11] [I] Max batch: explicit
[09/18/2023-13:50:11] [I] Workspace: 16 MiB
[09/18/2023-13:50:11] [I] minTiming: 1
[09/18/2023-13:50:11] [I] avgTiming: 8
[09/18/2023-13:50:11] [I] Precision: FP32+FP16
[09/18/2023-13:50:11] [I] Calibration: 
[09/18/2023-13:50:11] [I] Refit: Disabled
[09/18/2023-13:50:11] [I] Sparsity: Disabled
[09/18/2023-13:50:11] [I] Safe mode: Disabled
[09/18/2023-13:50:11] [I] Restricted mode: Disabled
[09/18/2023-13:50:11] [I] Save engine: 
[09/18/2023-13:50:11] [I] Load engine: 
[09/18/2023-13:50:11] [I] NVTX verbosity: 0
[09/18/2023-13:50:11] [I] Tactic sources: Using default tactic sources
[09/18/2023-13:50:11] [I] timingCacheMode: local
[09/18/2023-13:50:11] [I] timingCacheFile: 
[09/18/2023-13:50:11] [I] Input(s): fp16:+hwc8
[09/18/2023-13:50:11] [I] Output(s): fp16:+hwc8
[09/18/2023-13:50:11] [I] Input build shapes: model
[09/18/2023-13:50:11] [I] Input calibration shapes: model
[09/18/2023-13:50:11] [I] === System Options ===
[09/18/2023-13:50:11] [I] Device: 0
[09/18/2023-13:50:11] [I] DLACore: 0(With GPU fallback)
[09/18/2023-13:50:11] [I] Plugins:
[09/18/2023-13:50:11] [I] === Inference Options ===
[09/18/2023-13:50:11] [I] Batch: Explicit
[09/18/2023-13:50:11] [I] Input inference shapes: model
[09/18/2023-13:50:11] [I] Iterations: 10
[09/18/2023-13:50:11] [I] Duration: 3s (+ 200ms warm up)
[09/18/2023-13:50:11] [I] Sleep time: 0ms
[09/18/2023-13:50:11] [I] Streams: 1
[09/18/2023-13:50:11] [I] ExposeDMA: Disabled
[09/18/2023-13:50:11] [I] Data transfers: Enabled
[09/18/2023-13:50:11] [I] Spin-wait: Disabled
[09/18/2023-13:50:11] [I] Multithreading: Disabled
[09/18/2023-13:50:11] [I] CUDA Graph: Disabled
[09/18/2023-13:50:11] [I] Separate profiling: Disabled
[09/18/2023-13:50:11] [I] Time Deserialize: Disabled
[09/18/2023-13:50:11] [I] Time Refit: Disabled
[09/18/2023-13:50:11] [I] Skip inference: Disabled
[09/18/2023-13:50:11] [I] Inputs:
[09/18/2023-13:50:11] [I] === Reporting Options ===
[09/18/2023-13:50:11] [I] Verbose: Disabled
[09/18/2023-13:50:11] [I] Averages: 10 inferences
[09/18/2023-13:50:11] [I] Percentile: 99
[09/18/2023-13:50:11] [I] Dump refittable layers:Disabled
[09/18/2023-13:50:11] [I] Dump output: Disabled
[09/18/2023-13:50:11] [I] Profile: Disabled
[09/18/2023-13:50:11] [I] Export timing to JSON file: 
[09/18/2023-13:50:11] [I] Export output to JSON file: 
[09/18/2023-13:50:11] [I] Export profile to JSON file: 
[09/18/2023-13:50:11] [I] 
[09/18/2023-13:50:11] [I] === Device Information ===
[09/18/2023-13:50:11] [I] Selected Device: Xavier
[09/18/2023-13:50:11] [I] Compute Capability: 7.2
[09/18/2023-13:50:11] [I] SMs: 6
[09/18/2023-13:50:11] [I] Compute Clock Rate: 1.109 GHz
[09/18/2023-13:50:11] [I] Device Global Memory: 7765 MiB
[09/18/2023-13:50:11] [I] Shared Memory per SM: 96 KiB
[09/18/2023-13:50:11] [I] Memory Bus Width: 256 bits (ECC disabled)
[09/18/2023-13:50:11] [I] Memory Clock Rate: 1.109 GHz
[09/18/2023-13:50:11] [I] 
[09/18/2023-13:50:11] [I] TensorRT version: 8001
[09/18/2023-13:50:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 371, GPU 3303 (MiB)
[09/18/2023-13:50:14] [I] Start parsing network model
[09/18/2023-13:50:15] [I] [TRT] ----------------------------------------------------------------
[09/18/2023-13:50:15] [I] [TRT] Input filename:   resnet50_new_pool.onnx
[09/18/2023-13:50:15] [I] [TRT] ONNX IR version:  0.0.7
[09/18/2023-13:50:15] [I] [TRT] Opset version:    14
[09/18/2023-13:50:15] [I] [TRT] Producer name:    pytorch
[09/18/2023-13:50:15] [I] [TRT] Producer version: 2.0.0
[09/18/2023-13:50:15] [I] [TRT] Domain:           
[09/18/2023-13:50:15] [I] [TRT] Model version:    0
[09/18/2023-13:50:15] [I] [TRT] Doc string:       
[09/18/2023-13:50:15] [I] [TRT] ----------------------------------------------------------------
[09/18/2023-13:50:15] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/18/2023-13:50:15] [I] Finish parsing network model
[09/18/2023-13:50:15] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 470, GPU 3621 (MiB)
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 119) [Identity] is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer /Flatten is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 122) [Shuffle] is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 124) [Shuffle] is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 470 MiB, GPU 3622 MiB
[09/18/2023-13:50:15] [W] [TRT] output: formats with vectorized dimension require at least 3 dimensions, but dimensions are [1,1000]. Ignoring format HWC8 for type Half.
[09/18/2023-13:50:15] [E] Error[4]: [graphNodes.cpp::checkUserIOFormatsViableHelper::697] Error Code 4: Internal Error (output: no formats available.)
[09/18/2023-13:50:15] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

If I understood well, there is a problem with the output but I didn’t change anything from the original ResNet50.
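
The warning in the log states the rule that was violated: a format with a vectorized dimension requires a tensor with at least 3 dimensions, and the network output has dimensions [1,1000]. A small check along these lines makes the mismatch visible before launching a build. This is a sketch, not TensorRT code: `VECTORIZED_FORMATS` lists only the format names tried in this thread, the [1, 3, 224, 224] input shape is an assumed typical ResNet50 input, and falling back to the plain `chw` format for low-rank tensors is an assumption about a reasonable choice, not something confirmed here.

```python
# Vectorized formats tried in this thread (not an exhaustive list).
VECTORIZED_FORMATS = {"hwc8", "chw16", "chw32"}


def format_is_viable(fmt, dims):
    """Apply the rule from the log: vectorized formats need rank >= 3."""
    if fmt.lower() in VECTORIZED_FORMATS:
        return len(dims) >= 3
    return True


def pick_format(preferred, dims, fallback="chw"):
    """Keep the preferred format when viable, otherwise use the fallback."""
    return preferred if format_is_viable(preferred, dims) else fallback


print(pick_format("hwc8", [1, 3, 224, 224]))  # image-like input tensor
print(pick_format("hwc8", [1, 1000]))         # 2D classifier output
```

In trtexec terms this would mean passing different values to `--inputIOFormats` and `--outputIOFormats` instead of the same vectorized format for both; that combination was not tested in this thread.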

Thanks.

AastaLLL · September 20, 2023, 2:28am

Hi,

We tried to reproduce this issue with TensorRT’s model (/usr/src/tensorrt/data/resnet50/ResNet50.onnx),
but the build stops at an unsupported layer, which does not match your observation.

...
[09/20/2023-02:15:45] [I] Finish parsing network model
[09/20/2023-02:15:45] [E] Error[2]: [network.cpp::operator()::2682] Error Code 2: Internal Error (Assertion allowGPUFallback failed. Layer 'node_of_OC2_DUMMY_0' is not supported on DLA but GPU fallback is not enabled.)
[09/20/2023-02:15:45] [E] Error[4]: [network.cpp::validate::2789] Error Code 4: Internal Error (DLA validation failed)
[09/20/2023-02:15:45] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[09/20/2023-02:15:45] [E] Engine could not be created from network
[09/20/2023-02:15:45] [E] Building engine failed
[09/20/2023-02:15:45] [E] Failed to create engine from model or file.
[09/20/2023-02:15:45] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --useDLACore=0

Do you use a custom model? If yes, could you share the model with us?
Thanks.

AnaisG · September 20, 2023, 5:40am

Hello,

Thanks for your answer.
I don’t use any custom layer; I used the ResNet50 model from torchvision. I only replaced the AdaptiveAvgPool2d() with AvgPool2d(), then converted the model to ONNX. However, as you can see in my message of September 6, the average pooling works well on the DLA. My only problem is with the Identity and Shuffle layers added during the TensorRT conversion.

import torch
import torch.nn as nn

from torchvision.models import resnet50, ResNet50_Weights


model = resnet50(weights = ResNet50_Weights.IMAGENET1K_V2)

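# The last ResNet50 stage outputs a 7x7 feature map for a 224x224 input
input_size = 7
output_size = 1
# Fixed pooling parameters that reproduce AdaptiveAvgPool2d((1, 1)) on a 7x7 map
stride = input_size // output_size
kernel_size = input_size - (output_size - 1) * stride
padding = 0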

model_new = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model_new.avgpool = nn.AvgPool2d(kernel_size, 
                                 stride=stride, 
                                 padding=padding, 
                                 count_include_pad=True)
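
The kernel and stride arithmetic above gives an exact replacement for adaptive pooling in the 7-to-1 case used here, but not for every pair of sizes. The sketch below compares, along one axis, the windows of adaptive average pooling with those of the fixed pooling derived from the same formulas. It assumes the usual adaptive pooling definition (start = floor(i·in/out), end = ceil((i+1)·in/out)) and output_size <= input_size; the function names are hypothetical.

```python
def adaptive_windows(input_size, output_size):
    """Windows [start, end) used by adaptive average pooling along one axis."""
    windows = []
    for i in range(output_size):
        start = (i * input_size) // output_size
        end = -((-(i + 1) * input_size) // output_size)  # integer ceiling
        windows.append((start, end))
    return windows


def fixed_pool_params(input_size, output_size):
    """Kernel and stride computed with the same formulas as in the post."""
    stride = input_size // output_size
    kernel = input_size - (output_size - 1) * stride
    return kernel, stride


def fixed_windows(input_size, kernel, stride):
    """Windows [start, end) of an unpadded fixed pooling along one axis."""
    count = (input_size - kernel) // stride + 1
    return [(i * stride, i * stride + kernel) for i in range(count)]


def matches_adaptive(input_size, output_size):
    """True when the fixed pooling covers exactly the adaptive windows."""
    kernel, stride = fixed_pool_params(input_size, output_size)
    return adaptive_windows(input_size, output_size) == fixed_windows(
        input_size, kernel, stride
    )


for sizes in [(7, 1), (8, 4), (5, 3)]:
    print(sizes, fixed_pool_params(*sizes), matches_adaptive(*sizes))
```

For the (7, 1) case the single window covers the whole feature map, so the AvgPool2d replacement computes the same global average as the adaptive layer.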

Thanks

AastaLLL · September 20, 2023, 8:41am

Hi,

Would you mind also sharing the ONNX model with us?
Thanks.

AnaisG · September 20, 2023, 9:13am

Hello,

you will find enclosed the ONNX model.
Thanks,

resnet50_newpool.onnx (97.4 MB)

AastaLLL · September 21, 2023, 7:36am

Hi,

We tested your model, and the output differs from your log.
In our experiment, the DLA engine fails to build because of an unsupported layer (Identity):

[09/21/2023-15:21:32] [W] [TRT] onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/21/2023-15:21:32] [I] Finish parsing network model
[09/21/2023-15:21:32] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[09/21/2023-15:21:32] [E] Error[2]: [network.cpp::operator()::2682] Error Code 2: Internal Error (Assertion allowGPUFallback failed. Layer '(Unnamed Layer* 119) [Identity]' is not supported on DLA but GPU fallback is not enabled.)
[09/21/2023-15:21:32] [E] Error[4]: [network.cpp::validate::2789] Error Code 4: Internal Error (DLA validation failed)
[09/21/2023-15:21:32] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[09/21/2023-15:21:32] [E] Engine could not be created from network
[09/21/2023-15:21:32] [E] Building engine failed
[09/21/2023-15:21:32] [E] Failed to create engine from model or file.
[09/21/2023-15:21:32] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_newpool.onnx --int8 --useDLACore=0

Checking with the Polygraphy tool, the layer is added between the activation and the average pooling layer.

$ git clone -b release/8.5 https://github.com/NVIDIA/TensorRT.git
$ cd TensorRT/tools/Polygraphy/
$ sudo make install
$ polygraphy convert resnet50_newpool.onnx --convert-to=onnx-like-trt-network --fp16 --tensor-formats=input.1:[hwc8] --tensor-formats=output:[hwc8] -o resnet50_newpool.pb

Is it possible to remove it? It looks like it is related to the padding.

Thanks.

AnaisG · September 21, 2023, 9:17am

Hello,

Thanks for your reply. I managed to remove the Identity layer. The segmentation fault no longer appears, but I still have the Shuffle layer problem described in my previous messages.
This is what I get:

[09/21/2023-11:07:16] [I] [TRT] ---------- Layers Running on DLA ----------
[09/21/2023-11:07:16] [I] [TRT] [DlaLayer] {ForeignNode[/conv1/Conv.../fc/Gemm]}
[09/21/2023-11:07:16] [I] [TRT] ---------- Layers Running on GPU ----------
[09/21/2023-11:07:16] [I] [TRT] [GpuLayer] (Unnamed Layer* 123) [Shuffle]

Do you see this problem as well? And do you know how to solve it?

Thanks.

AastaLLL · September 22, 2023, 3:34am

Hi,

Based on the TensorRT log, the Shuffle layer is added because of the Flatten layer.

...
[09/22/2023-11:08:14] [I] Finish parsing network model
[09/22/2023-11:08:14] [W] [TRT] Layer '(Unnamed Layer* 119) [Identity]' (CAST): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer '/Flatten' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer 'fc.weight' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer 'fc.bias' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer '(Unnamed Layer* 125) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
...
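
Grouping these warnings by layer type shows at a glance which operations keep a model off the DLA. The sketch below is a hypothetical helper that handles the two warning wordings seen in this thread (the TensorRT 8.5 style with the type in parentheses, and the TensorRT 8.0 style without it); other TensorRT versions may word the message differently.

```python
import re
from collections import defaultdict

# TensorRT 8.5 style: Layer 'name' (TYPE): Unsupported on DLA. ...
_WITH_TYPE = re.compile(r"Layer '(?P<name>.+?)' \((?P<kind>[A-Z_]+)\): Unsupported on DLA")
# TensorRT 8.0 style: Default DLA is enabled but layer name is not supported on DLA, ...
_WITHOUT_TYPE = re.compile(
    r"Default DLA is enabled but layer (?P<name>.+?) is not supported on DLA"
)


def group_dla_fallbacks(log_text):
    """Map layer type -> names of layers reported as unsupported on DLA."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = _WITH_TYPE.search(line)
        if match:
            groups[match.group("kind")].append(match.group("name"))
            continue
        match = _WITHOUT_TYPE.search(line)
        if match:
            # The older wording does not report the layer type.
            groups["UNKNOWN"].append(match.group("name"))
    return dict(groups)


sample = """\
[W] [TRT] Layer '(Unnamed Layer* 119) [Identity]' (CAST): Unsupported on DLA. Switching this layer's device type to GPU.
[W] [TRT] Layer '/Flatten' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[W] [TRT] Layer 'fc.weight' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[W] [TRT] Layer 'fc.bias' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[W] [TRT] Layer '(Unnamed Layer* 125) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
"""

for kind, names in group_dla_fallbacks(sample).items():
    print(kind, names)
```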

Thanks

AnaisG · September 22, 2023, 5:28am

Hi,

So, can you confirm that it’s not possible to run an entire ResNet50 only on the DLA?

AastaLLL · September 22, 2023, 6:53am

Hi,

You can try to modify the model so it won’t need a Shuffle layer.
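
One common way to avoid a Flatten in front of the classifier is to keep the tensor four-dimensional and express the final fully connected layer as a 1x1 convolution over the pooled [C, 1, 1] feature map. This is an assumption about one possible modification, not a description of any NVIDIA script. The plain-Python sketch below only demonstrates, on a tiny made-up example, that the two computations give the same values.

```python
def fully_connected(x, weight, bias):
    """y[o] = sum_c weight[o][c] * x[c] + bias[o] for a flat input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def conv1x1(x_chw, weight, bias):
    """1x1 convolution over a [C_in][H][W] tensor, returning [C_out][H][W]."""
    height, width = len(x_chw[0]), len(x_chw[0][0])
    return [
        [[sum(w * x_chw[c][i][j] for c, w in enumerate(row)) + b
          for j in range(width)]
         for i in range(height)]
        for row, b in zip(weight, bias)
    ]


# Made-up weights: 3 input channels, 2 output classes.
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.1, -0.2]
pooled = [1.0, 2.0, 3.0]               # flattened view, shape [3]
pooled_chw = [[[v]] for v in pooled]   # unflattened view, shape [3][1][1]

flat_result = fully_connected(pooled, weight, bias)
conv_result = [plane[0][0] for plane in conv1x1(pooled_chw, weight, bias)]
print(flat_result == conv_result)
```

For torchvision's ResNet50 this would correspond to replacing the `Linear(2048, 1000)` layer with a `Conv2d(2048, 1000, kernel_size=1)` whose weights are the reshaped linear weights, and skipping the flatten step. The output then has shape [1, 1000, 1, 1] instead of [1, 1000], and whether the resulting graph builds without a Shuffle layer on a given TensorRT version would still need to be checked.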

We have a script that can modify the model to be DLA-compatible.
However, it will need some modification for TorchVision’s ResNet50.

Could you give it a try?

Install our ONNX GraphSurgeon first (see the installation steps).
Modify the model with the below script:

github.com/NVIDIA/Deep-Learning-Accelerator-SW/blob/main/scripts/prepare_models/resnet50.py

#
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: MIT
#
# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
# property and proprietary rights in and to this material, related
# documentation and any modifications thereto. Any use, reproduction,
# disclosure or distribution of this material and related documentation
# without an express license agreement from NVIDIA CORPORATION or
# its affiliates is strictly prohibited.
#
"""ONNX preparation for ResNet-50."""
import os
import onnx
import numpy as np
import onnx_graphsurgeon as gs
from onnx import shape_inference
import common

(This file has been truncated.)

Thanks.

AnaisG · September 22, 2023, 7:11am

Thanks for your help. I will try on my ResNet50 !

I will get back to you as soon as possible.

AnaisG · September 27, 2023, 7:46am

Hello,

I tried with a TensorFlow ResNet50, but I get the same error with Shuffle layers:

/usr/src/tensorrt/bin/trtexec --onnx=resnet50_tf.onnx --best --useDLACore=0 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_tf.onnx --best --useDLACore=0 --allowGPUFallback
[09/27/2023-09:18:07] [I] === Model Options ===
[09/27/2023-09:18:07] [I] Format: ONNX
[09/27/2023-09:18:07] [I] Model: resnet50_tf.onnx
[09/27/2023-09:18:07] [I] Output:
[09/27/2023-09:18:07] [I] === Build Options ===
[09/27/2023-09:18:07] [I] Max batch: explicit
[09/27/2023-09:18:07] [I] Workspace: 16 MiB
[09/27/2023-09:18:07] [I] minTiming: 1
[09/27/2023-09:18:07] [I] avgTiming: 8
[09/27/2023-09:18:07] [I] Precision: FP32+FP16+INT8
[09/27/2023-09:18:07] [I] Calibration: Dynamic
[09/27/2023-09:18:07] [I] Refit: Disabled
[09/27/2023-09:18:07] [I] Sparsity: Disabled
[09/27/2023-09:18:07] [I] Safe mode: Disabled
[09/27/2023-09:18:07] [I] Restricted mode: Disabled
[09/27/2023-09:18:07] [I] Save engine: 
[09/27/2023-09:18:07] [I] Load engine: 
[09/27/2023-09:18:07] [I] NVTX verbosity: 0
[09/27/2023-09:18:07] [I] Tactic sources: Using default tactic sources
[09/27/2023-09:18:07] [I] timingCacheMode: local
[09/27/2023-09:18:07] [I] timingCacheFile: 
[09/27/2023-09:18:07] [I] Input(s)s format: fp32:CHW
[09/27/2023-09:18:07] [I] Output(s)s format: fp32:CHW
[09/27/2023-09:18:07] [I] Input build shapes: model
[09/27/2023-09:18:07] [I] Input calibration shapes: model
[09/27/2023-09:18:07] [I] === System Options ===
[09/27/2023-09:18:07] [I] Device: 0
[09/27/2023-09:18:07] [I] DLACore: 0(With GPU fallback)
[09/27/2023-09:18:07] [I] Plugins:
[09/27/2023-09:18:07] [I] === Inference Options ===
[09/27/2023-09:18:07] [I] Batch: Explicit
[09/27/2023-09:18:07] [I] Input inference shapes: model
[09/27/2023-09:18:07] [I] Iterations: 10
[09/27/2023-09:18:07] [I] Duration: 3s (+ 200ms warm up)
[09/27/2023-09:18:07] [I] Sleep time: 0ms
[09/27/2023-09:18:07] [I] Streams: 1
[09/27/2023-09:18:07] [I] ExposeDMA: Disabled
[09/27/2023-09:18:07] [I] Data transfers: Enabled
[09/27/2023-09:18:07] [I] Spin-wait: Disabled
[09/27/2023-09:18:07] [I] Multithreading: Disabled
[09/27/2023-09:18:07] [I] CUDA Graph: Disabled
[09/27/2023-09:18:07] [I] Separate profiling: Disabled
[09/27/2023-09:18:07] [I] Time Deserialize: Disabled
[09/27/2023-09:18:07] [I] Time Refit: Disabled
[09/27/2023-09:18:07] [I] Skip inference: Disabled
[09/27/2023-09:18:07] [I] Inputs:
[09/27/2023-09:18:07] [I] === Reporting Options ===
[09/27/2023-09:18:07] [I] Verbose: Disabled
[09/27/2023-09:18:07] [I] Averages: 10 inferences
[09/27/2023-09:18:07] [I] Percentile: 99
[09/27/2023-09:18:07] [I] Dump refittable layers:Disabled
[09/27/2023-09:18:07] [I] Dump output: Disabled
[09/27/2023-09:18:07] [I] Profile: Disabled
[09/27/2023-09:18:07] [I] Export timing to JSON file: 
[09/27/2023-09:18:07] [I] Export output to JSON file: 
[09/27/2023-09:18:07] [I] Export profile to JSON file: 
[09/27/2023-09:18:07] [I] 
[09/27/2023-09:18:07] [I] === Device Information ===
[09/27/2023-09:18:07] [I] Selected Device: Xavier
[09/27/2023-09:18:07] [I] Compute Capability: 7.2
[09/27/2023-09:18:07] [I] SMs: 6
[09/27/2023-09:18:07] [I] Compute Clock Rate: 1.109 GHz
[09/27/2023-09:18:07] [I] Device Global Memory: 7773 MiB
[09/27/2023-09:18:07] [I] Shared Memory per SM: 96 KiB
[09/27/2023-09:18:07] [I] Memory Bus Width: 256 bits (ECC disabled)
[09/27/2023-09:18:07] [I] Memory Clock Rate: 1.109 GHz
[09/27/2023-09:18:07] [I] 
[09/27/2023-09:18:07] [I] TensorRT version: 8001
[09/27/2023-09:18:11] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 4477 (MiB)
[09/27/2023-09:18:11] [I] Start parsing network model
[09/27/2023-09:18:11] [I] [TRT] ----------------------------------------------------------------
[09/27/2023-09:18:11] [I] [TRT] Input filename:   resnet50_tf.onnx
[09/27/2023-09:18:11] [I] [TRT] ONNX IR version:  0.0.7
[09/27/2023-09:18:11] [I] [TRT] Opset version:    13
[09/27/2023-09:18:11] [I] [TRT] Producer name:    tf2onnx
[09/27/2023-09:18:11] [I] [TRT] Producer version: 1.15.1 37820d
[09/27/2023-09:18:11] [I] [TRT] Domain:           
[09/27/2023-09:18:11] [I] [TRT] Model version:    0
[09/27/2023-09:18:11] [I] [TRT] Doc string:       
[09/27/2023-09:18:11] [I] [TRT] ----------------------------------------------------------------
[09/27/2023-09:18:11] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/27/2023-09:18:11] [W] [TRT] ShapedWeights.cpp:173: Weights resnet50/predictions/MatMul/ReadVariableOp:0 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[09/27/2023-09:18:11] [I] Finish parsing network model
[09/27/2023-09:18:11] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 478, GPU 4681 (MiB)
[09/27/2023-09:18:11] [W] Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x224x224x3
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/conv1_conv/Conv2D__6 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/pool1_pad/Pad is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/avg_pool/Mean is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 122) [Shape] device type to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 123) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 124) [Gather] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/avg_pool/Mean_Squeeze__614 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/predictions/MatMul/ReadVariableOp:0 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 127) [Shape] device type to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 128) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] (Unnamed Layer* 129) [Concatenation]: DLA only supports concatenation on the C dimension.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 129) [Concatenation] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 130) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 131) [Gather] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 132) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 134) [Shape] device type to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 135) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 136) [Gather] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 137) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/predictions/BiasAdd/ReadVariableOp:0 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 139) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/predictions/Softmax is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 142) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 143) [Shape] device type to GPU.
[09/27/2023-09:18:11] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 478 MiB, GPU 4681 MiB
[09/27/2023-09:18:11] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[09/27/2023-09:18:12] [W] [TRT] Input tensor has less than 4 dimensions for resnet50/predictions/BiasAdd. At least one shuffle layer will be inserted which cannot run on DLA.
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on DLA ----------
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[Conv__435...resnet50/conv1_relu/Relu]}
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/pool1_pool/MaxPool...resnet50/conv5_block3_out/Relu]}
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/predictions/MatMul]}
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/predictions/BiasAdd]}
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on GPU ----------
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/predictions/BiasAdd/ReadVariableOp:0
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/conv1_conv/Conv2D__6
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] (Unnamed Layer* 139) [Shuffle]
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/pool1_pad/Pad
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/avg_pool/Mean
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] (Unnamed Layer* 137) [Shuffle]
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] shuffle_resnet50/predictions/MatMul:0
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] shuffle_(Unnamed Layer* 139) [Shuffle]_output
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] shuffle_resnet50/predictions/BiasAdd:0
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/predictions/Softmax
[09/27/2023-09:18:15] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +192, GPU +250, now: CPU 706, GPU 4971 (MiB)
[09/27/2023-09:18:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +513, now: CPU 1013, GPU 5484 (MiB)
[09/27/2023-09:18:18] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[09/27/2023-09:18:37] [W] [TRT] No implementation obeys reformatting-free rules, at least 2 reformatting nodes are needed, now picking the fastest path instead.
[09/27/2023-09:18:37] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/27/2023-09:18:37] [I] [TRT] Total Host Persistent Memory: 3408
[09/27/2023-09:18:37] [I] [TRT] Total Device Persistent Memory: 0
[09/27/2023-09:18:37] [I] [TRT] Total Scratch Memory: 0
[09/27/2023-09:18:37] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 65 MiB, GPU 13 MiB
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 1102, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1102, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1101, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1101, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1101 MiB, GPU 5789 MiB
[09/27/2023-09:18:38] [I] [TRT] Loaded engine size: 65 MB
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 1101 MiB, GPU 5792 MiB
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1167, GPU 5854 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1167, GPU 5854 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1167, GPU 5854 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1167 MiB, GPU 5854 MiB
[09/27/2023-09:18:38] [I] Engine built in 30.4537 sec.
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 995 MiB, GPU 5788 MiB
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 995, GPU 5788 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 995, GPU 5788 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1061 MiB, GPU 5809 MiB
[09/27/2023-09:18:38] [I] Created input binding for input with dimensions 1x224x224x3
[09/27/2023-09:18:38] [I] Created output binding for predictions with dimensions 1x1000
[09/27/2023-09:18:38] [I] Starting inference
[09/27/2023-09:18:41] [I] Warmup completed 18 queries over 200 ms
[09/27/2023-09:18:41] [I] Timing trace has 298 queries over 3.0293 s
[09/27/2023-09:18:41] [I] 
[09/27/2023-09:18:41] [I] === Trace details ===
[09/27/2023-09:18:41] [I] Trace averages of 10 runs:
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.94338 ms - Host latency: 9.99422 ms (end to end 10.003 ms, enqueue 9.78402 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.93923 ms - Host latency: 9.99004 ms (end to end 10.0002 ms, enqueue 9.80433 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.97124 ms - Host latency: 10.0221 ms (end to end 10.0319 ms, enqueue 9.8399 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.94576 ms - Host latency: 9.99662 ms (end to end 10.0036 ms, enqueue 9.814 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.95644 ms - Host latency: 10.0073 ms (end to end 10.017 ms, enqueue 9.78196 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.94634 ms - Host latency: 9.99722 ms (end to end 10.0055 ms, enqueue 9.82343 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.95096 ms - Host latency: 10.0018 ms (end to end 10.0099 ms, enqueue 9.84394 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.0939 ms - Host latency: 10.145 ms (end to end 10.1978 ms, enqueue 10.0139 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.2422 ms - Host latency: 10.2931 ms (end to end 10.3034 ms, enqueue 10.0641 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.0344 ms - Host latency: 10.0853 ms (end to end 10.0973 ms, enqueue 9.87219 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.97233 ms - Host latency: 10.0231 ms (end to end 10.0348 ms, enqueue 9.81057 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.0911 ms - Host latency: 10.1421 ms (end to end 10.1524 ms, enqueue 9.95394 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.041 ms - Host latency: 10.0919 ms (end to end 10.1018 ms, enqueue 9.80658 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1443 ms - Host latency: 10.1951 ms (end to end 10.206 ms, enqueue 9.9538 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1061 ms - Host latency: 10.157 ms (end to end 10.1698 ms, enqueue 9.98355 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1818 ms - Host latency: 10.2327 ms (end to end 10.2437 ms, enqueue 9.94077 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1613 ms - Host latency: 10.2122 ms (end to end 10.2246 ms, enqueue 9.99011 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.3538 ms - Host latency: 10.4142 ms (end to end 10.4237 ms, enqueue 10.1414 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.4486 ms - Host latency: 10.5115 ms (end to end 10.5203 ms, enqueue 10.2751 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.5079 ms - Host latency: 10.5709 ms (end to end 10.5791 ms, enqueue 10.3812 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.86721 ms - Host latency: 9.93015 ms (end to end 9.93904 ms, enqueue 9.7655 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.85227 ms - Host latency: 9.91536 ms (end to end 9.92612 ms, enqueue 9.72375 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.85117 ms - Host latency: 9.91418 ms (end to end 9.92402 ms, enqueue 9.75522 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.86853 ms - Host latency: 9.93145 ms (end to end 9.94155 ms, enqueue 9.65605 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.83875 ms - Host latency: 9.90159 ms (end to end 9.91301 ms, enqueue 9.69023 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.91931 ms - Host latency: 9.9823 ms (end to end 9.99209 ms, enqueue 9.7384 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.89434 ms - Host latency: 9.9594 ms (end to end 9.96948 ms, enqueue 9.77319 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.2608 ms - Host latency: 10.3439 ms (end to end 10.3572 ms, enqueue 10.1076 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.6794 ms - Host latency: 10.7627 ms (end to end 10.8828 ms, enqueue 10.6707 ms)
[09/27/2023-09:18:41] [I] 
[09/27/2023-09:18:41] [I] === Performance summary ===
[09/27/2023-09:18:41] [I] Throughput: 98.3727 qps
[09/27/2023-09:18:41] [I] Latency: min = 9.86865 ms, max = 13.1165 ms, mean = 10.1486 ms, median = 10.011 ms, percentile(99%) = 11.8225 ms
[09/27/2023-09:18:41] [I] End-to-End Host Latency: min = 9.87793 ms, max = 13.137 ms, mean = 10.1639 ms, median = 10.022 ms, percentile(99%) = 11.8826 ms
[09/27/2023-09:18:41] [I] Enqueue Time: min = 8.23511 ms, max = 12.8284 ms, mean = 9.94176 ms, median = 9.89532 ms, percentile(99%) = 11.9259 ms
[09/27/2023-09:18:41] [I] H2D Latency: min = 0.0478516 ms, max = 0.079834 ms, mean = 0.0548499 ms, median = 0.0482178 ms, percentile(99%) = 0.0793457 ms
[09/27/2023-09:18:41] [I] GPU Compute Time: min = 9.80591 ms, max = 13.0334 ms, mean = 10.0906 ms, median = 9.96002 ms, percentile(99%) = 11.7383 ms
[09/27/2023-09:18:41] [I] D2H Latency: min = 0.00256348 ms, max = 0.00463867 ms, mean = 0.00310619 ms, median = 0.00280762 ms, percentile(99%) = 0.00463867 ms
[09/27/2023-09:18:41] [I] Total Host Walltime: 3.0293 s
[09/27/2023-09:18:41] [I] Total GPU Compute Time: 3.007 s
[09/27/2023-09:18:41] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[09/27/2023-09:18:41] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[09/27/2023-09:18:41] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/27/2023-09:18:41] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_tf.onnx --best --useDLACore=0 --allowGPUFallback
[09/27/2023-09:18:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 996, GPU 5795 (MiB)
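
To make the log above easier to read, here is a minimal Python sketch that pulls out the `[DlaLayer]` and `[GpuLayer]` lines, so the layers that fell back to the GPU can be listed at a glance. It is not part of trtexec: `summarize_placement`, `PLACEMENT` and `SAMPLE` are hypothetical names chosen for illustration, and the pattern assumes the log format printed by this TensorRT version, which may differ in other releases.

```python
import re

# Matches placement lines as they appear in the log above, for example:
#   [09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/avg_pool/Mean
# Assumption: the "[DlaLayer]" / "[GpuLayer]" tags are specific to this log format.
PLACEMENT = re.compile(r"\[(Dla|Gpu)Layer\]\s+(.*\S)")


def summarize_placement(log_text):
    """Return the layer names grouped by the device they were placed on."""
    placement = {"dla": [], "gpu": []}
    for line in log_text.splitlines():
        match = PLACEMENT.search(line)
        if match:
            device = match.group(1).lower()  # "dla" or "gpu"
            placement[device].append(match.group(2))
    return placement


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on DLA ----------
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/predictions/MatMul]}
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on GPU ----------
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/pool1_pad/Pad
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/avg_pool/Mean
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] (Unnamed Layer* 137) [Shuffle]
"""

if __name__ == "__main__":
    result = summarize_placement(SAMPLE)
    for device, layers in result.items():
        print(device.upper(), len(layers))
        for name in layers:
            print("  ", name)
```

To run it on the full output, the log could be saved to a file and its contents passed to `summarize_placement` instead of `SAMPLE`. In the log above, the GPU list includes `resnet50/pool1_pad/Pad`, `resnet50/avg_pool/Mean`, `resnet50/predictions/Softmax` and several Shuffle layers.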

I also tried your code to modify my model and make it DLA-compatible, but it doesn't work.
How can I run my model entirely on the DLA, please?

Thanks.
