Does Xavier NX support adaptive average pooling on the DLA?

Hi,

I’m trying to compile a ResNet50 ONNX model to TensorRT on the DLA of a Xavier NX, but the adaptive average pooling layer falls back to the GPU. I tried:

  1. changing the graph of my ONNX model to set count_include_pad=1 for inclusive pooling
  2. changing the JetPack version (4.5.1 to 4.6.4)

Can you tell me if adaptive average pooling is supported on the DLA? If so, how should I proceed?

Thanks

Hi,

We hope the following document may help you.

If you need further assistance, we are moving this post to the Jetson Xavier NX forum to get better help.

Thank you.

Hi,
Please check the below links, as they might answer your concerns.

Thanks!

Hi,

Based on the document, here is the constraint of the DLA pooling layer:

Pooling layer

  • Only two spatial dimension operations are supported.
  • Both FP16 and INT8 are supported.
  • Operations supported: kMAX, kAVERAGE.
  • Dimensions of the window must be in the range [1, 8].
  • Dimensions of padding must be in the range [0, 7].
  • Dimensions of stride must be in the range [1, 16].
  • With INT8 mode, input and output tensor scales must be the same.

Thanks.
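
The window, padding and stride ranges listed above can be checked before exporting a model. The following is only a minimal sketch: the function and constant names are hypothetical helpers, not part of TensorRT, and only the numeric ranges from the list are covered (precision and INT8 scale rules are not).

```python
from typing import Sequence

# Ranges copied from the constraint list above; the names are hypothetical.
WINDOW_RANGE = (1, 8)
PADDING_RANGE = (0, 7)
STRIDE_RANGE = (1, 16)


def dla_pooling_violations(window: Sequence[int],
                           padding: Sequence[int],
                           stride: Sequence[int]) -> list:
    """Return a list of human-readable violations (empty if none)."""
    problems = []
    if not (len(window) == len(padding) == len(stride) == 2):
        problems.append("only two spatial dimensions are supported")
    checks = (("window", window, WINDOW_RANGE),
              ("padding", padding, PADDING_RANGE),
              ("stride", stride, STRIDE_RANGE))
    for name, values, (low, high) in checks:
        for value in values:
            if not low <= value <= high:
                problems.append(f"{name} {value} outside [{low}, {high}]")
    return problems


# A 7x7 window with stride 7 and no padding fits the listed ranges.
print(dla_pooling_violations((7, 7), (0, 0), (7, 7)))  # []
# A 14x14 window exceeds the window range.
print(dla_pooling_violations((14, 14), (0, 0), (14, 14)))
```

An adaptive pooling layer has no fixed window in the exported graph, so a check like this only applies once it has been rewritten as a regular pooling layer with explicit parameters.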

Sorry for the delay and thanks all for your answers.
I replaced my adaptive pooling layer with an average pooling layer. Now, the pooling runs on the DLA.

However, my ResNet50 still doesn’t run fully on the DLA, and I can’t use the GPU. Since I changed the pooling layer, an Identity layer and a Shuffle layer have been added during the TensorRT conversion, but I can’t see these layers in my ONNX graph. In addition, I read in the documentation:

"For both the ElementWise equal layer and the subsequent IIdentityLayer mentioned above, explicitly set your device types to DLA and their precisions to INT8. Otherwise, these layers will run on the GPU. "

So I tried to convert my model using the following command:

/usr/src/tensorrt/bin/trtexec --onnx=resnet50_new_pool.onnx --useDLACore=0 --best --allowGPUFallback

to allow INT8, FP16 and FP32 precisions. But I still have GPU fallbacks:

[09/06/2023-10:28:31] [I] [TRT] ---------- Layers Running on DLA ----------
[09/06/2023-10:28:31] [I] [TRT] [DlaLayer] {ForeignNode[/conv1/Conv.../layer4/layer4.2/relu_2/Relu]}
[09/06/2023-10:28:31] [I] [TRT] [DlaLayer] {ForeignNode[/avgpool/AveragePool.../fc/Gemm]}
[09/06/2023-10:28:31] [I] [TRT] ---------- Layers Running on GPU ----------
[09/06/2023-10:28:31] [I] [TRT] [GpuLayer] (Unnamed Layer* 119) [Identity]
[09/06/2023-10:28:31] [I] [TRT] [GpuLayer] (Unnamed Layer* 124) [Shuffle]

I also tried with ResNet34 and EfficientNet B0, but I still have the problem.

Do you have any idea what could help?

Best regards.

Hi,

The layers are added automatically to convert the data into a DLA-compatible format.
You can avoid them by feeding the required format directly.

For example:

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 ...

Thanks.

Hello,

Thank you very much for your answer. However, it does not work on my DLA. I tried several configurations, but I always get a segmentation fault.

For example, I tried:

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:chw16 --outputIOFormats=fp16:chw16 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp32:chw32 --outputIOFormats=fp32:chw32 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback

And I always get:

&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --inputIOFormats=fp32:chw32 --outputIOFormats=fp32:chw32 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback
[09/14/2023-07:42:06] [I] === Model Options ===
[09/14/2023-07:42:06] [I] Format: ONNX
[09/14/2023-07:42:06] [I] Model: resnet50_new_pool.onnx
[09/14/2023-07:42:06] [I] Output:
[09/14/2023-07:42:06] [I] === Build Options ===
[09/14/2023-07:42:06] [I] Max batch: explicit
[09/14/2023-07:42:06] [I] Workspace: 16 MiB
[09/14/2023-07:42:06] [I] minTiming: 1
[09/14/2023-07:42:06] [I] avgTiming: 8
[09/14/2023-07:42:06] [I] Precision: FP32
[09/14/2023-07:42:06] [I] Calibration: 
[09/14/2023-07:42:06] [I] Refit: Disabled
[09/14/2023-07:42:06] [I] Sparsity: Disabled
[09/14/2023-07:42:06] [I] Safe mode: Disabled
[09/14/2023-07:42:06] [I] Restricted mode: Disabled
[09/14/2023-07:42:06] [I] Save engine: 
[09/14/2023-07:42:06] [I] Load engine: 
[09/14/2023-07:42:06] [I] NVTX verbosity: 0
[09/14/2023-07:42:06] [I] Tactic sources: Using default tactic sources
[09/14/2023-07:42:06] [I] timingCacheMode: local
[09/14/2023-07:42:06] [I] timingCacheFile: 
[09/14/2023-07:42:06] [I] Input(s): fp32:+chw32
[09/14/2023-07:42:06] [I] Output(s): fp32:+chw32
[09/14/2023-07:42:06] [I] Input build shapes: model
[09/14/2023-07:42:06] [I] Input calibration shapes: model
[09/14/2023-07:42:06] [I] === System Options ===
[09/14/2023-07:42:06] [I] Device: 0
[09/14/2023-07:42:06] [I] DLACore: 0(With GPU fallback)
[09/14/2023-07:42:06] [I] Plugins:
[09/14/2023-07:42:06] [I] === Inference Options ===
[09/14/2023-07:42:06] [I] Batch: Explicit
[09/14/2023-07:42:06] [I] Input inference shapes: model
[09/14/2023-07:42:06] [I] Iterations: 10
[09/14/2023-07:42:06] [I] Duration: 3s (+ 200ms warm up)
[09/14/2023-07:42:06] [I] Sleep time: 0ms
[09/14/2023-07:42:06] [I] Streams: 1
[09/14/2023-07:42:06] [I] ExposeDMA: Disabled
[09/14/2023-07:42:06] [I] Data transfers: Enabled
[09/14/2023-07:42:06] [I] Spin-wait: Disabled
[09/14/2023-07:42:06] [I] Multithreading: Disabled
[09/14/2023-07:42:06] [I] CUDA Graph: Disabled
[09/14/2023-07:42:06] [I] Separate profiling: Disabled
[09/14/2023-07:42:06] [I] Time Deserialize: Disabled
[09/14/2023-07:42:06] [I] Time Refit: Disabled
[09/14/2023-07:42:06] [I] Skip inference: Disabled
[09/14/2023-07:42:06] [I] Inputs:
[09/14/2023-07:42:06] [I] === Reporting Options ===
[09/14/2023-07:42:06] [I] Verbose: Disabled
[09/14/2023-07:42:06] [I] Averages: 10 inferences
[09/14/2023-07:42:06] [I] Percentile: 99
[09/14/2023-07:42:06] [I] Dump refittable layers:Disabled
[09/14/2023-07:42:06] [I] Dump output: Disabled
[09/14/2023-07:42:06] [I] Profile: Disabled
[09/14/2023-07:42:06] [I] Export timing to JSON file: 
[09/14/2023-07:42:06] [I] Export output to JSON file: 
[09/14/2023-07:42:06] [I] Export profile to JSON file: 
[09/14/2023-07:42:06] [I] 
[09/14/2023-07:42:06] [I] === Device Information ===
[09/14/2023-07:42:06] [I] Selected Device: Xavier
[09/14/2023-07:42:06] [I] Compute Capability: 7.2
[09/14/2023-07:42:06] [I] SMs: 6
[09/14/2023-07:42:06] [I] Compute Clock Rate: 1.109 GHz
[09/14/2023-07:42:06] [I] Device Global Memory: 7765 MiB
[09/14/2023-07:42:06] [I] Shared Memory per SM: 96 KiB
[09/14/2023-07:42:06] [I] Memory Bus Width: 256 bits (ECC disabled)
[09/14/2023-07:42:06] [I] Memory Clock Rate: 1.109 GHz
[09/14/2023-07:42:06] [I] 
[09/14/2023-07:42:06] [I] TensorRT version: 8001
[09/14/2023-07:42:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 5556 (MiB)
[09/14/2023-07:42:08] [I] Start parsing network model
[09/14/2023-07:42:08] [I] [TRT] ----------------------------------------------------------------
[09/14/2023-07:42:08] [I] [TRT] Input filename:   resnet50_new_pool.onnx
[09/14/2023-07:42:08] [I] [TRT] ONNX IR version:  0.0.7
[09/14/2023-07:42:08] [I] [TRT] Opset version:    14
[09/14/2023-07:42:08] [I] [TRT] Producer name:    pytorch
[09/14/2023-07:42:08] [I] [TRT] Producer version: 2.0.0
[09/14/2023-07:42:08] [I] [TRT] Domain:           
[09/14/2023-07:42:08] [I] [TRT] Model version:    0
[09/14/2023-07:42:08] [I] [TRT] Doc string:       
[09/14/2023-07:42:08] [I] [TRT] ----------------------------------------------------------------
[09/14/2023-07:42:08] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/14/2023-07:42:08] [I] Finish parsing network model
[09/14/2023-07:42:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 470, GPU 5753 (MiB)
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 119) [Identity] is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer /Flatten is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 122) [Shuffle] is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 124) [Shuffle] is not supported on DLA, falling back to GPU.
[09/14/2023-07:42:08] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 470 MiB, GPU 5753 MiB
[09/14/2023-07:42:08] [W] [TRT] output: formats with vectorized dimension require at least 3 dimensions, but dimensions are [1,1000]. Ignoring format CHW32 for type Float.
[09/14/2023-07:42:08] [E] Error[4]: [graphNodes.cpp::checkUserIOFormatsViableHelper::697] Error Code 4: Internal Error (output: no formats available.)
[09/14/2023-07:42:08] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

Maybe I didn’t choose the right input or output formats? Do you have any idea?

I’m using JetPack 4.6.4 and TensorRT 8.0.1.6-1+cuda10.2 with CUDA 10.2.460-1.

Thanks,

Hi,

You can find the supported DLA input format below:

Could you share with us the output when running with --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --fp16?
Thanks.

Hello,

Thank you for your reply.

I have the same error:

/usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --inputIOFormats=fp16:hwc8 --outputIOFormats=fp16:hwc8 --onnx=resnet50_new_pool.onnx --useDLACore=0 --allowGPUFallback --fp16
[09/18/2023-13:50:11] [I] === Model Options ===
[09/18/2023-13:50:11] [I] Format: ONNX
[09/18/2023-13:50:11] [I] Model: resnet50_new_pool.onnx
[09/18/2023-13:50:11] [I] Output:
[09/18/2023-13:50:11] [I] === Build Options ===
[09/18/2023-13:50:11] [I] Max batch: explicit
[09/18/2023-13:50:11] [I] Workspace: 16 MiB
[09/18/2023-13:50:11] [I] minTiming: 1
[09/18/2023-13:50:11] [I] avgTiming: 8
[09/18/2023-13:50:11] [I] Precision: FP32+FP16
[09/18/2023-13:50:11] [I] Calibration: 
[09/18/2023-13:50:11] [I] Refit: Disabled
[09/18/2023-13:50:11] [I] Sparsity: Disabled
[09/18/2023-13:50:11] [I] Safe mode: Disabled
[09/18/2023-13:50:11] [I] Restricted mode: Disabled
[09/18/2023-13:50:11] [I] Save engine: 
[09/18/2023-13:50:11] [I] Load engine: 
[09/18/2023-13:50:11] [I] NVTX verbosity: 0
[09/18/2023-13:50:11] [I] Tactic sources: Using default tactic sources
[09/18/2023-13:50:11] [I] timingCacheMode: local
[09/18/2023-13:50:11] [I] timingCacheFile: 
[09/18/2023-13:50:11] [I] Input(s): fp16:+hwc8
[09/18/2023-13:50:11] [I] Output(s): fp16:+hwc8
[09/18/2023-13:50:11] [I] Input build shapes: model
[09/18/2023-13:50:11] [I] Input calibration shapes: model
[09/18/2023-13:50:11] [I] === System Options ===
[09/18/2023-13:50:11] [I] Device: 0
[09/18/2023-13:50:11] [I] DLACore: 0(With GPU fallback)
[09/18/2023-13:50:11] [I] Plugins:
[09/18/2023-13:50:11] [I] === Inference Options ===
[09/18/2023-13:50:11] [I] Batch: Explicit
[09/18/2023-13:50:11] [I] Input inference shapes: model
[09/18/2023-13:50:11] [I] Iterations: 10
[09/18/2023-13:50:11] [I] Duration: 3s (+ 200ms warm up)
[09/18/2023-13:50:11] [I] Sleep time: 0ms
[09/18/2023-13:50:11] [I] Streams: 1
[09/18/2023-13:50:11] [I] ExposeDMA: Disabled
[09/18/2023-13:50:11] [I] Data transfers: Enabled
[09/18/2023-13:50:11] [I] Spin-wait: Disabled
[09/18/2023-13:50:11] [I] Multithreading: Disabled
[09/18/2023-13:50:11] [I] CUDA Graph: Disabled
[09/18/2023-13:50:11] [I] Separate profiling: Disabled
[09/18/2023-13:50:11] [I] Time Deserialize: Disabled
[09/18/2023-13:50:11] [I] Time Refit: Disabled
[09/18/2023-13:50:11] [I] Skip inference: Disabled
[09/18/2023-13:50:11] [I] Inputs:
[09/18/2023-13:50:11] [I] === Reporting Options ===
[09/18/2023-13:50:11] [I] Verbose: Disabled
[09/18/2023-13:50:11] [I] Averages: 10 inferences
[09/18/2023-13:50:11] [I] Percentile: 99
[09/18/2023-13:50:11] [I] Dump refittable layers:Disabled
[09/18/2023-13:50:11] [I] Dump output: Disabled
[09/18/2023-13:50:11] [I] Profile: Disabled
[09/18/2023-13:50:11] [I] Export timing to JSON file: 
[09/18/2023-13:50:11] [I] Export output to JSON file: 
[09/18/2023-13:50:11] [I] Export profile to JSON file: 
[09/18/2023-13:50:11] [I] 
[09/18/2023-13:50:11] [I] === Device Information ===
[09/18/2023-13:50:11] [I] Selected Device: Xavier
[09/18/2023-13:50:11] [I] Compute Capability: 7.2
[09/18/2023-13:50:11] [I] SMs: 6
[09/18/2023-13:50:11] [I] Compute Clock Rate: 1.109 GHz
[09/18/2023-13:50:11] [I] Device Global Memory: 7765 MiB
[09/18/2023-13:50:11] [I] Shared Memory per SM: 96 KiB
[09/18/2023-13:50:11] [I] Memory Bus Width: 256 bits (ECC disabled)
[09/18/2023-13:50:11] [I] Memory Clock Rate: 1.109 GHz
[09/18/2023-13:50:11] [I] 
[09/18/2023-13:50:11] [I] TensorRT version: 8001
[09/18/2023-13:50:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 371, GPU 3303 (MiB)
[09/18/2023-13:50:14] [I] Start parsing network model
[09/18/2023-13:50:15] [I] [TRT] ----------------------------------------------------------------
[09/18/2023-13:50:15] [I] [TRT] Input filename:   resnet50_new_pool.onnx
[09/18/2023-13:50:15] [I] [TRT] ONNX IR version:  0.0.7
[09/18/2023-13:50:15] [I] [TRT] Opset version:    14
[09/18/2023-13:50:15] [I] [TRT] Producer name:    pytorch
[09/18/2023-13:50:15] [I] [TRT] Producer version: 2.0.0
[09/18/2023-13:50:15] [I] [TRT] Domain:           
[09/18/2023-13:50:15] [I] [TRT] Model version:    0
[09/18/2023-13:50:15] [I] [TRT] Doc string:       
[09/18/2023-13:50:15] [I] [TRT] ----------------------------------------------------------------
[09/18/2023-13:50:15] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/18/2023-13:50:15] [I] Finish parsing network model
[09/18/2023-13:50:15] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 470, GPU 3621 (MiB)
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 119) [Identity] is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer /Flatten is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 122) [Shuffle] is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 124) [Shuffle] is not supported on DLA, falling back to GPU.
[09/18/2023-13:50:15] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 470 MiB, GPU 3622 MiB
[09/18/2023-13:50:15] [W] [TRT] output: formats with vectorized dimension require at least 3 dimensions, but dimensions are [1,1000]. Ignoring format HWC8 for type Half.
[09/18/2023-13:50:15] [E] Error[4]: [graphNodes.cpp::checkUserIOFormatsViableHelper::697] Error Code 4: Internal Error (output: no formats available.)
[09/18/2023-13:50:15] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

If I understand correctly, there is a problem with the output, but I didn’t change anything from the original ResNet50.

Thanks.
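
The warning in the log above states that formats with a vectorized dimension require at least 3 dimensions, while the network output has dimensions [1,1000]. The sketch below mirrors only that rule so that a requested format can be sanity-checked against a tensor shape. The helper name is hypothetical, the set of "vectorized" formats is limited to the ones tried in this thread (an assumption based on the warnings), and the 1x3x224x224 input shape is assumed rather than taken from the log.

```python
# Formats tried in this thread; treating all of them as "vectorized" is an
# assumption based on the warnings in the logs above.
VECTORIZED_FORMATS = {"hwc8", "chw16", "chw32"}
MIN_DIMS_FOR_VECTORIZED = 3  # taken from the TensorRT warning text


def format_is_viable(dims, fmt):
    """Mirror the rule quoted in the warning: vectorized formats need >= 3 dims."""
    if fmt.lower() in VECTORIZED_FORMATS:
        return len(dims) >= MIN_DIMS_FOR_VECTORIZED
    return True


print(format_is_viable([1, 3, 224, 224], "hwc8"))  # True  (assumed input shape)
print(format_is_viable([1, 1000], "hwc8"))         # False (output shape from the log)
print(format_is_viable([1, 1000], "chw"))          # True
```

Under this reading, the vectorized format is rejected for the 2-D classifier output rather than for the image input, so requesting a linear output format such as fp16:chw might be worth trying. That combination was not tested in this thread.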

Hi,

We tried to reproduce this issue with TensorRT’s model (/usr/src/tensorrt/data/resnet50/ResNet50.onnx),
but the build stops at an unsupported layer, which does not match your observation.

...
[09/20/2023-02:15:45] [I] Finish parsing network model
[09/20/2023-02:15:45] [E] Error[2]: [network.cpp::operator()::2682] Error Code 2: Internal Error (Assertion allowGPUFallback failed. Layer 'node_of_OC2_DUMMY_0' is not supported on DLA but GPU fallback is not enabled.)
[09/20/2023-02:15:45] [E] Error[4]: [network.cpp::validate::2789] Error Code 4: Internal Error (DLA validation failed)
[09/20/2023-02:15:45] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[09/20/2023-02:15:45] [E] Engine could not be created from network
[09/20/2023-02:15:45] [E] Building engine failed
[09/20/2023-02:15:45] [E] Failed to create engine from model or file.
[09/20/2023-02:15:45] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --useDLACore=0

Do you use a custom model? If yes, could you share the model with us?
Thanks.

Hello,

Thanks for your answer.
I don’t use any custom layer; I used the ResNet50 model from torchvision. Then, I only replaced AdaptiveAvgPool2d() with AvgPool2d(). Finally, I converted my model to an ONNX model. However, as you can see in my message of the 6th of September, the average pooling works well on the DLA. I only have a problem with the Identity and Shuffle layers added during the TensorRT conversion.

import torch
import torch.nn as nn

from torchvision.models import resnet50, ResNet50_Weights


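# Reference model with the original adaptive average pooling
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# The last feature map of ResNet50 is 7x7 for a 224x224 input; pool it down to 1x1
input_size = 7
output_size = 1
stride = input_size // output_size
kernel_size = input_size - (output_size - 1) * stride
padding = 0

# Same network, with the adaptive pooling replaced by a fixed-size average pooling
model_new = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model_new.avgpool = nn.AvgPool2d(kernel_size,
                                 stride=stride,
                                 padding=padding,
                                 count_include_pad=True)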

Thanks
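
The kernel and stride formula in the snippet above can be compared against the windows that PyTorch-style adaptive pooling uses (start = floor(i*in/out), end = ceil((i+1)*in/out)). This is a pure-Python sketch in one dimension. The function names are hypothetical, and it only compares window positions rather than running any model.

```python
def fixed_pool_params(input_size, output_size):
    """Kernel and stride computed with the formula from the post above."""
    stride = input_size // output_size
    kernel_size = input_size - (output_size - 1) * stride
    return kernel_size, stride


def adaptive_windows(input_size, output_size):
    """[start, end) windows of PyTorch-style adaptive pooling (integer math)."""
    return [((i * input_size) // output_size,
             ((i + 1) * input_size + output_size - 1) // output_size)
            for i in range(output_size)]


def fixed_windows(input_size, output_size):
    """[start, end) windows of an unpadded pooling with the fixed parameters."""
    kernel_size, stride = fixed_pool_params(input_size, output_size)
    return [(i * stride, i * stride + kernel_size) for i in range(output_size)]


print(fixed_pool_params(7, 1))                          # (7, 7)
print(adaptive_windows(7, 1) == fixed_windows(7, 1))    # True
print(adaptive_windows(10, 4) == fixed_windows(10, 4))  # False
```

For the 7-to-1 case used here, the single window covers the whole feature map, so the fixed pooling averages the same elements as the adaptive one. For other size combinations the windows can differ, as the last line shows, so the replacement is not exact in general.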

Hi,

Would you mind also sharing the ONNX model with us?
Thanks.

Hello,

You will find the ONNX model attached.
Thanks,

resnet50_newpool.onnx (97.4 MB)

Hi,

We tested your model, and the output differs from your log.
In our experiment, the DLA engine fails to build because of an unsupported layer (Identity):

[09/21/2023-15:21:32] [W] [TRT] onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/21/2023-15:21:32] [I] Finish parsing network model
[09/21/2023-15:21:32] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[09/21/2023-15:21:32] [E] Error[2]: [network.cpp::operator()::2682] Error Code 2: Internal Error (Assertion allowGPUFallback failed. Layer '(Unnamed Layer* 119) [Identity]' is not supported on DLA but GPU fallback is not enabled.)
[09/21/2023-15:21:32] [E] Error[4]: [network.cpp::validate::2789] Error Code 4: Internal Error (DLA validation failed)
[09/21/2023-15:21:32] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[09/21/2023-15:21:32] [E] Engine could not be created from network
[09/21/2023-15:21:32] [E] Building engine failed
[09/21/2023-15:21:32] [E] Failed to create engine from model or file.
[09/21/2023-15:21:32] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_newpool.onnx --int8 --useDLACore=0

Checking with the Polygraphy tool shows that the layer is added between the activation and the average pooling layer.

$ git clone -b release/8.5 https://github.com/NVIDIA/TensorRT.git
$ cd TensorRT/tools/Polygraphy/
$ sudo make install
$ polygraphy convert resnet50_newpool.onnx --convert-to=onnx-like-trt-network --fp16 --tensor-formats=input.1:[hwc8] --tensor-formats=output:[hwc8] -o resnet50_newpool.pb

Is it possible to remove it? It looks like it is related to the padding.

Thanks.

Hello,

Thanks for your reply. I managed to remove the Identity layer, and the segmentation fault no longer appears. But I still have the problem with the Shuffle layer, as indicated in my previous messages.
I get this output:

[09/21/2023-11:07:16] [I] [TRT] ---------- Layers Running on DLA ----------
[09/21/2023-11:07:16] [I] [TRT] [DlaLayer] {ForeignNode[/conv1/Conv.../fc/Gemm]}
[09/21/2023-11:07:16] [I] [TRT] ---------- Layers Running on GPU ----------
[09/21/2023-11:07:16] [I] [TRT] [GpuLayer] (Unnamed Layer* 123) [Shuffle]

Do you have this problem too? And do you know how to solve it?

Thanks.

Hi,

Based on the TensorRT log, the Shuffle layer is added because of the Flatten layer.

...
[09/22/2023-11:08:14] [I] Finish parsing network model
[09/22/2023-11:08:14] [W] [TRT] Layer '(Unnamed Layer* 119) [Identity]' (CAST): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer '/Flatten' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer 'fc.weight' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer 'fc.bias' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[09/22/2023-11:08:14] [W] [TRT] Layer '(Unnamed Layer* 125) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
...

Thanks
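
To see at a glance which layers leave the DLA, the fallback warnings can be pulled out of a saved trtexec log. This is a minimal sketch: the function name is hypothetical, and the two patterns are copied from the warning wording in the logs in this thread (TensorRT 8.0.1 and 8.5.2), so other versions may word them differently.

```python
import re

# The two warning formats below are copied from the logs in this thread;
# other TensorRT versions may phrase them differently.
OLD_STYLE = re.compile(
    r"Default DLA is enabled but layer (?P<name>.+?) is not supported on DLA")
NEW_STYLE = re.compile(
    r"Layer '(?P<name>.+?)' \((?P<kind>\w+)\): Unsupported on DLA")


def gpu_fallback_layers(log_text):
    """Return the names of layers that TensorRT reports as moved off the DLA."""
    names = []
    for line in log_text.splitlines():
        match = OLD_STYLE.search(line) or NEW_STYLE.search(line)
        if match and match.group("name") not in names:
            names.append(match.group("name"))
    return names


sample = """\
[W] [TRT] Default DLA is enabled but layer /Flatten is not supported on DLA, falling back to GPU.
[W] [TRT] Layer '(Unnamed Layer* 125) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[I] Finish parsing network model
"""
print(gpu_fallback_layers(sample))
# ['/Flatten', '(Unnamed Layer* 125) [Shuffle]']
```

Named layers such as /Flatten map directly to nodes in the ONNX graph, while the unnamed ones are inserted by TensorRT during conversion, which is why they do not show up when inspecting the ONNX file.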

Hi,

So, can you confirm that it’s not possible to run an entire ResNet50 only on the DLA?

Hi,

You can try to modify the model so it won’t need a Shuffle layer.

We have a script that can modify the model to be DLA-compatible.
However, it will need some modification for TorchVision’s ResNet50.

Could you give it a try?

  1. Install our ONNX GraphSurgeon first, following the linked steps.

  2. Modify the model with the below script:

Thanks.

Thanks for your help. I will try it on my ResNet50!

I will get back to you as soon as possible.

Hello,

I also tried with a TensorFlow model, but I get the same problem with Shuffle layers…

/usr/src/tensorrt/bin/trtexec --onnx=resnet50_tf.onnx --best --useDLACore=0 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_tf.onnx --best --useDLACore=0 --allowGPUFallback
[09/27/2023-09:18:07] [I] === Model Options ===
[09/27/2023-09:18:07] [I] Format: ONNX
[09/27/2023-09:18:07] [I] Model: resnet50_tf.onnx
[09/27/2023-09:18:07] [I] Output:
[09/27/2023-09:18:07] [I] === Build Options ===
[09/27/2023-09:18:07] [I] Max batch: explicit
[09/27/2023-09:18:07] [I] Workspace: 16 MiB
[09/27/2023-09:18:07] [I] minTiming: 1
[09/27/2023-09:18:07] [I] avgTiming: 8
[09/27/2023-09:18:07] [I] Precision: FP32+FP16+INT8
[09/27/2023-09:18:07] [I] Calibration: Dynamic
[09/27/2023-09:18:07] [I] Refit: Disabled
[09/27/2023-09:18:07] [I] Sparsity: Disabled
[09/27/2023-09:18:07] [I] Safe mode: Disabled
[09/27/2023-09:18:07] [I] Restricted mode: Disabled
[09/27/2023-09:18:07] [I] Save engine: 
[09/27/2023-09:18:07] [I] Load engine: 
[09/27/2023-09:18:07] [I] NVTX verbosity: 0
[09/27/2023-09:18:07] [I] Tactic sources: Using default tactic sources
[09/27/2023-09:18:07] [I] timingCacheMode: local
[09/27/2023-09:18:07] [I] timingCacheFile: 
[09/27/2023-09:18:07] [I] Input(s)s format: fp32:CHW
[09/27/2023-09:18:07] [I] Output(s)s format: fp32:CHW
[09/27/2023-09:18:07] [I] Input build shapes: model
[09/27/2023-09:18:07] [I] Input calibration shapes: model
[09/27/2023-09:18:07] [I] === System Options ===
[09/27/2023-09:18:07] [I] Device: 0
[09/27/2023-09:18:07] [I] DLACore: 0(With GPU fallback)
[09/27/2023-09:18:07] [I] Plugins:
[09/27/2023-09:18:07] [I] === Inference Options ===
[09/27/2023-09:18:07] [I] Batch: Explicit
[09/27/2023-09:18:07] [I] Input inference shapes: model
[09/27/2023-09:18:07] [I] Iterations: 10
[09/27/2023-09:18:07] [I] Duration: 3s (+ 200ms warm up)
[09/27/2023-09:18:07] [I] Sleep time: 0ms
[09/27/2023-09:18:07] [I] Streams: 1
[09/27/2023-09:18:07] [I] ExposeDMA: Disabled
[09/27/2023-09:18:07] [I] Data transfers: Enabled
[09/27/2023-09:18:07] [I] Spin-wait: Disabled
[09/27/2023-09:18:07] [I] Multithreading: Disabled
[09/27/2023-09:18:07] [I] CUDA Graph: Disabled
[09/27/2023-09:18:07] [I] Separate profiling: Disabled
[09/27/2023-09:18:07] [I] Time Deserialize: Disabled
[09/27/2023-09:18:07] [I] Time Refit: Disabled
[09/27/2023-09:18:07] [I] Skip inference: Disabled
[09/27/2023-09:18:07] [I] Inputs:
[09/27/2023-09:18:07] [I] === Reporting Options ===
[09/27/2023-09:18:07] [I] Verbose: Disabled
[09/27/2023-09:18:07] [I] Averages: 10 inferences
[09/27/2023-09:18:07] [I] Percentile: 99
[09/27/2023-09:18:07] [I] Dump refittable layers:Disabled
[09/27/2023-09:18:07] [I] Dump output: Disabled
[09/27/2023-09:18:07] [I] Profile: Disabled
[09/27/2023-09:18:07] [I] Export timing to JSON file: 
[09/27/2023-09:18:07] [I] Export output to JSON file: 
[09/27/2023-09:18:07] [I] Export profile to JSON file: 
[09/27/2023-09:18:07] [I] 
[09/27/2023-09:18:07] [I] === Device Information ===
[09/27/2023-09:18:07] [I] Selected Device: Xavier
[09/27/2023-09:18:07] [I] Compute Capability: 7.2
[09/27/2023-09:18:07] [I] SMs: 6
[09/27/2023-09:18:07] [I] Compute Clock Rate: 1.109 GHz
[09/27/2023-09:18:07] [I] Device Global Memory: 7773 MiB
[09/27/2023-09:18:07] [I] Shared Memory per SM: 96 KiB
[09/27/2023-09:18:07] [I] Memory Bus Width: 256 bits (ECC disabled)
[09/27/2023-09:18:07] [I] Memory Clock Rate: 1.109 GHz
[09/27/2023-09:18:07] [I] 
[09/27/2023-09:18:07] [I] TensorRT version: 8001
[09/27/2023-09:18:11] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 4477 (MiB)
[09/27/2023-09:18:11] [I] Start parsing network model
[09/27/2023-09:18:11] [I] [TRT] ----------------------------------------------------------------
[09/27/2023-09:18:11] [I] [TRT] Input filename:   resnet50_tf.onnx
[09/27/2023-09:18:11] [I] [TRT] ONNX IR version:  0.0.7
[09/27/2023-09:18:11] [I] [TRT] Opset version:    13
[09/27/2023-09:18:11] [I] [TRT] Producer name:    tf2onnx
[09/27/2023-09:18:11] [I] [TRT] Producer version: 1.15.1 37820d
[09/27/2023-09:18:11] [I] [TRT] Domain:           
[09/27/2023-09:18:11] [I] [TRT] Model version:    0
[09/27/2023-09:18:11] [I] [TRT] Doc string:       
[09/27/2023-09:18:11] [I] [TRT] ----------------------------------------------------------------
[09/27/2023-09:18:11] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/27/2023-09:18:11] [W] [TRT] ShapedWeights.cpp:173: Weights resnet50/predictions/MatMul/ReadVariableOp:0 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[09/27/2023-09:18:11] [I] Finish parsing network model
[09/27/2023-09:18:11] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 478, GPU 4681 (MiB)
[09/27/2023-09:18:11] [W] Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x224x224x3
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/conv1_conv/Conv2D__6 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/pool1_pad/Pad is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/avg_pool/Mean is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 122) [Shape] device type to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 123) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 124) [Gather] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/avg_pool/Mean_Squeeze__614 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/predictions/MatMul/ReadVariableOp:0 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 127) [Shape] device type to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 128) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] (Unnamed Layer* 129) [Concatenation]: DLA only supports concatenation on the C dimension.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 129) [Concatenation] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 130) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 131) [Gather] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 132) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 134) [Shape] device type to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 135) [Constant] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 136) [Gather] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 137) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/predictions/BiasAdd/ReadVariableOp:0 is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 139) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer resnet50/predictions/Softmax is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 142) [Shuffle] is not supported on DLA, falling back to GPU.
[09/27/2023-09:18:11] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 143) [Shape] device type to GPU.
[09/27/2023-09:18:11] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 478 MiB, GPU 4681 MiB
[09/27/2023-09:18:11] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[09/27/2023-09:18:12] [W] [TRT] Input tensor has less than 4 dimensions for resnet50/predictions/BiasAdd. At least one shuffle layer will be inserted which cannot run on DLA.
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on DLA ----------
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[Conv__435...resnet50/conv1_relu/Relu]}
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/pool1_pool/MaxPool...resnet50/conv5_block3_out/Relu]}
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/predictions/MatMul]}
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/predictions/BiasAdd]}
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on GPU ----------
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/predictions/BiasAdd/ReadVariableOp:0
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/conv1_conv/Conv2D__6
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] (Unnamed Layer* 139) [Shuffle]
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/pool1_pad/Pad
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/avg_pool/Mean
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] (Unnamed Layer* 137) [Shuffle]
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] shuffle_resnet50/predictions/MatMul:0
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] shuffle_(Unnamed Layer* 139) [Shuffle]_output
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] shuffle_resnet50/predictions/BiasAdd:0
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/predictions/Softmax
[09/27/2023-09:18:15] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +192, GPU +250, now: CPU 706, GPU 4971 (MiB)
[09/27/2023-09:18:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +513, now: CPU 1013, GPU 5484 (MiB)
[09/27/2023-09:18:18] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[09/27/2023-09:18:37] [W] [TRT] No implementation obeys reformatting-free rules, at least 2 reformatting nodes are needed, now picking the fastest path instead.
[09/27/2023-09:18:37] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/27/2023-09:18:37] [I] [TRT] Total Host Persistent Memory: 3408
[09/27/2023-09:18:37] [I] [TRT] Total Device Persistent Memory: 0
[09/27/2023-09:18:37] [I] [TRT] Total Scratch Memory: 0
[09/27/2023-09:18:37] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 65 MiB, GPU 13 MiB
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 1102, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1102, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1101, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1101, GPU 5789 (MiB)
[09/27/2023-09:18:37] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1101 MiB, GPU 5789 MiB
[09/27/2023-09:18:38] [I] [TRT] Loaded engine size: 65 MB
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 1101 MiB, GPU 5792 MiB
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1167, GPU 5854 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1167, GPU 5854 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1167, GPU 5854 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1167 MiB, GPU 5854 MiB
[09/27/2023-09:18:38] [I] Engine built in 30.4537 sec.
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 995 MiB, GPU 5788 MiB
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 995, GPU 5788 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 995, GPU 5788 (MiB)
[09/27/2023-09:18:38] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1061 MiB, GPU 5809 MiB
[09/27/2023-09:18:38] [I] Created input binding for input with dimensions 1x224x224x3
[09/27/2023-09:18:38] [I] Created output binding for predictions with dimensions 1x1000
[09/27/2023-09:18:38] [I] Starting inference
[09/27/2023-09:18:41] [I] Warmup completed 18 queries over 200 ms
[09/27/2023-09:18:41] [I] Timing trace has 298 queries over 3.0293 s
[09/27/2023-09:18:41] [I] 
[09/27/2023-09:18:41] [I] === Trace details ===
[09/27/2023-09:18:41] [I] Trace averages of 10 runs:
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.94338 ms - Host latency: 9.99422 ms (end to end 10.003 ms, enqueue 9.78402 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.93923 ms - Host latency: 9.99004 ms (end to end 10.0002 ms, enqueue 9.80433 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.97124 ms - Host latency: 10.0221 ms (end to end 10.0319 ms, enqueue 9.8399 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.94576 ms - Host latency: 9.99662 ms (end to end 10.0036 ms, enqueue 9.814 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.95644 ms - Host latency: 10.0073 ms (end to end 10.017 ms, enqueue 9.78196 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.94634 ms - Host latency: 9.99722 ms (end to end 10.0055 ms, enqueue 9.82343 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.95096 ms - Host latency: 10.0018 ms (end to end 10.0099 ms, enqueue 9.84394 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.0939 ms - Host latency: 10.145 ms (end to end 10.1978 ms, enqueue 10.0139 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.2422 ms - Host latency: 10.2931 ms (end to end 10.3034 ms, enqueue 10.0641 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.0344 ms - Host latency: 10.0853 ms (end to end 10.0973 ms, enqueue 9.87219 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.97233 ms - Host latency: 10.0231 ms (end to end 10.0348 ms, enqueue 9.81057 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.0911 ms - Host latency: 10.1421 ms (end to end 10.1524 ms, enqueue 9.95394 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.041 ms - Host latency: 10.0919 ms (end to end 10.1018 ms, enqueue 9.80658 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1443 ms - Host latency: 10.1951 ms (end to end 10.206 ms, enqueue 9.9538 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1061 ms - Host latency: 10.157 ms (end to end 10.1698 ms, enqueue 9.98355 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1818 ms - Host latency: 10.2327 ms (end to end 10.2437 ms, enqueue 9.94077 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.1613 ms - Host latency: 10.2122 ms (end to end 10.2246 ms, enqueue 9.99011 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.3538 ms - Host latency: 10.4142 ms (end to end 10.4237 ms, enqueue 10.1414 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.4486 ms - Host latency: 10.5115 ms (end to end 10.5203 ms, enqueue 10.2751 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.5079 ms - Host latency: 10.5709 ms (end to end 10.5791 ms, enqueue 10.3812 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.86721 ms - Host latency: 9.93015 ms (end to end 9.93904 ms, enqueue 9.7655 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.85227 ms - Host latency: 9.91536 ms (end to end 9.92612 ms, enqueue 9.72375 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.85117 ms - Host latency: 9.91418 ms (end to end 9.92402 ms, enqueue 9.75522 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.86853 ms - Host latency: 9.93145 ms (end to end 9.94155 ms, enqueue 9.65605 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.83875 ms - Host latency: 9.90159 ms (end to end 9.91301 ms, enqueue 9.69023 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.91931 ms - Host latency: 9.9823 ms (end to end 9.99209 ms, enqueue 9.7384 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 9.89434 ms - Host latency: 9.9594 ms (end to end 9.96948 ms, enqueue 9.77319 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.2608 ms - Host latency: 10.3439 ms (end to end 10.3572 ms, enqueue 10.1076 ms)
[09/27/2023-09:18:41] [I] Average on 10 runs - GPU latency: 10.6794 ms - Host latency: 10.7627 ms (end to end 10.8828 ms, enqueue 10.6707 ms)
[09/27/2023-09:18:41] [I] 
[09/27/2023-09:18:41] [I] === Performance summary ===
[09/27/2023-09:18:41] [I] Throughput: 98.3727 qps
[09/27/2023-09:18:41] [I] Latency: min = 9.86865 ms, max = 13.1165 ms, mean = 10.1486 ms, median = 10.011 ms, percentile(99%) = 11.8225 ms
[09/27/2023-09:18:41] [I] End-to-End Host Latency: min = 9.87793 ms, max = 13.137 ms, mean = 10.1639 ms, median = 10.022 ms, percentile(99%) = 11.8826 ms
[09/27/2023-09:18:41] [I] Enqueue Time: min = 8.23511 ms, max = 12.8284 ms, mean = 9.94176 ms, median = 9.89532 ms, percentile(99%) = 11.9259 ms
[09/27/2023-09:18:41] [I] H2D Latency: min = 0.0478516 ms, max = 0.079834 ms, mean = 0.0548499 ms, median = 0.0482178 ms, percentile(99%) = 0.0793457 ms
[09/27/2023-09:18:41] [I] GPU Compute Time: min = 9.80591 ms, max = 13.0334 ms, mean = 10.0906 ms, median = 9.96002 ms, percentile(99%) = 11.7383 ms
[09/27/2023-09:18:41] [I] D2H Latency: min = 0.00256348 ms, max = 0.00463867 ms, mean = 0.00310619 ms, median = 0.00280762 ms, percentile(99%) = 0.00463867 ms
[09/27/2023-09:18:41] [I] Total Host Walltime: 3.0293 s
[09/27/2023-09:18:41] [I] Total GPU Compute Time: 3.007 s
[09/27/2023-09:18:41] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[09/27/2023-09:18:41] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[09/27/2023-09:18:41] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/27/2023-09:18:41] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_tf.onnx --best --useDLACore=0 --allowGPUFallback
[09/27/2023-09:18:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 996, GPU 5795 (MiB)

I also tried your code to modify my model to make it DLA-compatible, but it doesn't work.
How can I run my model entirely on the DLA, please?
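
To make the fallbacks in the log above easier to read, here is a minimal Python sketch that groups the `[DlaLayer]` and `[GpuLayer]` lines by device. It is not part of trtexec: the function name `split_layers_by_device` and the `SAMPLE_LOG` excerpt are my own illustrative choices, and it assumes the log keeps the same line format as shown above.

```python
import re

# Matches placement lines such as:
#   "... [I] [TRT] [GpuLayer] resnet50/avg_pool/Mean"
LAYER_RE = re.compile(r"\[(DlaLayer|GpuLayer)\]\s+(.*\S)")


def split_layers_by_device(log_text):
    """Group layer names from a trtexec build log by assigned device.

    Hypothetical helper: it only relies on the "[DlaLayer]" / "[GpuLayer]"
    tags that appear in the log, not on any TensorRT API.
    """
    placement = {"DLA": [], "GPU": []}
    for line in log_text.splitlines():
        match = LAYER_RE.search(line)
        if match:
            device = "DLA" if match.group(1) == "DlaLayer" else "GPU"
            placement[device].append(match.group(2))
    return placement


# Short excerpt copied from the log above, used only for illustration.
SAMPLE_LOG = """\
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on DLA ----------
[09/27/2023-09:18:13] [I] [TRT] [DlaLayer] {ForeignNode[resnet50/predictions/MatMul]}
[09/27/2023-09:18:13] [I] [TRT] ---------- Layers Running on GPU ----------
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/avg_pool/Mean
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] (Unnamed Layer* 139) [Shuffle]
[09/27/2023-09:18:13] [I] [TRT] [GpuLayer] resnet50/predictions/Softmax
"""

if __name__ == "__main__":
    result = split_layers_by_device(SAMPLE_LOG)
    for device, layers in result.items():
        print(f"{device}: {len(layers)} layer(s)")
        for name in layers:
            print(f"  {name}")
```

Running it on the full log instead of the excerpt (for example by reading the file saved from the trtexec output) should give the same two lists as the "Layers Running on DLA" and "Layers Running on GPU" sections.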

Thanks.