DLA performance

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson AGX Orin
• DeepStream Version 6.3
• JetPack Version (valid for Jetson only) JP 5.1.2
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type( questions, new requirements, bugs) question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

We are trying to run a YOLOv8-small model, trained on images of size 1400×512, with INT8 precision.

I set the parameters to generate a TRT engine that can run on the DLA. While the engine is built, I see messages that several layers are incompatible with the DLA. During inference I also see that GPU utilization drops and the DLA is active. Here are the numbers I measured with DLA+GPU and with GPU alone.

With DLA:

• 44 FPS
• Power consumed: ~20 W
• Frame latency: 215 ms on avg
• GIE latency: 83 ms
• GPU utilization: 65–74%

With GPU:

• 105 FPS
• Power consumed: ~20 W
• Frame latency: 125 ms on avg
• GIE latency: 43 ms on avg
• GPU utilization: close to 100%
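A quick efficiency comparison from the measured numbers above (power figures are approximate in both runs):

```python
# Frames-per-watt comparison using the measured numbers above.
# Since 1 W = 1 J/s, FPS divided by watts gives frames per joule.
runs = {
    "DLA+GPU": {"fps": 44, "watts": 20},
    "GPU only": {"fps": 105, "watts": 20},
}
for name, r in runs.items():
    print(f'{name}: {r["fps"] / r["watts"]:.2f} frames per joule')
```

At roughly equal power draw, the GPU-only run delivers more than twice the frames per joule, which matches the observation below that DLA offload is not saving energy per frame here.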

The power consumed is no lower than when I run DeepStream entirely on the GPU. Am I missing something? Is YOLOv8s a network that simply does not map well to the DLA?
Is there a difference in performance if the plan/engine file is generated by DeepStream versus trtexec?
Kindly help us with answers.
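For context, this is roughly how the DLA offload is enabled in our nvinfer config (a sketch; the file names and calibration paths below are placeholders, not our exact setup):

```ini
[property]
enable-dla=1
use-dla-core=0
network-mode=1            # 0=FP32, 1=INT8, 2=FP16
onnx-file=m1_1408.onnx
int8-calib-file=calib.table
model-engine-file=m1_1408_dla_int8.engine
```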

Hi @raghavendra.ramya ,
What batch size are you using?
Does your model support dynamic batching?

Can you refer to the commands below to benchmark your model on your device, and share the output of the last command (the one with "--dumpProfile")?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov8s_dynamicb.onnx --int8 --fp16 --best --minShapes=images:8x3x640x640 --optShapes=images:8x3x640x640 --maxShapes=images:8x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b8_int8.engine
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b8_int8.engine --useDLACore=1 --dumpProfile

I am currently using a batch size of 1.
When I run the engine-generation command you suggested, I get the following error:
/usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --minShapes=input:8x3x640x640 --optShapes=input:8x3x640x640 --maxShapes=input:8x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b8_int8.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --minShapes=input:8x3x640x640 --optShapes=input:8x3x640x640 --maxShapes=input:8x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b8_int8.engine
[08/11/2024-14:22:09] [I] === Model Options ===
[08/11/2024-14:22:09] [I] Format: ONNX
[08/11/2024-14:22:09] [I] Model: m1_1408.onnx
[08/11/2024-14:22:09] [I] Output:
[08/11/2024-14:22:09] [I] === Build Options ===
[08/11/2024-14:22:09] [I] Max batch: explicit batch
[08/11/2024-14:22:09] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/11/2024-14:22:09] [I] minTiming: 1
[08/11/2024-14:22:09] [I] avgTiming: 8
[08/11/2024-14:22:09] [I] Precision: FP32+FP16+INT8
[08/11/2024-14:22:09] [I] LayerPrecisions:
[08/11/2024-14:22:09] [I] Calibration: Dynamic
[08/11/2024-14:22:09] [I] Refit: Disabled
[08/11/2024-14:22:09] [I] Sparsity: Disabled
[08/11/2024-14:22:09] [I] Safe mode: Disabled
[08/11/2024-14:22:09] [I] DirectIO mode: Disabled
[08/11/2024-14:22:09] [I] Restricted mode: Disabled
[08/11/2024-14:22:09] [I] Build only: Disabled
[08/11/2024-14:22:09] [I] Save engine: ./yolov8s_dla_b8_int8.engine
[08/11/2024-14:22:09] [I] Load engine:
[08/11/2024-14:22:09] [I] Profiling verbosity: 0
[08/11/2024-14:22:09] [I] Tactic sources: Using default tactic sources
[08/11/2024-14:22:09] [I] timingCacheMode: local
[08/11/2024-14:22:09] [I] timingCacheFile:
[08/11/2024-14:22:09] [I] Heuristic: Disabled
[08/11/2024-14:22:09] [I] Preview Features: Use default preview flags.
[08/11/2024-14:22:09] [I] Input(s)s format: fp32:CHW
[08/11/2024-14:22:09] [I] Output(s)s format: fp32:CHW
[08/11/2024-14:22:09] [I] Input build shape: input=8x3x640x640+8x3x640x640+8x3x640x640
[08/11/2024-14:22:09] [I] Input calibration shapes: model
[08/11/2024-14:22:09] [I] === System Options ===
[08/11/2024-14:22:09] [I] Device: 0
[08/11/2024-14:22:09] [I] DLACore: 1(With GPU fallback)
[08/11/2024-14:22:09] [I] Plugins:
[08/11/2024-14:22:09] [I] === Inference Options ===
[08/11/2024-14:22:09] [I] Batch: Explicit
[08/11/2024-14:22:09] [I] Input inference shape: input=8x3x640x640
[08/11/2024-14:22:09] [I] Iterations: 10
[08/11/2024-14:22:09] [I] Duration: 3s (+ 200ms warm up)
[08/11/2024-14:22:09] [I] Sleep time: 0ms
[08/11/2024-14:22:09] [I] Idle time: 0ms
[08/11/2024-14:22:09] [I] Streams: 1
[08/11/2024-14:22:09] [I] ExposeDMA: Disabled
[08/11/2024-14:22:09] [I] Data transfers: Enabled
[08/11/2024-14:22:09] [I] Spin-wait: Disabled
[08/11/2024-14:22:09] [I] Multithreading: Disabled
[08/11/2024-14:22:09] [I] CUDA Graph: Disabled
[08/11/2024-14:22:09] [I] Separate profiling: Disabled
[08/11/2024-14:22:09] [I] Time Deserialize: Disabled
[08/11/2024-14:22:09] [I] Time Refit: Disabled
[08/11/2024-14:22:09] [I] NVTX verbosity: 0
[08/11/2024-14:22:09] [I] Persistent Cache Ratio: 0
[08/11/2024-14:22:09] [I] Inputs:
[08/11/2024-14:22:09] [I] === Reporting Options ===
[08/11/2024-14:22:09] [I] Verbose: Disabled
[08/11/2024-14:22:09] [I] Averages: 10 inferences
[08/11/2024-14:22:09] [I] Percentiles: 90,95,99
[08/11/2024-14:22:09] [I] Dump refittable layers:Disabled
[08/11/2024-14:22:09] [I] Dump output: Disabled
[08/11/2024-14:22:09] [I] Profile: Disabled
[08/11/2024-14:22:09] [I] Export timing to JSON file:
[08/11/2024-14:22:09] [I] Export output to JSON file:
[08/11/2024-14:22:09] [I] Export profile to JSON file:
[08/11/2024-14:22:09] [I]
[08/11/2024-14:22:09] [I] === Device Information ===
[08/11/2024-14:22:09] [I] Selected Device: Orin
[08/11/2024-14:22:09] [I] Compute Capability: 8.7
[08/11/2024-14:22:09] [I] SMs: 16
[08/11/2024-14:22:09] [I] Compute Clock Rate: 1.3 GHz
[08/11/2024-14:22:09] [I] Device Global Memory: 30592 MiB
[08/11/2024-14:22:09] [I] Shared Memory per SM: 164 KiB
[08/11/2024-14:22:09] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/11/2024-14:22:09] [I] Memory Clock Rate: 1.3 GHz
[08/11/2024-14:22:09] [I]
[08/11/2024-14:22:09] [I] TensorRT version: 8.5.2
[08/11/2024-14:22:10] [I] [TRT] [MemUsageChange] Init CUDA: CPU +220, GPU +0, now: CPU 249, GPU 5615 (MiB)
[08/11/2024-14:22:11] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +293, now: CPU 574, GPU 5928 (MiB)
[08/11/2024-14:22:11] [I] Start parsing network model
[08/11/2024-14:22:11] [I] [TRT] ----------------------------------------------------------------
[08/11/2024-14:22:11] [I] [TRT] Input filename: m1_1408.onnx
[08/11/2024-14:22:11] [I] [TRT] ONNX IR version: 0.0.8
[08/11/2024-14:22:11] [I] [TRT] Opset version: 16
[08/11/2024-14:22:11] [I] [TRT] Producer name: pytorch
[08/11/2024-14:22:11] [I] [TRT] Producer version: 2.2.0
[08/11/2024-14:22:11] [I] [TRT] Domain:
[08/11/2024-14:22:11] [I] [TRT] Model version: 0
[08/11/2024-14:22:11] [I] [TRT] Doc string:
[08/11/2024-14:22:11] [I] [TRT] ----------------------------------------------------------------
[08/11/2024-14:22:11] [W] [TRT] onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/11/2024-14:22:11] [W] [TRT] onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[08/11/2024-14:22:11] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[08/11/2024-14:22:11] [I] Finish parsing network model
[08/11/2024-14:22:11] [E] Static model does not take explicit shapes since the shape of inference tensors will be determined by the model itself
[08/11/2024-14:22:11] [E] Network And Config setup failed
[08/11/2024-14:22:11] [E] Building engine failed
[08/11/2024-14:22:11] [E] Failed to create engine from model or file.
[08/11/2024-14:22:11] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --minShapes=input:8x3x640x640 --optShapes=input:8x3x640x640 --maxShapes=input:8x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b8_int8.engine

I am guessing this means the model does not support dynamic batching?
Could you kindly let me know why we would want dynamic batching, and what we are trying to analyze with --dumpProfile?
Thank you for your time.

Hi @raghavendra.ramya ,
From the log, yes: your model uses a static batch.

With dynamic batch support you could benchmark the model at different batch sizes; a higher batch size may yield better throughput.

Since you are at batch=1, please use the commands below to build the engine and profile its runtime (with layer-wise timings), which will also show which layers run on the DLA and which fall back to the GPU.

$ /usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --minShapes=input:1x3x640x640 --optShapes=input:1x3x640x640 --maxShapes=input:1x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8.engine
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
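To make the per-layer numbers easier to digest, you can also add `--exportProfile=profile.json` to the second command and post-process the JSON. A minimal sketch follows; the JSON layout assumed here (a list whose per-layer entries carry `name` and `averageMs` fields, possibly alongside a summary entry) should be verified against your trtexec version:

```python
import json

def top_layers(path, n=10):
    """Return the n slowest layers from a trtexec --exportProfile JSON file."""
    with open(path) as f:
        entries = json.load(f)
    # Keep only per-layer records; the file may also contain a summary
    # entry (e.g. a bare {"count": ...}) without timing fields.
    layers = [e for e in entries if "name" in e and "averageMs" in e]
    return sorted(layers, key=lambda e: e["averageMs"], reverse=True)[:n]

# usage sketch:
#   for e in top_layers("profile.json"):
#       print(f'{e["averageMs"]:8.3f} ms  {e["name"]}')
```

The slowest entries are usually the layers that fell back to the GPU plus the DLA/GPU reformat copies between them, so this makes the cost of the fallbacks visible at a glance.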

/usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --minShapes=input:1x3x640x640 --optShapes=input:1x3x640x640 --maxShapes=input:1x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --minShapes=input:1x3x640x640 --optShapes=input:1x3x640x640 --maxShapes=input:1x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8.engine
[08/12/2024-11:59:09] [I] === Model Options ===
[08/12/2024-11:59:09] [I] Format: ONNX
[08/12/2024-11:59:09] [I] Model: m1_1408.onnx
[08/12/2024-11:59:09] [I] Output:
[08/12/2024-11:59:09] [I] === Build Options ===
[08/12/2024-11:59:09] [I] Max batch: explicit batch
[08/12/2024-11:59:09] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/12/2024-11:59:09] [I] minTiming: 1
[08/12/2024-11:59:09] [I] avgTiming: 8
[08/12/2024-11:59:09] [I] Precision: FP32+FP16+INT8
[08/12/2024-11:59:09] [I] LayerPrecisions:
[08/12/2024-11:59:09] [I] Calibration: Dynamic
[08/12/2024-11:59:09] [I] Refit: Disabled
[08/12/2024-11:59:09] [I] Sparsity: Disabled
[08/12/2024-11:59:09] [I] Safe mode: Disabled
[08/12/2024-11:59:09] [I] DirectIO mode: Disabled
[08/12/2024-11:59:09] [I] Restricted mode: Disabled
[08/12/2024-11:59:09] [I] Build only: Disabled
[08/12/2024-11:59:09] [I] Save engine: ./yolov8s_dla_b1_int8.engine
[08/12/2024-11:59:09] [I] Load engine:
[08/12/2024-11:59:09] [I] Profiling verbosity: 0
[08/12/2024-11:59:09] [I] Tactic sources: Using default tactic sources
[08/12/2024-11:59:09] [I] timingCacheMode: local
[08/12/2024-11:59:09] [I] timingCacheFile:
[08/12/2024-11:59:09] [I] Heuristic: Disabled
[08/12/2024-11:59:09] [I] Preview Features: Use default preview flags.
[08/12/2024-11:59:09] [I] Input(s)s format: fp32:CHW
[08/12/2024-11:59:09] [I] Output(s)s format: fp32:CHW
[08/12/2024-11:59:09] [I] Input build shape: input=1x3x640x640+1x3x640x640+1x3x640x640
[08/12/2024-11:59:09] [I] Input calibration shapes: model
[08/12/2024-11:59:09] [I] === System Options ===
[08/12/2024-11:59:09] [I] Device: 0
[08/12/2024-11:59:09] [I] DLACore: 1(With GPU fallback)
[08/12/2024-11:59:09] [I] Plugins:
[08/12/2024-11:59:09] [I] === Inference Options ===
[08/12/2024-11:59:09] [I] Batch: Explicit
[08/12/2024-11:59:09] [I] Input inference shape: input=1x3x640x640
[08/12/2024-11:59:09] [I] Iterations: 10
[08/12/2024-11:59:09] [I] Duration: 3s (+ 200ms warm up)
[08/12/2024-11:59:09] [I] Sleep time: 0ms
[08/12/2024-11:59:09] [I] Idle time: 0ms
[08/12/2024-11:59:09] [I] Streams: 1
[08/12/2024-11:59:09] [I] ExposeDMA: Disabled
[08/12/2024-11:59:09] [I] Data transfers: Enabled
[08/12/2024-11:59:09] [I] Spin-wait: Disabled
[08/12/2024-11:59:09] [I] Multithreading: Disabled
[08/12/2024-11:59:09] [I] CUDA Graph: Disabled
[08/12/2024-11:59:09] [I] Separate profiling: Disabled
[08/12/2024-11:59:09] [I] Time Deserialize: Disabled
[08/12/2024-11:59:09] [I] Time Refit: Disabled
[08/12/2024-11:59:09] [I] NVTX verbosity: 0
[08/12/2024-11:59:09] [I] Persistent Cache Ratio: 0
[08/12/2024-11:59:09] [I] Inputs:
[08/12/2024-11:59:09] [I] === Reporting Options ===
[08/12/2024-11:59:09] [I] Verbose: Disabled
[08/12/2024-11:59:09] [I] Averages: 10 inferences
[08/12/2024-11:59:09] [I] Percentiles: 90,95,99
[08/12/2024-11:59:09] [I] Dump refittable layers:Disabled
[08/12/2024-11:59:09] [I] Dump output: Disabled
[08/12/2024-11:59:09] [I] Profile: Disabled
[08/12/2024-11:59:09] [I] Export timing to JSON file:
[08/12/2024-11:59:09] [I] Export output to JSON file:
[08/12/2024-11:59:09] [I] Export profile to JSON file:
[08/12/2024-11:59:09] [I]
[08/12/2024-11:59:09] [I] === Device Information ===
[08/12/2024-11:59:09] [I] Selected Device: Orin
[08/12/2024-11:59:09] [I] Compute Capability: 8.7
[08/12/2024-11:59:09] [I] SMs: 16
[08/12/2024-11:59:09] [I] Compute Clock Rate: 1.3 GHz
[08/12/2024-11:59:09] [I] Device Global Memory: 30592 MiB
[08/12/2024-11:59:09] [I] Shared Memory per SM: 164 KiB
[08/12/2024-11:59:09] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/12/2024-11:59:09] [I] Memory Clock Rate: 1.3 GHz
[08/12/2024-11:59:09] [I]
[08/12/2024-11:59:09] [I] TensorRT version: 8.5.2
[08/12/2024-11:59:09] [I] [TRT] [MemUsageChange] Init CUDA: CPU +220, GPU +0, now: CPU 249, GPU 5508 (MiB)
[08/12/2024-11:59:11] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +297, now: CPU 574, GPU 5826 (MiB)
[08/12/2024-11:59:11] [I] Start parsing network model
[08/12/2024-11:59:11] [I] [TRT] ----------------------------------------------------------------
[08/12/2024-11:59:11] [I] [TRT] Input filename: m1_1408.onnx
[08/12/2024-11:59:11] [I] [TRT] ONNX IR version: 0.0.8
[08/12/2024-11:59:11] [I] [TRT] Opset version: 16
[08/12/2024-11:59:11] [I] [TRT] Producer name: pytorch
[08/12/2024-11:59:11] [I] [TRT] Producer version: 2.2.0
[08/12/2024-11:59:11] [I] [TRT] Domain:
[08/12/2024-11:59:11] [I] [TRT] Model version: 0
[08/12/2024-11:59:11] [I] [TRT] Doc string:
[08/12/2024-11:59:11] [I] [TRT] ----------------------------------------------------------------
[08/12/2024-11:59:11] [W] [TRT] onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/12/2024-11:59:11] [W] [TRT] onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[08/12/2024-11:59:11] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[08/12/2024-11:59:11] [I] Finish parsing network model
[08/12/2024-11:59:11] [E] Static model does not take explicit shapes since the shape of inference tensors will be determined by the model itself
[08/12/2024-11:59:11] [E] Network And Config setup failed
[08/12/2024-11:59:11] [E] Building engine failed
[08/12/2024-11:59:11] [E] Failed to create engine from model or file.
[08/12/2024-11:59:11] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --minShapes=input:1x3x640x640 --optShapes=input:1x3x640x640 --maxShapes=input:1x3x640x640 --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8.engine

I am getting this error. Is there something different we need to do during model creation/training?

Is there a way to measure how much of the DLA I am using, as a percentage, just like the CPU/GPU? Or am I thinking about the DLA completely incorrectly?

I removed the 'shapes' options from the command and I see that it generated an engine. With the shapes options, are you telling trtexec to generate an engine file that expects images of size 640×640 for inference?

Here is the log for reference if needed.
/usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8.engine
[08/12/2024-13:33:42] [I] === Model Options ===
[08/12/2024-13:33:42] [I] Format: ONNX
[08/12/2024-13:33:42] [I] Model: m1_1408.onnx
[08/12/2024-13:33:42] [I] Output:
[08/12/2024-13:33:42] [I] === Build Options ===
[08/12/2024-13:33:42] [I] Max batch: explicit batch
[08/12/2024-13:33:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/12/2024-13:33:42] [I] minTiming: 1
[08/12/2024-13:33:42] [I] avgTiming: 8
[08/12/2024-13:33:42] [I] Precision: FP32+FP16+INT8
[08/12/2024-13:33:42] [I] LayerPrecisions:
[08/12/2024-13:33:42] [I] Calibration: Dynamic
[08/12/2024-13:33:42] [I] Refit: Disabled
[08/12/2024-13:33:42] [I] Sparsity: Disabled
[08/12/2024-13:33:42] [I] Safe mode: Disabled
[08/12/2024-13:33:42] [I] DirectIO mode: Disabled
[08/12/2024-13:33:42] [I] Restricted mode: Disabled
[08/12/2024-13:33:42] [I] Build only: Disabled
[08/12/2024-13:33:42] [I] Save engine: ./yolov8s_dla_b1_int8.engine
[08/12/2024-13:33:42] [I] Load engine:
[08/12/2024-13:33:42] [I] Profiling verbosity: 0
[08/12/2024-13:33:42] [I] Tactic sources: Using default tactic sources
[08/12/2024-13:33:42] [I] timingCacheMode: local
[08/12/2024-13:33:42] [I] timingCacheFile:
[08/12/2024-13:33:42] [I] Heuristic: Disabled
[08/12/2024-13:33:42] [I] Preview Features: Use default preview flags.
[08/12/2024-13:33:42] [I] Input(s)s format: fp32:CHW
[08/12/2024-13:33:42] [I] Output(s)s format: fp32:CHW
[08/12/2024-13:33:42] [I] Input build shapes: model
[08/12/2024-13:33:42] [I] Input calibration shapes: model
[08/12/2024-13:33:42] [I] === System Options ===
[08/12/2024-13:33:42] [I] Device: 0
[08/12/2024-13:33:42] [I] DLACore: 1(With GPU fallback)
[08/12/2024-13:33:42] [I] Plugins:
[08/12/2024-13:33:42] [I] === Inference Options ===
[08/12/2024-13:33:42] [I] Batch: Explicit
[08/12/2024-13:33:42] [I] Input inference shapes: model
[08/12/2024-13:33:42] [I] Iterations: 10
[08/12/2024-13:33:42] [I] Duration: 3s (+ 200ms warm up)
[08/12/2024-13:33:42] [I] Sleep time: 0ms
[08/12/2024-13:33:42] [I] Idle time: 0ms
[08/12/2024-13:33:42] [I] Streams: 1
[08/12/2024-13:33:42] [I] ExposeDMA: Disabled
[08/12/2024-13:33:42] [I] Data transfers: Enabled
[08/12/2024-13:33:42] [I] Spin-wait: Disabled
[08/12/2024-13:33:42] [I] Multithreading: Disabled
[08/12/2024-13:33:42] [I] CUDA Graph: Disabled
[08/12/2024-13:33:42] [I] Separate profiling: Disabled
[08/12/2024-13:33:42] [I] Time Deserialize: Disabled
[08/12/2024-13:33:42] [I] Time Refit: Disabled
[08/12/2024-13:33:42] [I] NVTX verbosity: 0
[08/12/2024-13:33:42] [I] Persistent Cache Ratio: 0
[08/12/2024-13:33:42] [I] Inputs:
[08/12/2024-13:33:42] [I] === Reporting Options ===
[08/12/2024-13:33:42] [I] Verbose: Disabled
[08/12/2024-13:33:42] [I] Averages: 10 inferences
[08/12/2024-13:33:42] [I] Percentiles: 90,95,99
[08/12/2024-13:33:42] [I] Dump refittable layers:Disabled
[08/12/2024-13:33:42] [I] Dump output: Disabled
[08/12/2024-13:33:42] [I] Profile: Disabled
[08/12/2024-13:33:42] [I] Export timing to JSON file:
[08/12/2024-13:33:42] [I] Export output to JSON file:
[08/12/2024-13:33:42] [I] Export profile to JSON file:
[08/12/2024-13:33:42] [I]
[08/12/2024-13:33:42] [I] === Device Information ===
[08/12/2024-13:33:42] [I] Selected Device: Orin
[08/12/2024-13:33:42] [I] Compute Capability: 8.7
[08/12/2024-13:33:42] [I] SMs: 16
[08/12/2024-13:33:42] [I] Compute Clock Rate: 1.3 GHz
[08/12/2024-13:33:42] [I] Device Global Memory: 30592 MiB
[08/12/2024-13:33:42] [I] Shared Memory per SM: 164 KiB
[08/12/2024-13:33:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/12/2024-13:33:42] [I] Memory Clock Rate: 1.3 GHz
[08/12/2024-13:33:42] [I]
[08/12/2024-13:33:42] [I] TensorRT version: 8.5.2
[08/12/2024-13:33:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +220, GPU +0, now: CPU 249, GPU 5475 (MiB)
[08/12/2024-13:33:44] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +292, now: CPU 574, GPU 5788 (MiB)
[08/12/2024-13:33:44] [I] Start parsing network model
[08/12/2024-13:33:44] [I] [TRT] ----------------------------------------------------------------
[08/12/2024-13:33:44] [I] [TRT] Input filename: m1_1408.onnx
[08/12/2024-13:33:44] [I] [TRT] ONNX IR version: 0.0.8
[08/12/2024-13:33:44] [I] [TRT] Opset version: 16
[08/12/2024-13:33:44] [I] [TRT] Producer name: pytorch
[08/12/2024-13:33:44] [I] [TRT] Producer version: 2.2.0
[08/12/2024-13:33:44] [I] [TRT] Domain:
[08/12/2024-13:33:44] [I] [TRT] Model version: 0
[08/12/2024-13:33:44] [I] [TRT] Doc string:
[08/12/2024-13:33:44] [I] [TRT] ----------------------------------------------------------------
[08/12/2024-13:33:44] [W] [TRT] onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/12/2024-13:33:44] [W] [TRT] onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[08/12/2024-13:33:44] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[08/12/2024-13:33:44] [I] Finish parsing network model
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Reshape’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Reshape_1’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Reshape_2’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_3: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Concat_3’ (CONCATENATION): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_3_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Split: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Split’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Split_15: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Split_15’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Split_16: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Split_16’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Squeeze’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Squeeze_1’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Squeeze_2’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_9_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Expand: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Expand’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_10_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Expand_1: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Expand_1’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Unsqueeze’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Unsqueeze_1’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_4: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Concat_4’ (CONCATENATION): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Reshape_3’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_14_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 240) [Constant]’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 241) [Shuffle]’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/ConstantOfShape: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/ConstantOfShape’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 243) [Constant]’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 244) [Shuffle]’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_16_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Expand_2: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Expand_2’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_17_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Expand_3: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Expand_3’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Unsqueeze_2’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Unsqueeze_3’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_5: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Concat_5’ (CONCATENATION): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Reshape_4’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_21_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 255) [Constant]’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 256) [Shuffle]’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/ConstantOfShape_1: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/ConstantOfShape_1’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 258) [Constant]’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘(Unnamed Layer* 259) [Shuffle]’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_23_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Expand_4: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Expand_4’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Constant_24_output_0’ (CONSTANT): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Expand_5: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Expand_5’ (SLICE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Unsqueeze_4’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Unsqueeze_5’ (SHUFFLE): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_6: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer ‘/0/model.22/Concat_6’ (CONCATENATION): Unsupported on DLA. Switching this layer’s device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Reshape_5' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Constant_28_output_0' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 270) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 271) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/ConstantOfShape_2: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/ConstantOfShape_2' (SLICE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 273) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 274) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_7: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Concat_7' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_8: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Concat_8' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Transpose' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Transpose_1' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Split_1: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Split_1' (SLICE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Split_1_42: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Split_1_42' (SLICE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/dfl/Reshape' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/dfl/Reshape_1' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Unsqueeze_6' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Shape' (SHAPE): DLA only supports FP16 and Int8 precision type. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Constant_30_output_0' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Gather' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Constant_32_output_0' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Constant_33_output_0' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Div: DLA cores do not support DIV ElementWise operation.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Div' (ELEMENTWISE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Constant_34_output_0' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 298) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] (Unnamed Layer* 299) [Concatenation]: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 299) [Concatenation]' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 300) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 301) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 302) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 304) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 308) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 311) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Slice: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Slice' (SLICE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Constant_35_output_0' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 316) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] (Unnamed Layer* 317) [Concatenation]: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 317) [Concatenation]' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 318) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 319) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 320) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] (Unnamed Layer* 321) [Concatenation]: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 321) [Concatenation]' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 322) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 323) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 325) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 329) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 332) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 334) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 338) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 342) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Slice_1: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Slice_1' (SLICE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Constant_36_output_0' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 349) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Div_1: DLA cores do not support DIV ElementWise operation.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Div_1' (ELEMENTWISE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_9: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Concat_9' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '(Unnamed Layer* 353) [Shuffle]' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /0/model.22/Concat_10: DLA only supports concatenation on the C dimension.
[08/12/2024-13:33:44] [W] [TRT] Layer '/0/model.22/Concat_10' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/1/Transpose' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /1/Slice: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer '/1/Slice' (SLICE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] /1/Slice_1: DLA only supports slicing 4 dimensional tensors.
[08/12/2024-13:33:44] [W] [TRT] Layer '/1/Slice_1' (SLICE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/1/ReduceMax' (REDUCE): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/1/ArgMax' (TOPK): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Layer '/1/Cast' (CAST): Unsupported on DLA. Switching this layer's device type to GPU.
[08/12/2024-13:33:44] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/dfl/Softmax. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/dfl/conv/Conv. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Add. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] Batch size (11264) exceeds maximum allowed size for DLA: 4096
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Add. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Add_1. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] DLA only allows inputs of the same dimensions to Elementwise, but input shapes were: [2816,1] and [1,1]
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Add_1. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Add_2. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] DLA only allows inputs of the same dimensions to Elementwise, but input shapes were: [704,1] and [1,1]
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Add_2. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Sub. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Sub. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Add_4. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Add_4. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Add_5. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Add_5. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Sub_1. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Sub_1. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Mul_2. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Mul_2. Switching to GPU fallback.
[08/12/2024-13:33:45] [W] [TRT] Input tensor has less than 4 dimensions for /0/model.22/Sigmoid. At least one shuffle layer will be inserted which cannot run on DLA.
[08/12/2024-13:33:45] [W] [TRT] Dimension: 3 (14784) exceeds maximum allowed size for DLA: 8192
[08/12/2024-13:33:45] [W] [TRT] Validation failed for DLA layer: /0/model.22/Sigmoid. Switching to GPU fallback.
[08/12/2024-13:33:53] [I] [TRT] ---------- Layers Running on DLA ----------
[08/12/2024-13:33:53] [I] [TRT] [DlaLayer] {ForeignNode[/0/model.0/conv/Conv.../0/model.22/Concat_2]}
[08/12/2024-13:33:53] [I] [TRT] ---------- Layers Running on GPU ----------
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Reshape
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Reshape_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Reshape_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Reshape_1_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Reshape_2
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Reshape_2_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/dfl/Reshape + /0/model.22/dfl/Transpose
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SOFTMAX: /0/model.22/dfl/Softmax
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONVOLUTION: /0/model.22/dfl/conv/Conv
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_3_output_0
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_3_output_0_clone_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_3_output_0_clone_2
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_9_output_0
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_10_output_0
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: (Unnamed Layer* 240) [Constant] + (Unnamed Layer* 241) [Shuffle]
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_16_output_0
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_17_output_0
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: (Unnamed Layer* 255) [Constant] + (Unnamed Layer* 256) [Shuffle]
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_23_output_0
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: /0/model.22/Constant_24_output_0
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CONSTANT: (Unnamed Layer* 270) [Constant] + (Unnamed Layer* 271) [Shuffle]
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/Expand
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/Expand_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/Expand_2
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/Expand_3
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/Expand_4
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/Expand_5
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Unsqueeze_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Unsqueeze_1_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Unsqueeze
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/ConstantOfShape
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Unsqueeze_3
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Unsqueeze_3_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Unsqueeze_2
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/ConstantOfShape_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Unsqueeze_5
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Unsqueeze_5_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Unsqueeze_4
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SLICE: /0/model.22/ConstantOfShape_2
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Squeeze + (Unnamed Layer* 244) [Shuffle]_copy_input
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Squeeze + (Unnamed Layer* 244) [Shuffle]
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Squeeze_1 + (Unnamed Layer* 259) [Shuffle]_copy_input
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Squeeze_1 + (Unnamed Layer* 259) [Shuffle]
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Squeeze_2 + (Unnamed Layer* 274) [Shuffle]_copy_input
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Squeeze_2 + (Unnamed Layer* 274) [Shuffle]
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Unsqueeze_output_0 copy
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Unsqueeze_2_output_0 copy
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Unsqueeze_4_output_0 copy
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Reshape_3
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Reshape_3_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Reshape_4
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Reshape_4_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Reshape_5
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Reshape_5_copy_output
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Transpose + /0/model.22/Unsqueeze_6
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/dfl/Reshape_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] POINTWISE: PWN(/0/model.22/Add)
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] POINTWISE: PWN(/0/model.22/Add_1)
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] POINTWISE: PWN(/0/model.22/Add_2)
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] ELEMENTWISE: /0/model.22/Sub
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] POINTWISE: PWN(/0/model.22/Add_4)
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] ELEMENTWISE: /0/model.22/Sub_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] POINTWISE: PWN(/0/model.22/Constant_36_output_0 + (Unnamed Layer* 349) [Shuffle], PWN(/0/model.22/Add_5, /0/model.22/Div_1))
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /0/model.22/Div_1_output_0 copy
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /0/model.22/Transpose_1 + (Unnamed Layer* 353) [Shuffle]
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] ELEMENTWISE: /0/model.22/Mul_2
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] POINTWISE: PWN(/0/model.22/Sigmoid)
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] SHUFFLE: /1/Transpose
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /1/Slice
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] COPY: /1/Slice_1
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] REDUCE: /1/ReduceMax
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] TOPK: /1/ArgMax
[08/12/2024-13:33:53] [I] [TRT] [GpuLayer] CAST: /1/Cast
[08/12/2024-13:33:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +756, now: CPU 1153, GPU 6667 (MiB)
[08/12/2024-13:33:54] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +83, GPU +123, now: CPU 1236, GPU 6790 (MiB)
[08/12/2024-13:33:54] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.

[08/12/2024-13:37:20] [W] [TRT] No implementation of layer /0/model.22/Mul_2 obeys the requested constraints. I.e. no conforming implementation was found for requested layer computation precision and output precision. Using fastest implementation instead.
[08/12/2024-13:37:48] [W] [TRT] No implementation of layer PWN(/0/model.22/Sigmoid) obeys the requested constraints. I.e. no conforming implementation was found for requested layer computation precision and output precision. Using fastest implementation instead.
[08/12/2024-13:37:48] [I] [TRT] Total Activation Memory: 32096583680
[08/12/2024-13:37:48] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[08/12/2024-13:37:50] [I] [TRT] Total Host Persistent Memory: 4944
[08/12/2024-13:37:50] [I] [TRT] Total Device Persistent Memory: 0
[08/12/2024-13:37:50] [I] [TRT] Total Scratch Memory: 473088
[08/12/2024-13:37:50] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 12 MiB, GPU 238 MiB
[08/12/2024-13:37:50] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 64 steps to complete.
[08/12/2024-13:37:50] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 8.22859ms to assign 20 blocks to 64 nodes requiring 7042048 bytes.
[08/12/2024-13:37:50] [I] [TRT] Total Activation Memory: 7042048
[08/12/2024-13:37:50] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +12, GPU +4, now: CPU 12, GPU 4 (MiB)
[08/12/2024-13:37:50] [I] Engine built in 247.673 sec.
[08/12/2024-13:37:50] [I] [TRT] Loaded engine size: 12 MiB
[08/12/2024-13:37:50] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +12, GPU +0, now: CPU 12, GPU 0 (MiB)
[08/12/2024-13:37:50] [I] Engine deserialized in 0.0105551 sec.
[08/12/2024-13:37:50] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +6, now: CPU 12, GPU 6 (MiB)
[08/12/2024-13:37:50] [I] Setting persistentCacheLimit to 0 bytes.
[08/12/2024-13:37:50] [I] Using random values for input input
[08/12/2024-13:37:50] [I] Created input binding for input with dimensions 1x3x512x1408
[08/12/2024-13:37:50] [I] Using random values for output boxes
[08/12/2024-13:37:50] [I] Created output binding for boxes with dimensions 1x14784x4
[08/12/2024-13:37:50] [I] Using random values for output scores
[08/12/2024-13:37:50] [I] Created output binding for scores with dimensions 1x14784x1
[08/12/2024-13:37:50] [I] Using random values for output classes
[08/12/2024-13:37:50] [I] Created output binding for classes with dimensions 1x14784x1
[08/12/2024-13:37:50] [I] Starting inference
[08/12/2024-13:37:54] [I] Warmup completed 10 queries over 200 ms
[08/12/2024-13:37:54] [I] Timing trace has 146 queries over 3.07009 s
[08/12/2024-13:37:54] [I]
[08/12/2024-13:37:54] [I] === Trace details ===
[08/12/2024-13:37:54] [I] Trace averages of 10 runs:
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8657 ms - Host latency: 21.1555 ms (enqueue 0.438264 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.869 ms - Host latency: 21.1609 ms (enqueue 0.423022 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.9894 ms - Host latency: 21.2852 ms (enqueue 0.434546 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8723 ms - Host latency: 21.1626 ms (enqueue 0.413574 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8717 ms - Host latency: 21.1629 ms (enqueue 0.418066 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.9067 ms - Host latency: 21.1983 ms (enqueue 0.433289 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8713 ms - Host latency: 21.1621 ms (enqueue 0.418494 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8685 ms - Host latency: 21.159 ms (enqueue 0.422376 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8983 ms - Host latency: 21.1898 ms (enqueue 0.430933 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8642 ms - Host latency: 21.1535 ms (enqueue 0.412622 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8666 ms - Host latency: 21.1576 ms (enqueue 0.416284 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.889 ms - Host latency: 21.1797 ms (enqueue 0.436084 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8679 ms - Host latency: 21.1585 ms (enqueue 0.4177 ms)
[08/12/2024-13:37:54] [I] Average on 10 runs - GPU latency: 20.8716 ms - Host latency: 21.1619 ms (enqueue 0.417725 ms)
[08/12/2024-13:37:54] [I]
[08/12/2024-13:37:54] [I] === Performance summary ===
[08/12/2024-13:37:54] [I] Throughput: 47.5555 qps
[08/12/2024-13:37:54] [I] Latency: min = 21.1366 ms, max = 22.2006 ms, mean = 21.1756 ms, median = 21.1614 ms, percentile(90%) = 21.1743 ms, percentile(95%) = 21.3097 ms, percentile(99%) = 21.3342 ms
[08/12/2024-13:37:54] [I] Enqueue Time: min = 0.404785 ms, max = 0.547607 ms, mean = 0.425289 ms, median = 0.417664 ms, percentile(90%) = 0.451172 ms, percentile(95%) = 0.476929 ms, percentile(99%) = 0.539062 ms
[08/12/2024-13:37:54] [I] H2D Latency: min = 0.261719 ms, max = 0.319092 ms, mean = 0.265462 ms, median = 0.264687 ms, percentile(90%) = 0.26709 ms, percentile(95%) = 0.268311 ms, percentile(99%) = 0.287109 ms
[08/12/2024-13:37:54] [I] GPU Compute Time: min = 20.8446 ms, max = 21.9099 ms, mean = 20.8844 ms, median = 20.8706 ms, percentile(90%) = 20.8811 ms, percentile(95%) = 21.0193 ms, percentile(99%) = 21.0426 ms
[08/12/2024-13:37:54] [I] D2H Latency: min = 0.0126953 ms, max = 0.0283203 ms, mean = 0.0256893 ms, median = 0.0256348 ms, percentile(90%) = 0.0271301 ms, percentile(95%) = 0.0275269 ms, percentile(99%) = 0.0282593 ms
[08/12/2024-13:37:54] [I] Total Host Walltime: 3.07009 s
[08/12/2024-13:37:54] [I] Total GPU Compute Time: 3.04912 s
[08/12/2024-13:37:54] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/12/2024-13:37:54] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=m1_1408.onnx --int8 --fp16 --best --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8.engine
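As a side note (not part of the original logs): the long run of fallback warnings during the build can be tallied automatically, which makes it easier to see that the layers kept off the DLA are almost entirely SHUFFLE, CONSTANT, SLICE, CONCATENATION, and similar ops from the YOLOv8 detection head, while the single DLA ForeignNode covers only the backbone convolutions. A minimal sketch, assuming the build log was saved to a file (the filename is just an example):

```python
# Hypothetical helper (not from this thread): count the layer types that
# TensorRT reports as "Unsupported on DLA" in a trtexec build log captured
# with, e.g.:  trtexec ... 2>&1 | tee build.log
import re
from collections import Counter

# Matches the "(SHUFFLE): Unsupported on DLA" part of each warning line,
# so it works regardless of how the surrounding quotes were rendered.
FALLBACK_PAT = re.compile(r"\((\w+)\): Unsupported on DLA")

def count_dla_fallbacks(log_lines):
    """Tally fallback warnings per layer type (SHUFFLE, CONSTANT, SLICE, ...)."""
    counts = Counter()
    for line in log_lines:
        match = FALLBACK_PAT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    # Small inline demo; in practice pass open("build.log") instead.
    sample = [
        "[W] [TRT] Layer '/0/model.22/Reshape_5' (SHUFFLE): Unsupported on DLA.",
        "[W] [TRT] Layer '/0/model.22/Concat_7' (CONCATENATION): Unsupported on DLA.",
    ]
    for layer_type, n in count_dla_fallbacks(sample).most_common():
        print(layer_type, n)
```

A tally like this only summarizes the warnings; the per-layer timings from `--dumpProfile` below show where the time actually goes.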


Then I ran the following command to load the engine and dump the per-layer profile:

/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
[08/12/2024-14:04:42] [I] === Model Options ===
[08/12/2024-14:04:42] [I] Format: *
[08/12/2024-14:04:42] [I] Model:
[08/12/2024-14:04:42] [I] Output:
[08/12/2024-14:04:42] [I] === Build Options ===
[08/12/2024-14:04:42] [I] Max batch: 1
[08/12/2024-14:04:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/12/2024-14:04:42] [I] minTiming: 1
[08/12/2024-14:04:42] [I] avgTiming: 8
[08/12/2024-14:04:42] [I] Precision: FP32
[08/12/2024-14:04:42] [I] LayerPrecisions:
[08/12/2024-14:04:42] [I] Calibration:
[08/12/2024-14:04:42] [I] Refit: Disabled
[08/12/2024-14:04:42] [I] Sparsity: Disabled
[08/12/2024-14:04:42] [I] Safe mode: Disabled
[08/12/2024-14:04:42] [I] DirectIO mode: Disabled
[08/12/2024-14:04:42] [I] Restricted mode: Disabled
[08/12/2024-14:04:42] [I] Build only: Disabled
[08/12/2024-14:04:42] [I] Save engine:
[08/12/2024-14:04:42] [I] Load engine: yolov8s_dla_b1_int8.engine
[08/12/2024-14:04:42] [I] Profiling verbosity: 0
[08/12/2024-14:04:42] [I] Tactic sources: Using default tactic sources
[08/12/2024-14:04:42] [I] timingCacheMode: local
[08/12/2024-14:04:42] [I] timingCacheFile:
[08/12/2024-14:04:42] [I] Heuristic: Disabled
[08/12/2024-14:04:42] [I] Preview Features: Use default preview flags.
[08/12/2024-14:04:42] [I] Input(s)s format: fp32:CHW
[08/12/2024-14:04:42] [I] Output(s)s format: fp32:CHW
[08/12/2024-14:04:42] [I] Input build shapes: model
[08/12/2024-14:04:42] [I] Input calibration shapes: model
[08/12/2024-14:04:42] [I] === System Options ===
[08/12/2024-14:04:42] [I] Device: 0
[08/12/2024-14:04:42] [I] DLACore: 1
[08/12/2024-14:04:42] [I] Plugins:
[08/12/2024-14:04:42] [I] === Inference Options ===
[08/12/2024-14:04:42] [I] Batch: 1
[08/12/2024-14:04:42] [I] Input inference shapes: model
[08/12/2024-14:04:42] [I] Iterations: 10
[08/12/2024-14:04:42] [I] Duration: 3s (+ 200ms warm up)
[08/12/2024-14:04:42] [I] Sleep time: 0ms
[08/12/2024-14:04:42] [I] Idle time: 0ms
[08/12/2024-14:04:42] [I] Streams: 1
[08/12/2024-14:04:42] [I] ExposeDMA: Disabled
[08/12/2024-14:04:42] [I] Data transfers: Enabled
[08/12/2024-14:04:42] [I] Spin-wait: Disabled
[08/12/2024-14:04:42] [I] Multithreading: Disabled
[08/12/2024-14:04:42] [I] CUDA Graph: Disabled
[08/12/2024-14:04:42] [I] Separate profiling: Disabled
[08/12/2024-14:04:42] [I] Time Deserialize: Disabled
[08/12/2024-14:04:42] [I] Time Refit: Disabled
[08/12/2024-14:04:42] [I] NVTX verbosity: 0
[08/12/2024-14:04:42] [I] Persistent Cache Ratio: 0
[08/12/2024-14:04:42] [I] Inputs:
[08/12/2024-14:04:42] [I] === Reporting Options ===
[08/12/2024-14:04:42] [I] Verbose: Disabled
[08/12/2024-14:04:42] [I] Averages: 10 inferences
[08/12/2024-14:04:42] [I] Percentiles: 90,95,99
[08/12/2024-14:04:42] [I] Dump refittable layers:Disabled
[08/12/2024-14:04:42] [I] Dump output: Disabled
[08/12/2024-14:04:42] [I] Profile: Enabled
[08/12/2024-14:04:42] [I] Export timing to JSON file:
[08/12/2024-14:04:42] [I] Export output to JSON file:
[08/12/2024-14:04:42] [I] Export profile to JSON file:
[08/12/2024-14:04:42] [I]
[08/12/2024-14:04:42] [I] === Device Information ===
[08/12/2024-14:04:42] [I] Selected Device: Orin
[08/12/2024-14:04:42] [I] Compute Capability: 8.7
[08/12/2024-14:04:42] [I] SMs: 16
[08/12/2024-14:04:42] [I] Compute Clock Rate: 1.3 GHz
[08/12/2024-14:04:42] [I] Device Global Memory: 30592 MiB
[08/12/2024-14:04:42] [I] Shared Memory per SM: 164 KiB
[08/12/2024-14:04:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/12/2024-14:04:42] [I] Memory Clock Rate: 1.3 GHz
[08/12/2024-14:04:42] [I]
[08/12/2024-14:04:42] [I] TensorRT version: 8.5.2
[08/12/2024-14:04:42] [I] Engine loaded in 0.0073156 sec.
[08/12/2024-14:04:43] [I] [TRT] Loaded engine size: 12 MiB
[08/12/2024-14:04:43] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +12, GPU +0, now: CPU 12, GPU 0 (MiB)
[08/12/2024-14:04:43] [I] Engine deserialized in 0.426043 sec.
[08/12/2024-14:04:43] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +6, now: CPU 12, GPU 6 (MiB)
[08/12/2024-14:04:43] [I] Setting persistentCacheLimit to 0 bytes.
[08/12/2024-14:04:43] [I] Using random values for input input
[08/12/2024-14:04:43] [I] Created input binding for input with dimensions 1x3x512x1408
[08/12/2024-14:04:43] [I] Using random values for output boxes
[08/12/2024-14:04:43] [I] Created output binding for boxes with dimensions 1x14784x4
[08/12/2024-14:04:43] [I] Using random values for output scores
[08/12/2024-14:04:43] [I] Created output binding for scores with dimensions 1x14784x1
[08/12/2024-14:04:43] [I] Using random values for output classes
[08/12/2024-14:04:43] [I] Created output binding for classes with dimensions 1x14784x1
[08/12/2024-14:04:43] [I] Starting inference
[08/12/2024-14:04:46] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[08/12/2024-14:04:46] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[08/12/2024-14:04:46] [I]
[08/12/2024-14:04:46] [I] === Profile (153 iterations ) ===
[08/12/2024-14:04:46] [I] Layer Time (ms) Avg. Time (ms) Median Time (ms) Time %
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to {ForeignNode[/0/model.0/conv/Conv.../0/model.22/Concat_2]} 16.65 0.1089 0.1068 0.5
[08/12/2024-14:04:46] [I] {ForeignNode[/0/model.0/conv/Conv.../0/model.22/Concat_2]} 3094.70 20.2268 20.2066 96.8
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to /0/model.22/Reshape 8.92 0.0583 0.0582 0.3
[08/12/2024-14:04:46] [I] /0/model.22/Reshape_copy_output 1.91 0.0125 0.0124 0.1
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to /0/model.22/Reshape_1 2.99 0.0196 0.0196 0.1
[08/12/2024-14:04:46] [I] /0/model.22/Reshape_1_copy_output 1.01 0.0066 0.0067 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to /0/model.22/Reshape_2 1.41 0.0092 0.0092 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Reshape_2_copy_output 0.81 0.0053 0.0053 0.0
[08/12/2024-14:04:46] [I] /0/model.22/dfl/Reshape + /0/model.22/dfl/Transpose 7.90 0.0516 0.0516 0.2
[08/12/2024-14:04:46] [I] /0/model.22/dfl/Softmax 4.09 0.0267 0.0267 0.1
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to /0/model.22/dfl/conv/Conv 8.62 0.0564 0.0564 0.3
[08/12/2024-14:04:46] [I] /0/model.22/dfl/conv/Conv 4.67 0.0305 0.0306 0.1
[08/12/2024-14:04:46] [I] /0/model.22/Expand 0.98 0.0064 0.0064 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Expand_1 0.89 0.0058 0.0059 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Expand_2 0.84 0.0055 0.0055 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Expand_3 0.90 0.0059 0.0060 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Expand_4 0.84 0.0055 0.0055 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Expand_5 0.82 0.0053 0.0054 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Unsqueeze_1_copy_output 0.94 0.0062 0.0062 0.0
[08/12/2024-14:04:46] [I] /0/model.22/ConstantOfShape 0.87 0.0057 0.0057 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Unsqueeze_3_copy_output 0.81 0.0053 0.0053 0.0
[08/12/2024-14:04:46] [I] /0/model.22/ConstantOfShape_1 0.80 0.0053 0.0052 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Unsqueeze_5_copy_output 0.79 0.0051 0.0051 0.0
[08/12/2024-14:04:46] [I] /0/model.22/ConstantOfShape_2 0.73 0.0047 0.0048 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Squeeze + (Unnamed Layer* 244) [Shuffle]_copy_input 0.84 0.0055 0.0055 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Squeeze_1 + (Unnamed Layer* 259) [Shuffle]_copy_input 0.71 0.0047 0.0046 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Squeeze_2 + (Unnamed Layer* 274) [Shuffle]_copy_input 0.70 0.0046 0.0046 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Unsqueeze_output_0 copy 0.93 0.0061 0.0060 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Unsqueeze_2_output_0 copy 0.88 0.0058 0.0057 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Unsqueeze_4_output_0 copy 0.77 0.0050 0.0050 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Reshape_3_copy_output 0.82 0.0053 0.0053 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Reshape_4_copy_output 0.71 0.0047 0.0046 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Reshape_5_copy_output 0.71 0.0047 0.0046 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Transpose + /0/model.22/Unsqueeze_6 0.97 0.0064 0.0064 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Output Tensor 0 to /0/model.22/Transpose + /0/model.22/Unsqueeze_6 1.02 0.0067 0.0067 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to /0/model.22/dfl/Reshape_1 1.31 0.0086 0.0086 0.0
[08/12/2024-14:04:46] [I] /0/model.22/dfl/Reshape_1 1.10 0.0072 0.0072 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to PWN(/0/model.22/Add) 0.94 0.0061 0.0061 0.0
[08/12/2024-14:04:46] [I] PWN(/0/model.22/Add) 0.92 0.0060 0.0060 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to PWN(/0/model.22/Add_1) 0.78 0.0051 0.0051 0.0
[08/12/2024-14:04:46] [I] PWN(/0/model.22/Add_1) 0.85 0.0055 0.0055 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 0 to PWN(/0/model.22/Add_2) 0.77 0.0050 0.0050 0.0
[08/12/2024-14:04:46] [I] PWN(/0/model.22/Add_2) 0.83 0.0054 0.0055 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Sub 1.28 0.0084 0.0083 0.0
[08/12/2024-14:04:46] [I] PWN(/0/model.22/Add_4) 1.20 0.0078 0.0077 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Sub_1 1.11 0.0072 0.0073 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Output Tensor 0 to /0/model.22/Sub_1 1.08 0.0070 0.0070 0.0
[08/12/2024-14:04:46] [I] PWN(/0/model.22/Constant_36_output_0 + (Unnamed Layer* 349) [Shuffle], PWN(/0/model.22/Add_5, /0/model.22/Div_1)) 1.76 0.0115 0.0115 0.1
[08/12/2024-14:04:46] [I] /0/model.22/Div_1_output_0 copy 1.07 0.0070 0.0070 0.0
[08/12/2024-14:04:46] [I] Reformatting CopyNode for Input Tensor 1 to /0/model.22/Mul_2 0.83 0.0055 0.0054 0.0
[08/12/2024-14:04:46] [I] /0/model.22/Mul_2 1.08 0.0071 0.0070 0.0
[08/12/2024-14:04:46] [I] PWN(/0/model.22/Sigmoid) 0.95 0.0062 0.0061 0.0
[08/12/2024-14:04:46] [I] /1/Transpose 1.23 0.0080 0.0080 0.0
[08/12/2024-14:04:46] [I] /1/Slice 0.94 0.0061 0.0061 0.0
[08/12/2024-14:04:46] [I] /1/Slice_1 0.81 0.0053 0.0053 0.0
[08/12/2024-14:04:46] [I] /1/ReduceMax 1.06 0.0070 0.0069 0.0
[08/12/2024-14:04:46] [I] /1/ArgMax 1.11 0.0072 0.0072 0.0
[08/12/2024-14:04:46] [I] /1/Cast 0.75 0.0049 0.0049 0.0
[08/12/2024-14:04:46] [I] Total 3196.70 20.8935 20.8717 100.0
[08/12/2024-14:04:46] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile

yolov8s_1400_512_bs1.onnx (42.9 MB)

1. Did you boost the CPU/GPU/EMC with the commands I shared in my first comment?
2. I exported a 1408x512-resolution YOLOv8s ONNX (attached). With my instructions above, we can get 82 FPS (= 1000 ms / 12.1155 ms) with bs=1.

I used this command to generate the engine file:
/usr/src/tensorrt/bin/trtexec --onnx=yolov8s_1400_512_bs1.onnx --int8 --fp16 --best --useDLACore=1 --allowGPUFallback --saveEngine=./yolov8s_dla_b1_int8_nvidia.engine

Then ran trtexec with --dumpProfile:
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8_nvidia.engine --useDLACore=1 --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8_nvidia.engine --useDLACore=1 --dumpProfile
[08/13/2024-14:50:11] [I] === Model Options ===
[08/13/2024-14:50:11] [I] Format: *
[08/13/2024-14:50:11] [I] Model:
[08/13/2024-14:50:11] [I] Output:
[08/13/2024-14:50:11] [I] === Build Options ===
[08/13/2024-14:50:11] [I] Max batch: 1
[08/13/2024-14:50:11] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/13/2024-14:50:11] [I] minTiming: 1
[08/13/2024-14:50:11] [I] avgTiming: 8
[08/13/2024-14:50:11] [I] Precision: FP32
[08/13/2024-14:50:11] [I] LayerPrecisions:
[08/13/2024-14:50:11] [I] Calibration:
[08/13/2024-14:50:11] [I] Refit: Disabled
[08/13/2024-14:50:11] [I] Sparsity: Disabled
[08/13/2024-14:50:11] [I] Safe mode: Disabled
[08/13/2024-14:50:11] [I] DirectIO mode: Disabled
[08/13/2024-14:50:11] [I] Restricted mode: Disabled
[08/13/2024-14:50:11] [I] Build only: Disabled
[08/13/2024-14:50:11] [I] Save engine:
[08/13/2024-14:50:11] [I] Load engine: yolov8s_dla_b1_int8_nvidia.engine
[08/13/2024-14:50:11] [I] Profiling verbosity: 0
[08/13/2024-14:50:11] [I] Tactic sources: Using default tactic sources
[08/13/2024-14:50:11] [I] timingCacheMode: local
[08/13/2024-14:50:11] [I] timingCacheFile:
[08/13/2024-14:50:11] [I] Heuristic: Disabled
[08/13/2024-14:50:11] [I] Preview Features: Use default preview flags.
[08/13/2024-14:50:11] [I] Input(s)s format: fp32:CHW
[08/13/2024-14:50:11] [I] Output(s)s format: fp32:CHW
[08/13/2024-14:50:11] [I] Input build shapes: model
[08/13/2024-14:50:11] [I] Input calibration shapes: model
[08/13/2024-14:50:11] [I] === System Options ===
[08/13/2024-14:50:11] [I] Device: 0
[08/13/2024-14:50:11] [I] DLACore: 1
[08/13/2024-14:50:11] [I] Plugins:
[08/13/2024-14:50:11] [I] === Inference Options ===
[08/13/2024-14:50:11] [I] Batch: 1
[08/13/2024-14:50:11] [I] Input inference shapes: model
[08/13/2024-14:50:11] [I] Iterations: 10
[08/13/2024-14:50:11] [I] Duration: 3s (+ 200ms warm up)
[08/13/2024-14:50:11] [I] Sleep time: 0ms
[08/13/2024-14:50:11] [I] Idle time: 0ms
[08/13/2024-14:50:11] [I] Streams: 1
[08/13/2024-14:50:11] [I] ExposeDMA: Disabled
[08/13/2024-14:50:11] [I] Data transfers: Enabled
[08/13/2024-14:50:11] [I] Spin-wait: Disabled
[08/13/2024-14:50:11] [I] Multithreading: Disabled
[08/13/2024-14:50:11] [I] CUDA Graph: Disabled
[08/13/2024-14:50:11] [I] Separate profiling: Disabled
[08/13/2024-14:50:11] [I] Time Deserialize: Disabled
[08/13/2024-14:50:11] [I] Time Refit: Disabled
[08/13/2024-14:50:11] [I] NVTX verbosity: 0
[08/13/2024-14:50:11] [I] Persistent Cache Ratio: 0
[08/13/2024-14:50:11] [I] Inputs:
[08/13/2024-14:50:11] [I] === Reporting Options ===
[08/13/2024-14:50:11] [I] Verbose: Disabled
[08/13/2024-14:50:11] [I] Averages: 10 inferences
[08/13/2024-14:50:11] [I] Percentiles: 90,95,99
[08/13/2024-14:50:11] [I] Dump refittable layers:Disabled
[08/13/2024-14:50:11] [I] Dump output: Disabled
[08/13/2024-14:50:11] [I] Profile: Enabled
[08/13/2024-14:50:11] [I] Export timing to JSON file:
[08/13/2024-14:50:11] [I] Export output to JSON file:
[08/13/2024-14:50:11] [I] Export profile to JSON file:
[08/13/2024-14:50:11] [I]
[08/13/2024-14:50:11] [I] === Device Information ===
[08/13/2024-14:50:11] [I] Selected Device: Orin
[08/13/2024-14:50:11] [I] Compute Capability: 8.7
[08/13/2024-14:50:11] [I] SMs: 16
[08/13/2024-14:50:11] [I] Compute Clock Rate: 1.3 GHz
[08/13/2024-14:50:11] [I] Device Global Memory: 30592 MiB
[08/13/2024-14:50:11] [I] Shared Memory per SM: 164 KiB
[08/13/2024-14:50:11] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/13/2024-14:50:11] [I] Memory Clock Rate: 1.3 GHz
[08/13/2024-14:50:11] [I]
[08/13/2024-14:50:11] [I] TensorRT version: 8.5.2
[08/13/2024-14:50:11] [I] Engine loaded in 0.00809728 sec.
[08/13/2024-14:50:12] [I] [TRT] Loaded engine size: 12 MiB
[08/13/2024-14:50:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +12, GPU +0, now: CPU 12, GPU 0 (MiB)
[08/13/2024-14:50:12] [I] Engine deserialized in 0.427923 sec.
[08/13/2024-14:50:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +12, now: CPU 12, GPU 12 (MiB)
[08/13/2024-14:50:12] [I] Setting persistentCacheLimit to 0 bytes.
[08/13/2024-14:50:12] [I] Using random values for input images
[08/13/2024-14:50:12] [I] Created input binding for images with dimensions 1x3x1408x512
[08/13/2024-14:50:12] [I] Using random values for output output0
[08/13/2024-14:50:12] [I] Created output binding for output0 with dimensions 1x84x14784
[08/13/2024-14:50:12] [I] Starting inference
[08/13/2024-14:50:15] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[08/13/2024-14:50:15] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[08/13/2024-14:50:15] [I]
[08/13/2024-14:50:15] [I] === Profile (151 iterations ) ===
[08/13/2024-14:50:15] [I] Layer Time (ms) Avg. Time (ms) Median Time (ms) Time %
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 16.49 0.1092 0.1071 0.5
[08/13/2024-14:50:15] [I] {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 3112.14 20.6102 20.5851 97.0
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Reshape 12.23 0.0810 0.0810 0.4
[08/13/2024-14:50:15] [I] /model.22/Reshape_copy_output 8.57 0.0568 0.0567 0.3
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Reshape_1 4.21 0.0279 0.0279 0.1
[08/13/2024-14:50:15] [I] /model.22/Reshape_1_copy_output 1.93 0.0128 0.0128 0.1
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Reshape_2 1.95 0.0129 0.0129 0.1
[08/13/2024-14:50:15] [I] /model.22/Reshape_2_copy_output 0.95 0.0063 0.0063 0.0
[08/13/2024-14:50:15] [I] /model.22/dfl/Reshape + /model.22/dfl/Transpose 8.26 0.0547 0.0547 0.3
[08/13/2024-14:50:15] [I] /model.22/dfl/Softmax 4.00 0.0265 0.0264 0.1
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/conv/Conv 8.52 0.0564 0.0564 0.3
[08/13/2024-14:50:15] [I] /model.22/dfl/conv/Conv 4.51 0.0299 0.0299 0.1
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/Reshape_1 1.48 0.0098 0.0098 0.0
[08/13/2024-14:50:15] [I] /model.22/dfl/Reshape_1 1.09 0.0072 0.0072 0.0
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Sub 1.21 0.0080 0.0080 0.0
[08/13/2024-14:50:15] [I] /model.22/Sub 1.24 0.0082 0.0082 0.0
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Input Tensor 0 to PWN(/model.22/Add_1) 0.90 0.0059 0.0060 0.0
[08/13/2024-14:50:15] [I] PWN(/model.22/Add_1) 1.12 0.0074 0.0072 0.0
[08/13/2024-14:50:15] [I] /model.22/Sub_1 1.07 0.0071 0.0071 0.0
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Output Tensor 0 to /model.22/Sub_1 1.10 0.0073 0.0075 0.0
[08/13/2024-14:50:15] [I] PWN(/model.22/Constant_11_output_0 + (Unnamed Layer* 294) [Shuffle], PWN(/model.22/Add_2, /model.22/Div_1)) 1.47 0.0098 0.0098 0.0
[08/13/2024-14:50:15] [I] /model.22/Div_1_output_0 copy 1.00 0.0066 0.0067 0.0
[08/13/2024-14:50:15] [I] /model.22/Mul_2 1.30 0.0086 0.0085 0.0
[08/13/2024-14:50:15] [I] PWN(/model.22/Sigmoid) 4.38 0.0290 0.0290 0.1
[08/13/2024-14:50:15] [I] Reformatting CopyNode for Output Tensor 0 to PWN(/model.22/Sigmoid) 6.44 0.0427 0.0426 0.2
[08/13/2024-14:50:15] [I] Total 3207.56 21.2421 21.2149 100.0
[08/13/2024-14:50:15] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8_nvidia.engine --useDLACore=1 --dumpProfile

I am getting 47 FPS on my Orin Dev kit, with JP 5.1.2.
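For reference, the FPS figures quoted in this thread follow directly from the profile's average total time per iteration (throughput ≈ batch × 1000 / avg time in ms). A minimal sketch, using the latency values from the profiles above:

```python
def fps_from_avg_latency_ms(avg_ms: float, batch: int = 1) -> float:
    """Throughput in frames/s from the average per-iteration time in ms."""
    return batch * 1000.0 / avg_ms

# Avg. total time from the bs=1 profile above: 21.2421 ms -> ~47 FPS
print(round(fps_from_avg_latency_ms(21.2421), 1))  # 47.1
# NVIDIA's 1408x512 run: 12.1155 ms -> ~82.5 FPS
print(round(fps_from_avg_latency_ms(12.1155), 1))  # 82.5
```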

I am concerned that the power drawn when I run the model on the DLA is the same as the power drawn when I run it on the GPU. I am using DeepStream 6.3. See my first post for details.

Ah… I think the performance gap is likely due to the JetPack version.

As mentioned in https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference/#yolov5_dla_performance , there are performance improvements for YOLO networks in JetPack 6.0.
Can you upgrade the Jetpack version and DeepStream?

Thanks!

I will update the JetPack version and rerun it.
But why am I getting the same power consumption on the DLA as on the GPU? Isn't the DLA supposed to be more power-efficient than the GPU?
I have another question: is it possible to use both DLAs in one DeepStream instance for improved performance?

Thanks again for your help.

why am I getting the same power consumption on the DLA as on the GPU? Isn't the DLA supposed to be more power-efficient than the GPU?

In your test, did you compare the power while running the same load on the DLA and on the GPU?
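One way to make that comparison concrete is to log `tegrastats` during each run (e.g. `sudo tegrastats --interval 1000 --logfile power.log`) and compare the per-rail power fields. Below is a hedged sketch of a parser for those fields; note that the rail names (`VDD_GPU_SOC`, `VDD_CPU_CV`, `VIN_SYS_5V0`, …) vary by module and JetPack version, and the sample line is illustrative only:

```python
import re

# Matches tegrastats power fields of the form "RAIL <instant>mW/<average>mW"
_POWER_RE = re.compile(r"(\w+) (\d+)mW/(\d+)mW")

def parse_power(line: str) -> dict:
    """Return {rail_name: (instant_mW, average_mW)} for one tegrastats line."""
    return {name: (int(cur), int(avg))
            for name, cur, avg in _POWER_RE.findall(line)}

# Illustrative line; actual rail names are board-specific
sample = "RAM 4722/30592MB VDD_GPU_SOC 3169mW/3169mW VDD_CPU_CV 1579mW/1579mW"
print(parse_power(sample))
```

Running both the DLA and GPU configurations for the same duration and averaging the parsed values gives a per-rail comparison rather than a single whole-board number.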

is it possible to use both DLAs in one DeepStream instance for improved performance?

Yes. That’s feasible by running the GPU and the two DLAs with separate nvdsinfer plugin instances.
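As a rough sketch of what "separate nvdsinfer instances" looks like in practice: each nvinfer element gets its own config file, identical except for the DLA core it targets. The `enable-dla` and `use-dla-core` keys select the core per instance (the file and engine names below are hypothetical):

```ini
# dla1_config.txt -- config for the second nvinfer instance; the first
# is identical except use-dla-core=0 and its own engine path
[property]
enable-dla=1
use-dla-core=1
network-mode=1                 # 0=FP32, 1=INT8, 2=FP16
onnx-file=yolov8s_1400_512_bs1.onnx
model-engine-file=yolov8s_dla1_b1_int8.engine
```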

I think I have the same load on the GPU and the DLA: I am using the same ONNX file in DeepStream, only changing the settings to run the engine on the DLA or the GPU. Does this guarantee the same load?

With separate nvdsinfer plugin instances, are we running separate AI engines on the GPU and the two DLAs, or are we spreading one engine across them? Can you point me to an example?

Same load means the same ONNX model, inference precision, FPS, etc.

With separate nvdsinfer plugin instances, are we running separate AI engines on the GPU and the two DLAs, or are we spreading one engine across them? Can you point me to an example?

AFAIK there is no such example; I will double-check and get back to you later.

Yes, it is the same ONNX, inference precision, and FPS. I had forgotten to mention this: I ran it on an AGX Orin with JP 5.1.2 in the 30W power mode. When I run the models on the DLA in MAXN power mode, I do see a difference in power consumption between DLA and GPU.

Hi @raghavendra.ramya ,
If you want to run the GPU and DLA in parallel on the same batch of sources, you can refer to the GitHub - NVIDIA-AI-IOT/deepstream_parallel_inference_app sample (a project demonstrating how to use nvmetamux to run multiple models in parallel).

I do see a difference in power consumption between DLA and GPU

Yes, this is expected.

Let me know if you have further questions.

Thanks!

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.