AGX Orin DeepStream 6.1.1: deepstream-app FPS with DLA is lower than on AGX Xavier

Hi,
I tested DeepStream with DLA on Xavier and Orin, and found that the FPS on Orin is lower than on Xavier. My configs:
front60-30_1080p_dec_infer-resnet_tiled_display_fp16_dla_video6.txt (5.9 KB)
config_infer_primary_dla_fp16_video6.txt (4.0 KB)
Input file: /opt/nvidia/deepstream/deepstream-6.1/samples/streams/front-60.mp4
Test commands:

cp front60-30_1080p_dec_infer-resnet_tiled_display_fp16_dla_video6.txt /opt/nvidia/deepstream/deepstream-6.1/samples/
cp config_infer_primary_dla_fp16_video6.txt /opt/nvidia/deepstream/deepstream-6.1/samples/
cd /opt/nvidia/deepstream/deepstream-6.1/samples/
deepstream-app -c configs/deepstream-app/front60-30_1080p_dec_infer-resnet_tiled_display_fp16_dla_video6.txt

On Xavier the FPS is 39, but on Orin it is only 10.
The environments on Xavier and Orin are identical:

Jetpack: 5.0.2 [L4T 35.1.0]
CUDA: 11.4
cuDNN: 8.4.1.50
TensorRT: 8.4.1.5

Is this a problem with my configuration?

What is the situation without DLA? Is Orin's performance better than Xavier's?

It's not better either. I tested with 8 sources (batch size 8) on the GPU: 60 FPS on Xavier, but only 53 FPS on Orin.
The configs:
config_infer_primary_gpu_fp16_video8.txt (4.0 KB)
front60-30_1080p_dec_infer-resnet_tiled_display_fp16_gpu_video8.txt (6.5 KB)

With DLA (batch size 6), on the Orin devkit:
(screenshot)

On the Xavier devkit:
(screenshot)

So it has nothing to do with DLA, right?

We don't know which Orin and Xavier devices you are using, or your test video's resolution and FPS. According to the technical specifications on the Jetson Modules, Support, Ecosystem, and Lineup | NVIDIA Developer page, Orin has lower video decode capability than Xavier. The video decoder may be the bottleneck in your case.

Hi Fiona.Chen, thank you very much for the reply.
From Nsight it looks like DLA inference takes longer on Orin than on Xavier. And according to the docs, our Orin's decoder performance is half of Xavier's, but with both GPU and DLA the FPS gap is not in that proportion.

Here is the device and video file info:
Orin devkit: 64 GB, 12 cores
Xavier devkit: 32 GB, 8 cores
Input mp4:
(screenshot)
Could you test on Xavier and Orin with our configuration files?

To compare the DLA performance directly, can you try the "trtexec" tool to measure the inference time with DLA enabled?

Just add the "--useDLACore=0 --allowGPUFallback" options when you use "trtexec" to build and run inference.

I measured the inference time with trtexec, but it failed:

/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine 
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine
[10/26/2022-20:28:53] [I] === Model Options ===
[10/26/2022-20:28:53] [I] Format: *
[10/26/2022-20:28:53] [I] Model: 
[10/26/2022-20:28:53] [I] Output:
[10/26/2022-20:28:53] [I] === Build Options ===
[10/26/2022-20:28:53] [I] Max batch: 1
[10/26/2022-20:28:53] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/26/2022-20:28:53] [I] minTiming: 1
[10/26/2022-20:28:53] [I] avgTiming: 8
[10/26/2022-20:28:53] [I] Precision: FP32
[10/26/2022-20:28:53] [I] LayerPrecisions: 
[10/26/2022-20:28:53] [I] Calibration: 
[10/26/2022-20:28:53] [I] Refit: Disabled
[10/26/2022-20:28:53] [I] Sparsity: Disabled
[10/26/2022-20:28:53] [I] Safe mode: Disabled
[10/26/2022-20:28:53] [I] DirectIO mode: Disabled
[10/26/2022-20:28:53] [I] Restricted mode: Disabled
[10/26/2022-20:28:53] [I] Build only: Disabled
[10/26/2022-20:28:53] [I] Save engine: 
[10/26/2022-20:28:53] [I] Load engine: resnet10.caffemodel_b6_dla0_fp16.engine
[10/26/2022-20:28:53] [I] Profiling verbosity: 0
[10/26/2022-20:28:53] [I] Tactic sources: Using default tactic sources
[10/26/2022-20:28:53] [I] timingCacheMode: local
[10/26/2022-20:28:53] [I] timingCacheFile: 
[10/26/2022-20:28:53] [I] Input(s)s format: fp32:CHW
[10/26/2022-20:28:53] [I] Output(s)s format: fp32:CHW
[10/26/2022-20:28:53] [I] Input build shapes: model
[10/26/2022-20:28:53] [I] Input calibration shapes: model
[10/26/2022-20:28:53] [I] === System Options ===
[10/26/2022-20:28:53] [I] Device: 0
[10/26/2022-20:28:53] [I] DLACore: 
[10/26/2022-20:28:53] [I] Plugins:
[10/26/2022-20:28:53] [I] === Inference Options ===
[10/26/2022-20:28:53] [I] Batch: 1
[10/26/2022-20:28:53] [I] Input inference shapes: model
[10/26/2022-20:28:53] [I] Iterations: 10
[10/26/2022-20:28:53] [I] Duration: 3s (+ 200ms warm up)
[10/26/2022-20:28:53] [I] Sleep time: 0ms
[10/26/2022-20:28:53] [I] Idle time: 0ms
[10/26/2022-20:28:53] [I] Streams: 1
[10/26/2022-20:28:53] [I] ExposeDMA: Disabled
[10/26/2022-20:28:53] [I] Data transfers: Enabled
[10/26/2022-20:28:53] [I] Spin-wait: Disabled
[10/26/2022-20:28:53] [I] Multithreading: Disabled
[10/26/2022-20:28:53] [I] CUDA Graph: Disabled
[10/26/2022-20:28:53] [I] Separate profiling: Disabled
[10/26/2022-20:28:53] [I] Time Deserialize: Disabled
[10/26/2022-20:28:53] [I] Time Refit: Disabled
[10/26/2022-20:28:53] [I] Inputs:
[10/26/2022-20:28:53] [I] === Reporting Options ===
[10/26/2022-20:28:53] [I] Verbose: Disabled
[10/26/2022-20:28:53] [I] Averages: 10 inferences
[10/26/2022-20:28:53] [I] Percentile: 99
[10/26/2022-20:28:53] [I] Dump refittable layers:Disabled
[10/26/2022-20:28:53] [I] Dump output: Disabled
[10/26/2022-20:28:53] [I] Profile: Disabled
[10/26/2022-20:28:53] [I] Export timing to JSON file: 
[10/26/2022-20:28:53] [I] Export output to JSON file: 
[10/26/2022-20:28:53] [I] Export profile to JSON file: 
[10/26/2022-20:28:53] [I] 
[10/26/2022-20:28:53] [I] === Device Information ===
[10/26/2022-20:28:53] [I] Selected Device: Orin
[10/26/2022-20:28:53] [I] Compute Capability: 8.7
[10/26/2022-20:28:53] [I] SMs: 16
[10/26/2022-20:28:53] [I] Compute Clock Rate: 1.3 GHz
[10/26/2022-20:28:53] [I] Device Global Memory: 30535 MiB
[10/26/2022-20:28:53] [I] Shared Memory per SM: 164 KiB
[10/26/2022-20:28:53] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/26/2022-20:28:53] [I] Memory Clock Rate: 1.3 GHz
[10/26/2022-20:28:53] [I] 
[10/26/2022-20:28:53] [I] TensorRT version: 8.4.1
[10/26/2022-20:28:53] [I] Engine loaded in 0.00268695 sec.
[10/26/2022-20:28:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 245, GPU 23642 (MiB)
[10/26/2022-20:28:54] [I] [TRT] Loaded engine size: 3 MiB
[10/26/2022-20:28:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/26/2022-20:28:54] [I] Engine deserialized in 1.00899 sec.
[10/26/2022-20:28:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +51, now: CPU 3, GPU 51 (MiB)
[10/26/2022-20:28:54] [I] Using random values for input input_1
[10/26/2022-20:28:54] [I] Created input binding for input_1 with dimensions 3x368x640
[10/26/2022-20:28:54] [I] Using random values for output conv2d_bbox
[10/26/2022-20:28:54] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/26/2022-20:28:54] [I] Using random values for output conv2d_cov/Sigmoid
[10/26/2022-20:28:54] [I] Created output binding for conv2d_cov/Sigmoid with dimensions 4x23x40
[10/26/2022-20:28:54] [I] Starting inference
[10/26/2022-20:28:55] [E] Error[1]: [nvdlaUtils.cpp::submit::199] Error Code 1: DLA (Failure to submit program to DLA engine.)
[10/26/2022-20:28:55] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine
terminate called after throwing an instance of 'nvinfer1::InternalError'
  what():  Assertion !mCudaMemory || !mNvmTensor failed. 
Aborted

resnet10.caffemodel_b6_dla0_fp16.engine is the engine auto-created by deepstream-app.
I then tried building an engine with trtexec:

/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --deploy=resnet10.prototxt --model=resnet10.caffemodel --saveEngine=resnet10.dla.engine --fp16 --output='conv2d_bbox' --useDLACore=0 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --deploy=resnet10.prototxt --model=resnet10.caffemodel --saveEngine=resnet10.dla.engine --fp16 --output=conv2d_bbox --useDLACore=0 --allowGPUFallback
[10/27/2022-10:48:53] [I] === Model Options ===
[10/27/2022-10:48:53] [I] Format: Caffe
[10/27/2022-10:48:53] [I] Model: resnet10.caffemodel
[10/27/2022-10:48:53] [I] Prototxt: resnet10.prototxt
[10/27/2022-10:48:53] [I] Output: conv2d_bbox
[10/27/2022-10:48:53] [I] === Build Options ===
[10/27/2022-10:48:53] [I] Max batch: 1
[10/27/2022-10:48:53] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-10:48:53] [I] minTiming: 1
[10/27/2022-10:48:53] [I] avgTiming: 8
[10/27/2022-10:48:53] [I] Precision: FP32+FP16
[10/27/2022-10:48:53] [I] LayerPrecisions: 
[10/27/2022-10:48:53] [I] Calibration: 
[10/27/2022-10:48:53] [I] Refit: Disabled
[10/27/2022-10:48:53] [I] Sparsity: Disabled
[10/27/2022-10:48:53] [I] Safe mode: Disabled
[10/27/2022-10:48:53] [I] DirectIO mode: Disabled
[10/27/2022-10:48:53] [I] Restricted mode: Disabled
[10/27/2022-10:48:53] [I] Build only: Disabled
[10/27/2022-10:48:53] [I] Save engine: resnet10.dla.engine
[10/27/2022-10:48:53] [I] Load engine: 
[10/27/2022-10:48:53] [I] Profiling verbosity: 0
[10/27/2022-10:48:53] [I] Tactic sources: Using default tactic sources
[10/27/2022-10:48:53] [I] timingCacheMode: local
[10/27/2022-10:48:53] [I] timingCacheFile: 
[10/27/2022-10:48:53] [I] Input(s)s format: fp32:CHW
[10/27/2022-10:48:53] [I] Output(s)s format: fp32:CHW
[10/27/2022-10:48:53] [I] Input build shapes: model
[10/27/2022-10:48:53] [I] Input calibration shapes: model
[10/27/2022-10:48:53] [I] === System Options ===
[10/27/2022-10:48:53] [I] Device: 0
[10/27/2022-10:48:53] [I] DLACore: 0(With GPU fallback)
[10/27/2022-10:48:53] [I] Plugins:
[10/27/2022-10:48:53] [I] === Inference Options ===
[10/27/2022-10:48:53] [I] Batch: 1
[10/27/2022-10:48:53] [I] Input inference shapes: model
[10/27/2022-10:48:53] [I] Iterations: 10
[10/27/2022-10:48:53] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-10:48:53] [I] Sleep time: 0ms
[10/27/2022-10:48:53] [I] Idle time: 0ms
[10/27/2022-10:48:53] [I] Streams: 1
[10/27/2022-10:48:53] [I] ExposeDMA: Disabled
[10/27/2022-10:48:53] [I] Data transfers: Enabled
[10/27/2022-10:48:53] [I] Spin-wait: Disabled
[10/27/2022-10:48:53] [I] Multithreading: Disabled
[10/27/2022-10:48:53] [I] CUDA Graph: Disabled
[10/27/2022-10:48:53] [I] Separate profiling: Disabled
[10/27/2022-10:48:53] [I] Time Deserialize: Disabled
[10/27/2022-10:48:53] [I] Time Refit: Disabled
[10/27/2022-10:48:53] [I] Inputs:
[10/27/2022-10:48:53] [I] === Reporting Options ===
[10/27/2022-10:48:53] [I] Verbose: Disabled
[10/27/2022-10:48:53] [I] Averages: 10 inferences
[10/27/2022-10:48:53] [I] Percentile: 99
[10/27/2022-10:48:53] [I] Dump refittable layers:Disabled
[10/27/2022-10:48:53] [I] Dump output: Disabled
[10/27/2022-10:48:53] [I] Profile: Disabled
[10/27/2022-10:48:53] [I] Export timing to JSON file: 
[10/27/2022-10:48:53] [I] Export output to JSON file: 
[10/27/2022-10:48:53] [I] Export profile to JSON file: 
[10/27/2022-10:48:53] [I] 
[10/27/2022-10:48:53] [I] === Device Information ===
[10/27/2022-10:48:53] [I] Selected Device: Orin
[10/27/2022-10:48:53] [I] Compute Capability: 8.7
[10/27/2022-10:48:53] [I] SMs: 16
[10/27/2022-10:48:53] [I] Compute Clock Rate: 1.3 GHz
[10/27/2022-10:48:53] [I] Device Global Memory: 30535 MiB
[10/27/2022-10:48:53] [I] Shared Memory per SM: 164 KiB
[10/27/2022-10:48:53] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/27/2022-10:48:53] [I] Memory Clock Rate: 1.3 GHz
[10/27/2022-10:48:53] [I] 
[10/27/2022-10:48:53] [I] TensorRT version: 8.4.1
[10/27/2022-10:48:53] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 242, GPU 8269 (MiB)
[10/27/2022-10:48:56] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +351, GPU +330, now: CPU 612, GPU 8616 (MiB)
[10/27/2022-10:48:56] [W] [TRT] The implicit batch dimension mode has been deprecated. Please create the network with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag whenever possible.
[10/27/2022-10:48:56] [I] Start parsing network model
[10/27/2022-10:48:56] [I] Finish parsing network model
[10/27/2022-10:48:57] [I] [TRT] ---------- Layers Running on DLA ----------
[10/27/2022-10:48:57] [I] [TRT] [DlaLayer] {ForeignNode[conv1 + bn_conv1...conv2d_bbox]}
[10/27/2022-10:48:57] [I] [TRT] ---------- Layers Running on GPU ----------
[10/27/2022-10:48:58] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +662, now: CPU 1160, GPU 9309 (MiB)
[10/27/2022-10:48:58] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +86, GPU +143, now: CPU 1246, GPU 9452 (MiB)
[10/27/2022-10:48:58] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/27/2022-10:49:00] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[10/27/2022-10:49:00] [I] [TRT] Total Host Persistent Memory: 848
[10/27/2022-10:49:00] [I] [TRT] Total Device Persistent Memory: 0
[10/27/2022-10:49:00] [I] [TRT] Total Scratch Memory: 0
[10/27/2022-10:49:00] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 9 MiB
[10/27/2022-10:49:00] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.014944ms to assign 2 blocks to 2 nodes requiring 8949760 bytes.
[10/27/2022-10:49:00] [I] [TRT] Total Activation Memory: 8949760
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-10:49:00] [I] Engine built in 7.04876 sec.
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 886, GPU 9495 (MiB)
[10/27/2022-10:49:00] [I] [TRT] Loaded engine size: 3 MiB
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-10:49:00] [I] Engine deserialized in 0.00217923 sec.
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8, now: CPU 3, GPU 8 (MiB)
[10/27/2022-10:49:00] [I] Using random values for input input_1
[10/27/2022-10:49:00] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-10:49:00] [I] Using random values for output conv2d_bbox
[10/27/2022-10:49:00] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-10:49:00] [I] Starting inference
[10/27/2022-10:49:03] [I] Warmup completed 9 queries over 200 ms
[10/27/2022-10:49:03] [I] Timing trace has 133 queries over 3.0752 s
[10/27/2022-10:49:03] [I] 
[10/27/2022-10:49:03] [I] === Trace details ===
[10/27/2022-10:49:03] [I] Trace averages of 10 runs:
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9712 ms - Host latency: 23.1162 ms (enqueue 22.8476 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 23.0688 ms - Host latency: 23.2141 ms (enqueue 23.0029 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 23.0611 ms - Host latency: 23.2024 ms (enqueue 23.0085 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9105 ms - Host latency: 23.0498 ms (enqueue 22.9018 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.8784 ms - Host latency: 23.0175 ms (enqueue 22.8028 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9854 ms - Host latency: 23.1289 ms (enqueue 22.8964 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.997 ms - Host latency: 23.143 ms (enqueue 22.9572 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9697 ms - Host latency: 23.1113 ms (enqueue 22.8876 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.8642 ms - Host latency: 23.0036 ms (enqueue 22.8416 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9444 ms - Host latency: 23.0885 ms (enqueue 22.8449 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9925 ms - Host latency: 23.1364 ms (enqueue 22.9485 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.8987 ms - Host latency: 23.0356 ms (enqueue 22.8807 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.7963 ms - Host latency: 22.9571 ms (enqueue 22.7223 ms)
[10/27/2022-10:49:03] [I] 
[10/27/2022-10:49:03] [I] === Performance summary ===
[10/27/2022-10:49:03] [I] Throughput: 43.2493 qps
[10/27/2022-10:49:03] [I] Latency: min = 22.8521 ms, max = 23.998 ms, mean = 23.0956 ms, median = 22.9065 ms, percentile(99%) = 23.8329 ms
[10/27/2022-10:49:03] [I] Enqueue Time: min = 22.3805 ms, max = 23.8909 ms, mean = 22.8901 ms, median = 22.8617 ms, percentile(99%) = 23.592 ms
[10/27/2022-10:49:03] [I] H2D Latency: min = 0.120605 ms, max = 0.1875 ms, mean = 0.133684 ms, median = 0.125244 ms, percentile(99%) = 0.182373 ms
[10/27/2022-10:49:03] [I] GPU Compute Time: min = 22.7177 ms, max = 23.8459 ms, mean = 22.9511 ms, median = 22.7659 ms, percentile(99%) = 23.6998 ms
[10/27/2022-10:49:03] [I] D2H Latency: min = 0.00708008 ms, max = 0.013916 ms, mean = 0.0108492 ms, median = 0.0109253 ms, percentile(99%) = 0.0134277 ms
[10/27/2022-10:49:03] [I] Total Host Walltime: 3.0752 s
[10/27/2022-10:49:03] [I] Total GPU Compute Time: 3.05249 s
[10/27/2022-10:49:03] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[10/27/2022-10:49:03] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[10/27/2022-10:49:03] [W] * GPU compute time is unstable, with coefficient of variance = 1.26437%.
[10/27/2022-10:49:03] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/27/2022-10:49:03] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/27/2022-10:49:03] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --deploy=resnet10.prototxt --model=resnet10.caffemodel --saveEngine=resnet10.dla.engine --fp16 --output=conv2d_bbox --useDLACore=0 --allowGPUFallback
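As a sanity check on these numbers: single-stream throughput is just the inverse of the mean latency, so the ~23 ms mean latency corresponds to the ~43 qps reported above.

```shell
# Single-stream throughput = 1000 / mean latency (ms).
# 23.0956 ms mean latency -> ~43.3 inferences per second.
awk 'BEGIN { printf "%.1f qps\n", 1000 / 23.0956 }'
```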

Then I measured the inference time:

/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.dla.engine --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.dla.engine --dumpProfile
[10/27/2022-10:49:38] [I] === Model Options ===
[10/27/2022-10:49:38] [I] Format: *
[10/27/2022-10:49:38] [I] Model: 
[10/27/2022-10:49:38] [I] Output:
[10/27/2022-10:49:38] [I] === Build Options ===
[10/27/2022-10:49:38] [I] Max batch: 1
[10/27/2022-10:49:38] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-10:49:38] [I] minTiming: 1
[10/27/2022-10:49:38] [I] avgTiming: 8
[10/27/2022-10:49:38] [I] Precision: FP32
[10/27/2022-10:49:38] [I] LayerPrecisions: 
[10/27/2022-10:49:38] [I] Calibration: 
[10/27/2022-10:49:38] [I] Refit: Disabled
[10/27/2022-10:49:38] [I] Sparsity: Disabled
[10/27/2022-10:49:38] [I] Safe mode: Disabled
[10/27/2022-10:49:38] [I] DirectIO mode: Disabled
[10/27/2022-10:49:38] [I] Restricted mode: Disabled
[10/27/2022-10:49:38] [I] Build only: Disabled
[10/27/2022-10:49:38] [I] Save engine: 
[10/27/2022-10:49:38] [I] Load engine: resnet10.dla.engine
[10/27/2022-10:49:38] [I] Profiling verbosity: 0
[10/27/2022-10:49:38] [I] Tactic sources: Using default tactic sources
[10/27/2022-10:49:38] [I] timingCacheMode: local
[10/27/2022-10:49:38] [I] timingCacheFile: 
[10/27/2022-10:49:38] [I] Input(s)s format: fp32:CHW
[10/27/2022-10:49:38] [I] Output(s)s format: fp32:CHW
[10/27/2022-10:49:38] [I] Input build shapes: model
[10/27/2022-10:49:38] [I] Input calibration shapes: model
[10/27/2022-10:49:38] [I] === System Options ===
[10/27/2022-10:49:38] [I] Device: 0
[10/27/2022-10:49:38] [I] DLACore: 
[10/27/2022-10:49:38] [I] Plugins:
[10/27/2022-10:49:38] [I] === Inference Options ===
[10/27/2022-10:49:38] [I] Batch: 1
[10/27/2022-10:49:38] [I] Input inference shapes: model
[10/27/2022-10:49:38] [I] Iterations: 10
[10/27/2022-10:49:38] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-10:49:38] [I] Sleep time: 0ms
[10/27/2022-10:49:38] [I] Idle time: 0ms
[10/27/2022-10:49:38] [I] Streams: 1
[10/27/2022-10:49:38] [I] ExposeDMA: Disabled
[10/27/2022-10:49:38] [I] Data transfers: Enabled
[10/27/2022-10:49:38] [I] Spin-wait: Disabled
[10/27/2022-10:49:38] [I] Multithreading: Disabled
[10/27/2022-10:49:38] [I] CUDA Graph: Disabled
[10/27/2022-10:49:38] [I] Separate profiling: Disabled
[10/27/2022-10:49:38] [I] Time Deserialize: Disabled
[10/27/2022-10:49:38] [I] Time Refit: Disabled
[10/27/2022-10:49:38] [I] Inputs:
[10/27/2022-10:49:38] [I] === Reporting Options ===
[10/27/2022-10:49:38] [I] Verbose: Disabled
[10/27/2022-10:49:38] [I] Averages: 10 inferences
[10/27/2022-10:49:38] [I] Percentile: 99
[10/27/2022-10:49:38] [I] Dump refittable layers:Disabled
[10/27/2022-10:49:38] [I] Dump output: Disabled
[10/27/2022-10:49:38] [I] Profile: Enabled
[10/27/2022-10:49:38] [I] Export timing to JSON file: 
[10/27/2022-10:49:38] [I] Export output to JSON file: 
[10/27/2022-10:49:38] [I] Export profile to JSON file: 
[10/27/2022-10:49:38] [I] 
[10/27/2022-10:49:39] [I] === Device Information ===
[10/27/2022-10:49:39] [I] Selected Device: Orin
[10/27/2022-10:49:39] [I] Compute Capability: 8.7
[10/27/2022-10:49:39] [I] SMs: 16
[10/27/2022-10:49:39] [I] Compute Clock Rate: 1.3 GHz
[10/27/2022-10:49:39] [I] Device Global Memory: 30535 MiB
[10/27/2022-10:49:39] [I] Shared Memory per SM: 164 KiB
[10/27/2022-10:49:39] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/27/2022-10:49:39] [I] Memory Clock Rate: 1.3 GHz
[10/27/2022-10:49:39] [I] 
[10/27/2022-10:49:39] [I] TensorRT version: 8.4.1
[10/27/2022-10:49:39] [I] Engine loaded in 0.003854 sec.
[10/27/2022-10:49:39] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 245, GPU 8271 (MiB)
[10/27/2022-10:49:39] [I] [TRT] Loaded engine size: 3 MiB
[10/27/2022-10:49:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-10:49:39] [I] Engine deserialized in 0.525183 sec.
[10/27/2022-10:49:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8, now: CPU 3, GPU 8 (MiB)
[10/27/2022-10:49:39] [I] Using random values for input input_1
[10/27/2022-10:49:39] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-10:49:39] [I] Using random values for output conv2d_bbox
[10/27/2022-10:49:39] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-10:49:39] [I] Starting inference
[10/27/2022-10:49:42] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[10/27/2022-10:49:42] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[10/27/2022-10:49:42] [I] 
[10/27/2022-10:49:42] [I] === Profile (139 iterations ) ===
[10/27/2022-10:49:42] [I]                                                                                      Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[10/27/2022-10:49:42] [I]                                                                             input_1 to nvm       12.62           0.0908             0.0903      0.4
[10/27/2022-10:49:42] [I]  Reformatting CopyNode for Input Tensor 0 to {ForeignNode[conv1 + bn_conv1...conv2d_bbox]}       54.65           0.3932             0.3930      1.7
[10/27/2022-10:49:42] [I]                                              {ForeignNode[conv1 + bn_conv1...conv2d_bbox]}        4.31           0.0310             0.0307      0.1
[10/27/2022-10:49:42] [I]                                                                       conv2d_bbox from nvm     3132.38          22.5351            22.5350     97.7
[10/27/2022-10:49:42] [I]         Reformatted Input Tensor 0 to {ForeignNode[conv1 + bn_conv1...conv2d_bbox]} finish        0.59           0.0043             0.0042      0.0
[10/27/2022-10:49:42] [I]                                                                    conv2d_bbox copy finish        0.90           0.0065             0.0064      0.0
[10/27/2022-10:49:42] [I]                                                                                      Total     3205.45          23.0608            23.0599    100.0
[10/27/2022-10:49:42] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.dla.engine --dumpProfile
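For orientation when reading the profile: the Avg. Time column sums to the per-iteration Total, and Time % is each step's share of it. Nearly all of the ~23 ms is attributed to the "conv2d_bbox from nvm" step (which likely includes waiting for the DLA subgraph to finish) rather than the ForeignNode launch itself:

```shell
# Sum the per-iteration averages from the profile table above and
# compute the share of the dominant "conv2d_bbox from nvm" step.
awk 'BEGIN {
  total = 0.0908 + 0.3932 + 0.0310 + 22.5351 + 0.0043 + 0.0065
  printf "total %.2f ms, dominant step %.1f%%\n", total, 22.5351 / total * 100
}'
```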

Please add the "--useDLACore=0 --allowGPUFallback" options.

Thank you. The cause was the batch size; after setting --batch=6 it succeeded, but the inference time is 140 ms:

/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6
[10/27/2022-12:03:55] [I] === Model Options ===
[10/27/2022-12:03:55] [I] Format: *
[10/27/2022-12:03:55] [I] Model: 
[10/27/2022-12:03:55] [I] Output:
[10/27/2022-12:03:55] [I] === Build Options ===
[10/27/2022-12:03:55] [I] Max batch: 6
[10/27/2022-12:03:55] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-12:03:55] [I] minTiming: 1
[10/27/2022-12:03:55] [I] avgTiming: 8
[10/27/2022-12:03:55] [I] Precision: FP32
[10/27/2022-12:03:55] [I] LayerPrecisions: 
[10/27/2022-12:03:55] [I] Calibration: 
[10/27/2022-12:03:55] [I] Refit: Disabled
[10/27/2022-12:03:55] [I] Sparsity: Disabled
[10/27/2022-12:03:55] [I] Safe mode: Disabled
[10/27/2022-12:03:55] [I] DirectIO mode: Disabled
[10/27/2022-12:03:55] [I] Restricted mode: Disabled
[10/27/2022-12:03:55] [I] Build only: Disabled
[10/27/2022-12:03:55] [I] Save engine: 
[10/27/2022-12:03:55] [I] Load engine: resnet10.caffemodel_b6_dla0_fp16.engine
[10/27/2022-12:03:55] [I] Profiling verbosity: 0
[10/27/2022-12:03:55] [I] Tactic sources: Using default tactic sources
[10/27/2022-12:03:55] [I] timingCacheMode: local
[10/27/2022-12:03:55] [I] timingCacheFile: 
[10/27/2022-12:03:55] [I] Input(s)s format: fp32:CHW
[10/27/2022-12:03:55] [I] Output(s)s format: fp32:CHW
[10/27/2022-12:03:55] [I] Input build shapes: model
[10/27/2022-12:03:55] [I] Input calibration shapes: model
[10/27/2022-12:03:55] [I] === System Options ===
[10/27/2022-12:03:55] [I] Device: 0
[10/27/2022-12:03:55] [I] DLACore: 0(With GPU fallback)
[10/27/2022-12:03:55] [I] Plugins:
[10/27/2022-12:03:55] [I] === Inference Options ===
[10/27/2022-12:03:55] [I] Batch: 6
[10/27/2022-12:03:55] [I] Input inference shapes: model
[10/27/2022-12:03:55] [I] Iterations: 10
[10/27/2022-12:03:55] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-12:03:55] [I] Sleep time: 0ms
[10/27/2022-12:03:55] [I] Idle time: 0ms
[10/27/2022-12:03:55] [I] Streams: 1
[10/27/2022-12:03:55] [I] ExposeDMA: Disabled
[10/27/2022-12:03:55] [I] Data transfers: Enabled
[10/27/2022-12:03:55] [I] Spin-wait: Disabled
[10/27/2022-12:03:55] [I] Multithreading: Disabled
[10/27/2022-12:03:55] [I] CUDA Graph: Disabled
[10/27/2022-12:03:55] [I] Separate profiling: Disabled
[10/27/2022-12:03:55] [I] Time Deserialize: Disabled
[10/27/2022-12:03:55] [I] Time Refit: Disabled
[10/27/2022-12:03:55] [I] Inputs:
[10/27/2022-12:03:55] [I] === Reporting Options ===
[10/27/2022-12:03:55] [I] Verbose: Disabled
[10/27/2022-12:03:55] [I] Averages: 10 inferences
[10/27/2022-12:03:55] [I] Percentile: 99
[10/27/2022-12:03:55] [I] Dump refittable layers:Disabled
[10/27/2022-12:03:55] [I] Dump output: Disabled
[10/27/2022-12:03:55] [I] Profile: Disabled
[10/27/2022-12:03:55] [I] Export timing to JSON file: 
[10/27/2022-12:03:55] [I] Export output to JSON file: 
[10/27/2022-12:03:55] [I] Export profile to JSON file: 
[10/27/2022-12:03:55] [I] 
[10/27/2022-12:03:55] [I] === Device Information ===
[10/27/2022-12:03:55] [I] Selected Device: Orin
[10/27/2022-12:03:55] [I] Compute Capability: 8.7
[10/27/2022-12:03:55] [I] SMs: 16
[10/27/2022-12:03:55] [I] Compute Clock Rate: 1.3 GHz
[10/27/2022-12:03:55] [I] Device Global Memory: 30535 MiB
[10/27/2022-12:03:55] [I] Shared Memory per SM: 164 KiB
[10/27/2022-12:03:55] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/27/2022-12:03:55] [I] Memory Clock Rate: 1.3 GHz
[10/27/2022-12:03:55] [I] 
[10/27/2022-12:03:55] [I] TensorRT version: 8.4.1
[10/27/2022-12:03:55] [I] Engine loaded in 0.00328999 sec.
[10/27/2022-12:03:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 245, GPU 8325 (MiB)
[10/27/2022-12:03:55] [I] [TRT] Loaded engine size: 3 MiB
[10/27/2022-12:03:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-12:03:55] [I] Engine deserialized in 0.504519 sec.
[10/27/2022-12:03:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +43, now: CPU 3, GPU 43 (MiB)
[10/27/2022-12:03:55] [I] Using random values for input input_1
[10/27/2022-12:03:55] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-12:03:55] [I] Using random values for output conv2d_bbox
[10/27/2022-12:03:55] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-12:03:55] [I] Using random values for output conv2d_cov/Sigmoid
[10/27/2022-12:03:55] [I] Created output binding for conv2d_cov/Sigmoid with dimensions 4x23x40
[10/27/2022-12:03:55] [I] Starting inference
[10/27/2022-12:03:59] [I] Warmup completed 12 queries over 200 ms
[10/27/2022-12:03:59] [I] Timing trace has 138 queries over 3.38284 s
[10/27/2022-12:03:59] [I] 
[10/27/2022-12:03:59] [I] === Trace details ===
[10/27/2022-12:03:59] [I] Trace averages of 10 runs:
[10/27/2022-12:03:59] [I] Average on 10 runs - GPU latency: 141.248 ms - Host latency: 143.295 ms (enqueue 141.941 ms)
[10/27/2022-12:03:59] [I] Average on 10 runs - GPU latency: 140.286 ms - Host latency: 142.339 ms (enqueue 140.231 ms)
[10/27/2022-12:03:59] [I] 
[10/27/2022-12:03:59] [I] === Performance summary ===
[10/27/2022-12:03:59] [I] Throughput: 40.7941 qps
[10/27/2022-12:03:59] [I] Latency: min = 142.132 ms, max = 153.626 ms, mean = 142.743 ms, median = 142.157 ms, percentile(99%) = 153.626 ms
[10/27/2022-12:03:59] [I] Enqueue Time: min = 139.574 ms, max = 151.627 ms, mean = 140.97 ms, median = 140.253 ms, percentile(99%) = 151.627 ms
[10/27/2022-12:03:59] [I] H2D Latency: min = 1.97095 ms, max = 1.99023 ms, mean = 1.97807 ms, median = 1.97168 ms, percentile(99%) = 1.99023 ms
[10/27/2022-12:03:59] [I] GPU Compute Time: min = 140.088 ms, max = 151.578 ms, mean = 140.692 ms, median = 140.106 ms, percentile(99%) = 151.578 ms
[10/27/2022-12:03:59] [I] D2H Latency: min = 0.0637207 ms, max = 0.0734863 ms, mean = 0.0722417 ms, median = 0.0726929 ms, percentile(99%) = 0.0734863 ms
[10/27/2022-12:03:59] [I] Total Host Walltime: 3.38284 s
[10/27/2022-12:03:59] [I] Total GPU Compute Time: 3.23592 s
[10/27/2022-12:03:59] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[10/27/2022-12:03:59] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[10/27/2022-12:03:59] [W] * GPU compute time is unstable, with coefficient of variance = 1.65519%.
[10/27/2022-12:03:59] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/27/2022-12:03:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/27/2022-12:03:59] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6
/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6 --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6 --dumpProfile
[10/27/2022-12:04:11] [I] === Model Options ===
[10/27/2022-12:04:11] [I] Format: *
[10/27/2022-12:04:11] [I] Model: 
[10/27/2022-12:04:11] [I] Output:
[10/27/2022-12:04:11] [I] === Build Options ===
[10/27/2022-12:04:11] [I] Max batch: 6
[10/27/2022-12:04:11] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-12:04:11] [I] minTiming: 1
[10/27/2022-12:04:11] [I] avgTiming: 8
[10/27/2022-12:04:11] [I] Precision: FP32
[10/27/2022-12:04:11] [I] LayerPrecisions: 
[10/27/2022-12:04:11] [I] Calibration: 
[10/27/2022-12:04:11] [I] Refit: Disabled
[10/27/2022-12:04:11] [I] Sparsity: Disabled
[10/27/2022-12:04:11] [I] Safe mode: Disabled
[10/27/2022-12:04:11] [I] DirectIO mode: Disabled
[10/27/2022-12:04:11] [I] Restricted mode: Disabled
[10/27/2022-12:04:11] [I] Build only: Disabled
[10/27/2022-12:04:11] [I] Save engine: 
[10/27/2022-12:04:11] [I] Load engine: resnet10.caffemodel_b6_dla0_fp16.engine
[10/27/2022-12:04:11] [I] Profiling verbosity: 0
[10/27/2022-12:04:11] [I] Tactic sources: Using default tactic sources
[10/27/2022-12:04:11] [I] timingCacheMode: local
[10/27/2022-12:04:11] [I] timingCacheFile: 
[10/27/2022-12:04:11] [I] Input(s)s format: fp32:CHW
[10/27/2022-12:04:11] [I] Output(s)s format: fp32:CHW
[10/27/2022-12:04:11] [I] Input build shapes: model
[10/27/2022-12:04:11] [I] Input calibration shapes: model
[10/27/2022-12:04:11] [I] === System Options ===
[10/27/2022-12:04:11] [I] Device: 0
[10/27/2022-12:04:11] [I] DLACore: 0(With GPU fallback)
[10/27/2022-12:04:11] [I] Plugins:
[10/27/2022-12:04:11] [I] === Inference Options ===
[10/27/2022-12:04:11] [I] Batch: 6
[10/27/2022-12:04:11] [I] Input inference shapes: model
[10/27/2022-12:04:11] [I] Iterations: 10
[10/27/2022-12:04:11] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-12:04:11] [I] Sleep time: 0ms
[10/27/2022-12:04:11] [I] Idle time: 0ms
[10/27/2022-12:04:11] [I] Streams: 1
[10/27/2022-12:04:11] [I] ExposeDMA: Disabled
[10/27/2022-12:04:11] [I] Data transfers: Enabled
[10/27/2022-12:04:11] [I] Spin-wait: Disabled
[10/27/2022-12:04:11] [I] Multithreading: Disabled
[10/27/2022-12:04:11] [I] CUDA Graph: Disabled
[10/27/2022-12:04:11] [I] Separate profiling: Disabled
[10/27/2022-12:04:11] [I] Time Deserialize: Disabled
[10/27/2022-12:04:11] [I] Time Refit: Disabled
[10/27/2022-12:04:11] [I] Inputs:
[10/27/2022-12:04:11] [I] === Reporting Options ===
[10/27/2022-12:04:11] [I] Verbose: Disabled
[10/27/2022-12:04:11] [I] Averages: 10 inferences
[10/27/2022-12:04:11] [I] Percentile: 99
[10/27/2022-12:04:11] [I] Dump refittable layers:Disabled
[10/27/2022-12:04:11] [I] Dump output: Disabled
[10/27/2022-12:04:11] [I] Profile: Enabled
[10/27/2022-12:04:11] [I] Export timing to JSON file: 
[10/27/2022-12:04:11] [I] Export output to JSON file: 
[10/27/2022-12:04:11] [I] Export profile to JSON file: 
[10/27/2022-12:04:11] [I] 
[10/27/2022-12:04:12] [I] === Device Information ===
[10/27/2022-12:04:12] [I] Selected Device: Orin
[10/27/2022-12:04:12] [I] Compute Capability: 8.7
[10/27/2022-12:04:12] [I] SMs: 16
[10/27/2022-12:04:12] [I] Compute Clock Rate: 1.3 GHz
[10/27/2022-12:04:12] [I] Device Global Memory: 30535 MiB
[10/27/2022-12:04:12] [I] Shared Memory per SM: 164 KiB
[10/27/2022-12:04:12] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/27/2022-12:04:12] [I] Memory Clock Rate: 1.3 GHz
[10/27/2022-12:04:12] [I] 
[10/27/2022-12:04:12] [I] TensorRT version: 8.4.1
[10/27/2022-12:04:12] [I] Engine loaded in 0.00380333 sec.
[10/27/2022-12:04:12] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 245, GPU 8319 (MiB)
[10/27/2022-12:04:12] [I] [TRT] Loaded engine size: 3 MiB
[10/27/2022-12:04:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-12:04:12] [I] Engine deserialized in 0.535851 sec.
[10/27/2022-12:04:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +43, now: CPU 3, GPU 43 (MiB)
[10/27/2022-12:04:12] [I] Using random values for input input_1
[10/27/2022-12:04:12] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-12:04:12] [I] Using random values for output conv2d_bbox
[10/27/2022-12:04:12] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-12:04:12] [I] Using random values for output conv2d_cov/Sigmoid
[10/27/2022-12:04:12] [I] Created output binding for conv2d_cov/Sigmoid with dimensions 4x23x40
[10/27/2022-12:04:12] [I] Starting inference
[10/27/2022-12:04:16] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[10/27/2022-12:04:16] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[10/27/2022-12:04:16] [I] 
[10/27/2022-12:04:16] [I] === Profile (25 iterations ) ===
[10/27/2022-12:04:16] [I]                                                 Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[10/27/2022-12:04:16] [I]                                        input_1 to nvm       75.02           3.0009             2.9981      2.1
[10/27/2022-12:04:16] [I]  {ForeignNode[conv1 + bn_conv1...conv2d_cov/Sigmoid]}        0.75           0.0301             0.0299      0.0
[10/27/2022-12:04:16] [I]                           conv2d_cov/Sigmoid from nvm     3425.06         137.0022           137.0020     97.8
[10/27/2022-12:04:16] [I]                                   input_1 copy finish        0.11           0.0046             0.0045      0.0
[10/27/2022-12:04:16] [I]                        conv2d_cov/Sigmoid copy finish        0.17           0.0068             0.0070      0.0
[10/27/2022-12:04:16] [I]                                  conv2d_bbox from nvm        0.98           0.0392             0.0393      0.0
[10/27/2022-12:04:16] [I]                               conv2d_bbox copy finish        0.29           0.0115             0.0113      0.0
[10/27/2022-12:04:16] [I]                                                 Total     3502.38         140.0952           140.0916    100.0
[10/27/2022-12:04:16] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6 --dumpProfile


Please test the same command in both Orin and Xavier and compare the performance.

The result of the same test on Xavier:

/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6
[10/27/2022-13:44:44] [I] === Model Options ===
[10/27/2022-13:44:44] [I] Format: *
[10/27/2022-13:44:44] [I] Model: 
[10/27/2022-13:44:44] [I] Output:
[10/27/2022-13:44:44] [I] === Build Options ===
[10/27/2022-13:44:44] [I] Max batch: 6
[10/27/2022-13:44:44] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-13:44:44] [I] minTiming: 1
[10/27/2022-13:44:44] [I] avgTiming: 8
[10/27/2022-13:44:44] [I] Precision: FP32
[10/27/2022-13:44:44] [I] LayerPrecisions: 
[10/27/2022-13:44:44] [I] Calibration: 
[10/27/2022-13:44:44] [I] Refit: Disabled
[10/27/2022-13:44:44] [I] Sparsity: Disabled
[10/27/2022-13:44:44] [I] Safe mode: Disabled
[10/27/2022-13:44:44] [I] DirectIO mode: Disabled
[10/27/2022-13:44:44] [I] Restricted mode: Disabled
[10/27/2022-13:44:44] [I] Build only: Disabled
[10/27/2022-13:44:44] [I] Save engine: 
[10/27/2022-13:44:44] [I] Load engine: resnet10.caffemodel_b6_dla0_fp16.engine
[10/27/2022-13:44:44] [I] Profiling verbosity: 0
[10/27/2022-13:44:44] [I] Tactic sources: Using default tactic sources
[10/27/2022-13:44:44] [I] timingCacheMode: local
[10/27/2022-13:44:44] [I] timingCacheFile: 
[10/27/2022-13:44:44] [I] Input(s)s format: fp32:CHW
[10/27/2022-13:44:44] [I] Output(s)s format: fp32:CHW
[10/27/2022-13:44:44] [I] Input build shapes: model
[10/27/2022-13:44:44] [I] Input calibration shapes: model
[10/27/2022-13:44:44] [I] === System Options ===
[10/27/2022-13:44:44] [I] Device: 0
[10/27/2022-13:44:44] [I] DLACore: 0(With GPU fallback)
[10/27/2022-13:44:44] [I] Plugins:
[10/27/2022-13:44:44] [I] === Inference Options ===
[10/27/2022-13:44:44] [I] Batch: 6
[10/27/2022-13:44:44] [I] Input inference shapes: model
[10/27/2022-13:44:44] [I] Iterations: 10
[10/27/2022-13:44:44] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-13:44:44] [I] Sleep time: 0ms
[10/27/2022-13:44:44] [I] Idle time: 0ms
[10/27/2022-13:44:44] [I] Streams: 1
[10/27/2022-13:44:44] [I] ExposeDMA: Disabled
[10/27/2022-13:44:44] [I] Data transfers: Enabled
[10/27/2022-13:44:44] [I] Spin-wait: Disabled
[10/27/2022-13:44:44] [I] Multithreading: Disabled
[10/27/2022-13:44:44] [I] CUDA Graph: Disabled
[10/27/2022-13:44:44] [I] Separate profiling: Disabled
[10/27/2022-13:44:44] [I] Time Deserialize: Disabled
[10/27/2022-13:44:44] [I] Time Refit: Disabled
[10/27/2022-13:44:44] [I] Inputs:
[10/27/2022-13:44:44] [I] === Reporting Options ===
[10/27/2022-13:44:44] [I] Verbose: Disabled
[10/27/2022-13:44:44] [I] Averages: 10 inferences
[10/27/2022-13:44:44] [I] Percentile: 99
[10/27/2022-13:44:44] [I] Dump refittable layers:Disabled
[10/27/2022-13:44:44] [I] Dump output: Disabled
[10/27/2022-13:44:44] [I] Profile: Disabled
[10/27/2022-13:44:44] [I] Export timing to JSON file: 
[10/27/2022-13:44:44] [I] Export output to JSON file: 
[10/27/2022-13:44:44] [I] Export profile to JSON file: 
[10/27/2022-13:44:44] [I] 
[10/27/2022-13:44:44] [I] === Device Information ===
[10/27/2022-13:44:44] [I] Selected Device: Xavier
[10/27/2022-13:44:44] [I] Compute Capability: 7.2
[10/27/2022-13:44:44] [I] SMs: 8
[10/27/2022-13:44:44] [I] Compute Clock Rate: 1.377 GHz
[10/27/2022-13:44:44] [I] Device Global Memory: 31011 MiB
[10/27/2022-13:44:44] [I] Shared Memory per SM: 96 KiB
[10/27/2022-13:44:44] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/27/2022-13:44:44] [I] Memory Clock Rate: 1.377 GHz
[10/27/2022-13:44:44] [I] 
[10/27/2022-13:44:44] [I] TensorRT version: 8.4.1
[10/27/2022-13:44:44] [I] Engine loaded in 0.013447 sec.
[10/27/2022-13:44:45] [I] [TRT] [MemUsageChange] Init CUDA: CPU +186, GPU +0, now: CPU 215, GPU 3473 (MiB)
[10/27/2022-13:44:45] [I] [TRT] Loaded engine size: 5 MiB
[10/27/2022-13:44:45] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[10/27/2022-13:44:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +5, GPU +0, now: CPU 5, GPU 0 (MiB)
[10/27/2022-13:44:45] [I] Engine deserialized in 1.08134 sec.
[10/27/2022-13:44:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +10, now: CPU 5, GPU 10 (MiB)
[10/27/2022-13:44:45] [I] Using random values for input input_1
[10/27/2022-13:44:45] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-13:44:45] [I] Using random values for output conv2d_bbox
[10/27/2022-13:44:45] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-13:44:45] [I] Using random values for output conv2d_cov/Sigmoid
[10/27/2022-13:44:45] [I] Created output binding for conv2d_cov/Sigmoid with dimensions 4x23x40
[10/27/2022-13:44:45] [I] Starting inference
[10/27/2022-13:44:48] [I] Warmup completed 48 queries over 200 ms
[10/27/2022-13:44:48] [I] Timing trace has 696 queries over 3.09009 s
[10/27/2022-13:44:48] [I] 
[10/27/2022-13:44:48] [I] === Trace details ===
[10/27/2022-13:44:48] [I] Trace averages of 10 runs:
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 25.7475 ms - Host latency: 31.2928 ms (enqueue 25.6934 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 25.747 ms - Host latency: 31.2905 ms (enqueue 25.6388 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.0922 ms - Host latency: 31.7145 ms (enqueue 25.6858 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.5996 ms - Host latency: 32.4891 ms (enqueue 25.7648 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.6047 ms - Host latency: 32.493 ms (enqueue 25.7774 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.6083 ms - Host latency: 32.4983 ms (enqueue 25.7688 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.611 ms - Host latency: 32.4968 ms (enqueue 25.7736 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.6044 ms - Host latency: 32.4913 ms (enqueue 25.7968 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.6033 ms - Host latency: 32.4977 ms (enqueue 25.7831 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.6115 ms - Host latency: 32.5029 ms (enqueue 25.775 ms)
[10/27/2022-13:44:48] [I] Average on 10 runs - GPU latency: 26.6026 ms - Host latency: 32.4939 ms (enqueue 25.7536 ms)
[10/27/2022-13:44:48] [I] 
[10/27/2022-13:44:48] [I] === Performance summary ===
[10/27/2022-13:44:48] [I] Throughput: 225.236 qps
[10/27/2022-13:44:48] [I] Latency: min = 31.274 ms, max = 32.5283 ms, mean = 32.2203 ms, median = 32.4907 ms, percentile(99%) = 32.5222 ms
[10/27/2022-13:44:48] [I] Enqueue Time: min = 25.2906 ms, max = 26.0891 ms, mean = 25.7479 ms, median = 25.7297 ms, percentile(99%) = 26.0796 ms
[10/27/2022-13:44:48] [I] H2D Latency: min = 5.35971 ms, max = 5.71753 ms, mean = 5.62184 ms, median = 5.69928 ms, percentile(99%) = 5.71399 ms
[10/27/2022-13:44:48] [I] GPU Compute Time: min = 25.7349 ms, max = 26.6572 ms, mean = 26.4133 ms, median = 26.6022 ms, percentile(99%) = 26.6284 ms
[10/27/2022-13:44:48] [I] D2H Latency: min = 0.166016 ms, max = 0.197754 ms, mean = 0.185153 ms, median = 0.185242 ms, percentile(99%) = 0.196899 ms
[10/27/2022-13:44:48] [I] Total Host Walltime: 3.09009 s
[10/27/2022-13:44:48] [I] Total GPU Compute Time: 3.06394 s
[10/27/2022-13:44:48] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[10/27/2022-13:44:48] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[10/27/2022-13:44:48] [W] * GPU compute time is unstable, with coefficient of variance = 1.3579%.
[10/27/2022-13:44:48] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/27/2022-13:44:48] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/27/2022-13:44:48] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6
/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6 --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6 --dumpProfile
[10/27/2022-13:44:51] [I] === Model Options ===
[10/27/2022-13:44:51] [I] Format: *
[10/27/2022-13:44:51] [I] Model: 
[10/27/2022-13:44:51] [I] Output:
[10/27/2022-13:44:51] [I] === Build Options ===
[10/27/2022-13:44:51] [I] Max batch: 6
[10/27/2022-13:44:51] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-13:44:51] [I] minTiming: 1
[10/27/2022-13:44:51] [I] avgTiming: 8
[10/27/2022-13:44:51] [I] Precision: FP32
[10/27/2022-13:44:51] [I] LayerPrecisions: 
[10/27/2022-13:44:51] [I] Calibration: 
[10/27/2022-13:44:51] [I] Refit: Disabled
[10/27/2022-13:44:51] [I] Sparsity: Disabled
[10/27/2022-13:44:51] [I] Safe mode: Disabled
[10/27/2022-13:44:51] [I] DirectIO mode: Disabled
[10/27/2022-13:44:51] [I] Restricted mode: Disabled
[10/27/2022-13:44:51] [I] Build only: Disabled
[10/27/2022-13:44:51] [I] Save engine: 
[10/27/2022-13:44:51] [I] Load engine: resnet10.caffemodel_b6_dla0_fp16.engine
[10/27/2022-13:44:51] [I] Profiling verbosity: 0
[10/27/2022-13:44:51] [I] Tactic sources: Using default tactic sources
[10/27/2022-13:44:51] [I] timingCacheMode: local
[10/27/2022-13:44:51] [I] timingCacheFile: 
[10/27/2022-13:44:51] [I] Input(s)s format: fp32:CHW
[10/27/2022-13:44:51] [I] Output(s)s format: fp32:CHW
[10/27/2022-13:44:51] [I] Input build shapes: model
[10/27/2022-13:44:51] [I] Input calibration shapes: model
[10/27/2022-13:44:51] [I] === System Options ===
[10/27/2022-13:44:51] [I] Device: 0
[10/27/2022-13:44:51] [I] DLACore: 0(With GPU fallback)
[10/27/2022-13:44:51] [I] Plugins:
[10/27/2022-13:44:51] [I] === Inference Options ===
[10/27/2022-13:44:51] [I] Batch: 6
[10/27/2022-13:44:51] [I] Input inference shapes: model
[10/27/2022-13:44:51] [I] Iterations: 10
[10/27/2022-13:44:51] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-13:44:51] [I] Sleep time: 0ms
[10/27/2022-13:44:51] [I] Idle time: 0ms
[10/27/2022-13:44:51] [I] Streams: 1
[10/27/2022-13:44:51] [I] ExposeDMA: Disabled
[10/27/2022-13:44:51] [I] Data transfers: Enabled
[10/27/2022-13:44:51] [I] Spin-wait: Disabled
[10/27/2022-13:44:51] [I] Multithreading: Disabled
[10/27/2022-13:44:51] [I] CUDA Graph: Disabled
[10/27/2022-13:44:51] [I] Separate profiling: Disabled
[10/27/2022-13:44:51] [I] Time Deserialize: Disabled
[10/27/2022-13:44:51] [I] Time Refit: Disabled
[10/27/2022-13:44:51] [I] Inputs:
[10/27/2022-13:44:51] [I] === Reporting Options ===
[10/27/2022-13:44:51] [I] Verbose: Disabled
[10/27/2022-13:44:51] [I] Averages: 10 inferences
[10/27/2022-13:44:51] [I] Percentile: 99
[10/27/2022-13:44:51] [I] Dump refittable layers:Disabled
[10/27/2022-13:44:51] [I] Dump output: Disabled
[10/27/2022-13:44:51] [I] Profile: Enabled
[10/27/2022-13:44:51] [I] Export timing to JSON file: 
[10/27/2022-13:44:51] [I] Export output to JSON file: 
[10/27/2022-13:44:51] [I] Export profile to JSON file: 
[10/27/2022-13:44:51] [I] 
[10/27/2022-13:44:51] [I] === Device Information ===
[10/27/2022-13:44:51] [I] Selected Device: Xavier
[10/27/2022-13:44:51] [I] Compute Capability: 7.2
[10/27/2022-13:44:51] [I] SMs: 8
[10/27/2022-13:44:51] [I] Compute Clock Rate: 1.377 GHz
[10/27/2022-13:44:51] [I] Device Global Memory: 31011 MiB
[10/27/2022-13:44:51] [I] Shared Memory per SM: 96 KiB
[10/27/2022-13:44:51] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/27/2022-13:44:51] [I] Memory Clock Rate: 1.377 GHz
[10/27/2022-13:44:51] [I] 
[10/27/2022-13:44:51] [I] TensorRT version: 8.4.1
[10/27/2022-13:44:51] [I] Engine loaded in 0.00894217 sec.
[10/27/2022-13:44:52] [I] [TRT] [MemUsageChange] Init CUDA: CPU +186, GPU +0, now: CPU 215, GPU 3465 (MiB)
[10/27/2022-13:44:52] [I] [TRT] Loaded engine size: 5 MiB
[10/27/2022-13:44:52] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[10/27/2022-13:44:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +5, GPU +0, now: CPU 5, GPU 0 (MiB)
[10/27/2022-13:44:52] [I] Engine deserialized in 1.08461 sec.
[10/27/2022-13:44:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +10, now: CPU 5, GPU 10 (MiB)
[10/27/2022-13:44:52] [I] Using random values for input input_1
[10/27/2022-13:44:52] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-13:44:52] [I] Using random values for output conv2d_bbox
[10/27/2022-13:44:52] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-13:44:52] [I] Using random values for output conv2d_cov/Sigmoid
[10/27/2022-13:44:52] [I] Created output binding for conv2d_cov/Sigmoid with dimensions 4x23x40
[10/27/2022-13:44:52] [I] Starting inference
[10/27/2022-13:44:55] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[10/27/2022-13:44:55] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[10/27/2022-13:44:55] [I] 
[10/27/2022-13:44:55] [I] === Profile (119 iterations ) ===
[10/27/2022-13:44:55] [I]                                                 Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[10/27/2022-13:44:55] [I]                                        input_1 to nvm      247.47           2.0796             2.0790      8.6
[10/27/2022-13:44:55] [I]  {ForeignNode[conv1 + bn_conv1...conv2d_cov/Sigmoid]}        2.07           0.0174             0.0174      0.1
[10/27/2022-13:44:55] [I]                           conv2d_cov/Sigmoid from nvm     2611.04          21.9415            22.0744     90.8
[10/27/2022-13:44:55] [I]                                   input_1 copy finish        0.91           0.0077             0.0077      0.0
[10/27/2022-13:44:55] [I]                        conv2d_cov/Sigmoid copy finish        1.01           0.0085             0.0085      0.0
[10/27/2022-13:44:55] [I]                                  conv2d_bbox from nvm        9.99           0.0840             0.0840      0.3
[10/27/2022-13:44:55] [I]                               conv2d_bbox copy finish        2.11           0.0178             0.0178      0.1
[10/27/2022-13:44:55] [I]                                                 Total     2874.62          24.1564            24.2895    100.0
[10/27/2022-13:44:55] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine --useDLACore=0 --allowGPUFallback --batch=6 --dumpProfile

The inference time on Orin is 140.09 ms; on Xavier it is 24.15 ms.
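As a cross-check, those batch latencies can be converted to per-image throughput. A quick sketch (the `images_per_second` helper is just for illustration; the latencies are taken from the trtexec logs above, batch=6):

```python
def images_per_second(batch_size: int, batch_latency_ms: float) -> float:
    """Per-image throughput implied by a single batched-inference latency."""
    return batch_size * 1000.0 / batch_latency_ms

# Mean DLA latencies for batch=6 reported by trtexec above.
print(round(images_per_second(6, 140.09), 1))  # Orin:   ~42.8 images/s
print(round(images_per_second(6, 24.15), 1))   # Xavier: ~248.4 images/s
```

These roughly match the throughput values trtexec itself reports (40.8 qps on Orin, 225 qps on Xavier), confirming the DLA path on Orin is running about 5–6x slower than on Xavier for this engine.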

So the bottleneck is DLA itself. Why do you have to enable DLA for inferencing? What is your use case?

Orin's DLA FP16 TOPS is lower than Xavier's DLA FP16 TOPS, but Orin's DLA INT8 throughput (20 TOPS dense / 40 TOPS with sparsity) is higher than Xavier's 5 INT8 TOPS, so on Orin it is better to use the DLA in INT8 than in FP16.
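A back-of-envelope ratio of the peak figures quoted above shows the headroom switching precision could unlock (peak TOPS only; real speedups depend on layer shapes, SRAM fit, and memory bandwidth):

```python
# Peak DLA INT8 throughput figures quoted above (TOPS).
ORIN_DLA_INT8_DENSE = 20
XAVIER_DLA_INT8 = 5

# Naive peak ratio Orin : Xavier for DLA INT8.
print(ORIN_DLA_INT8_DENSE / XAVIER_DLA_INT8)  # 4.0
```

On the build side, trtexec can target INT8 on the DLA with the standard `--int8 --useDLACore=0 --allowGPUFallback` options (plus `--calib=<cache>` to supply a calibration cache).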

According to the documentation, the DLA accounts for about 1/3 of Orin's overall AI performance. We run multiple models on Orin, so we want to run some of them on the DLA.
Do you have specific data on how much lower it is than the Xavier DLA FP16 TOPS?

What does this line in the profile mean: "conv2d_cov/Sigmoid from nvm 3425.06 137.0022"?

We have some data but we can not share with you.

conv2d_cov/Sigmoid is one layer inside the model. Depending on the model you run, time is measured at the layer level.
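When the profile table gets long, it can help to sort it programmatically instead of reading the dump. A small sketch that reads the JSON written by trtexec's `--exportProfile` option and lists the slowest layers (the `averageMs`/`name` field names are assumed from the TensorRT 8.x output format, where the first array entry holds only the iteration count):

```python
import json

def slowest_layers(profile_path, top=3):
    """Slowest layers from a trtexec --exportProfile JSON dump,
    as (layer name, average ms) pairs, largest first."""
    with open(profile_path) as f:
        entries = json.load(f)
    # The first entry carries the iteration count; per-layer
    # entries carry an "averageMs" field.
    layers = [e for e in entries if "averageMs" in e]
    layers.sort(key=lambda e: e["averageMs"], reverse=True)
    return [(e["name"], e["averageMs"]) for e in layers[:top]]
```

Running `trtexec --loadEngine=... --useDLACore=0 --dumpProfile --exportProfile=profile.json` and then `slowest_layers("profile.json")` would surface the `conv2d_cov/Sigmoid from nvm` entry immediately.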

Hmm… why does conv2d_cov/Sigmoid take so long? This looks like a defect in TensorRT or the DLA.

If you think it is a defect, you can raise it in the TensorRT forum: Latest TensorRT topics in GPU-Accelerated Libraries - NVIDIA Developer Forums

Thank you, I will open a new topic.
About this topic, I still have doubts about what caused the Orin FP16 DLA performance to be lower than Xavier's. I see that Orin also has two DLA cores, and their frequency is 1.6 GHz versus 1.4 GHz on Xavier. And my FP16 GPU test performance is also lower than Xavier's:
(image attachment: FP16 GPU performance comparison)

Is this result consistent with your internal test conclusions?

There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one.
Thanks

How did you test the FPS? With the deepstream-app sample? Have you followed the method in Performance — DeepStream 6.1.1 Release documentation?

Please refer to the performance data here: Performance — DeepStream 6.1.1 Release documentation

For your YoloV7 model, what is the postprocessing? Is the postprocessing done on the CPU?