I measured the inference time with trtexec, but it failed:
/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine
[10/26/2022-20:28:53] [I] === Model Options ===
[10/26/2022-20:28:53] [I] Format: *
[10/26/2022-20:28:53] [I] Model:
[10/26/2022-20:28:53] [I] Output:
[10/26/2022-20:28:53] [I] === Build Options ===
[10/26/2022-20:28:53] [I] Max batch: 1
[10/26/2022-20:28:53] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/26/2022-20:28:53] [I] minTiming: 1
[10/26/2022-20:28:53] [I] avgTiming: 8
[10/26/2022-20:28:53] [I] Precision: FP32
[10/26/2022-20:28:53] [I] LayerPrecisions:
[10/26/2022-20:28:53] [I] Calibration:
[10/26/2022-20:28:53] [I] Refit: Disabled
[10/26/2022-20:28:53] [I] Sparsity: Disabled
[10/26/2022-20:28:53] [I] Safe mode: Disabled
[10/26/2022-20:28:53] [I] DirectIO mode: Disabled
[10/26/2022-20:28:53] [I] Restricted mode: Disabled
[10/26/2022-20:28:53] [I] Build only: Disabled
[10/26/2022-20:28:53] [I] Save engine:
[10/26/2022-20:28:53] [I] Load engine: resnet10.caffemodel_b6_dla0_fp16.engine
[10/26/2022-20:28:53] [I] Profiling verbosity: 0
[10/26/2022-20:28:53] [I] Tactic sources: Using default tactic sources
[10/26/2022-20:28:53] [I] timingCacheMode: local
[10/26/2022-20:28:53] [I] timingCacheFile:
[10/26/2022-20:28:53] [I] Input(s)s format: fp32:CHW
[10/26/2022-20:28:53] [I] Output(s)s format: fp32:CHW
[10/26/2022-20:28:53] [I] Input build shapes: model
[10/26/2022-20:28:53] [I] Input calibration shapes: model
[10/26/2022-20:28:53] [I] === System Options ===
[10/26/2022-20:28:53] [I] Device: 0
[10/26/2022-20:28:53] [I] DLACore:
[10/26/2022-20:28:53] [I] Plugins:
[10/26/2022-20:28:53] [I] === Inference Options ===
[10/26/2022-20:28:53] [I] Batch: 1
[10/26/2022-20:28:53] [I] Input inference shapes: model
[10/26/2022-20:28:53] [I] Iterations: 10
[10/26/2022-20:28:53] [I] Duration: 3s (+ 200ms warm up)
[10/26/2022-20:28:53] [I] Sleep time: 0ms
[10/26/2022-20:28:53] [I] Idle time: 0ms
[10/26/2022-20:28:53] [I] Streams: 1
[10/26/2022-20:28:53] [I] ExposeDMA: Disabled
[10/26/2022-20:28:53] [I] Data transfers: Enabled
[10/26/2022-20:28:53] [I] Spin-wait: Disabled
[10/26/2022-20:28:53] [I] Multithreading: Disabled
[10/26/2022-20:28:53] [I] CUDA Graph: Disabled
[10/26/2022-20:28:53] [I] Separate profiling: Disabled
[10/26/2022-20:28:53] [I] Time Deserialize: Disabled
[10/26/2022-20:28:53] [I] Time Refit: Disabled
[10/26/2022-20:28:53] [I] Inputs:
[10/26/2022-20:28:53] [I] === Reporting Options ===
[10/26/2022-20:28:53] [I] Verbose: Disabled
[10/26/2022-20:28:53] [I] Averages: 10 inferences
[10/26/2022-20:28:53] [I] Percentile: 99
[10/26/2022-20:28:53] [I] Dump refittable layers:Disabled
[10/26/2022-20:28:53] [I] Dump output: Disabled
[10/26/2022-20:28:53] [I] Profile: Disabled
[10/26/2022-20:28:53] [I] Export timing to JSON file:
[10/26/2022-20:28:53] [I] Export output to JSON file:
[10/26/2022-20:28:53] [I] Export profile to JSON file:
[10/26/2022-20:28:53] [I]
[10/26/2022-20:28:53] [I] === Device Information ===
[10/26/2022-20:28:53] [I] Selected Device: Orin
[10/26/2022-20:28:53] [I] Compute Capability: 8.7
[10/26/2022-20:28:53] [I] SMs: 16
[10/26/2022-20:28:53] [I] Compute Clock Rate: 1.3 GHz
[10/26/2022-20:28:53] [I] Device Global Memory: 30535 MiB
[10/26/2022-20:28:53] [I] Shared Memory per SM: 164 KiB
[10/26/2022-20:28:53] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/26/2022-20:28:53] [I] Memory Clock Rate: 1.3 GHz
[10/26/2022-20:28:53] [I]
[10/26/2022-20:28:53] [I] TensorRT version: 8.4.1
[10/26/2022-20:28:53] [I] Engine loaded in 0.00268695 sec.
[10/26/2022-20:28:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 245, GPU 23642 (MiB)
[10/26/2022-20:28:54] [I] [TRT] Loaded engine size: 3 MiB
[10/26/2022-20:28:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/26/2022-20:28:54] [I] Engine deserialized in 1.00899 sec.
[10/26/2022-20:28:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +51, now: CPU 3, GPU 51 (MiB)
[10/26/2022-20:28:54] [I] Using random values for input input_1
[10/26/2022-20:28:54] [I] Created input binding for input_1 with dimensions 3x368x640
[10/26/2022-20:28:54] [I] Using random values for output conv2d_bbox
[10/26/2022-20:28:54] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/26/2022-20:28:54] [I] Using random values for output conv2d_cov/Sigmoid
[10/26/2022-20:28:54] [I] Created output binding for conv2d_cov/Sigmoid with dimensions 4x23x40
[10/26/2022-20:28:54] [I] Starting inference
[10/26/2022-20:28:55] [E] Error[1]: [nvdlaUtils.cpp::submit::199] Error Code 1: DLA (Failure to submit program to DLA engine.)
[10/26/2022-20:28:55] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.caffemodel_b6_dla0_fp16.engine
terminate called after throwing an instance of 'nvinfer1::InternalError'
what(): Assertion !mCudaMemory || !mNvmTensor failed.
Aborted
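Two things in the failing run differ from how the engine was built: the file name suggests it was built for batch 6 on DLA core 0 (`_b6_dla0_`), while the log above shows `Batch: 1` and an empty `DLACore:`. As a hedged first step (an assumption, not a verified fix), the run could be repeated with both options matched to the build. The snippet below only composes and prints the command, since it can only actually execute on the Jetson:

```shell
# Sketch: re-run the prebuilt DLA engine with options matching its build.
# Assumptions (from the file name, not verified): batch size 6, DLA core 0.
ENGINE=resnet10.caffemodel_b6_dla0_fp16.engine
CMD="trtexec --loadEngine=${ENGINE} --batch=6 --useDLACore=0"
# Printed rather than executed here; run it on the target device.
echo "${CMD}"
```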
resnet10.caffemodel_b6_dla0_fp16.engine is the engine automatically generated by deepstream-app.
I then tried building the TensorRT engine myself with trtexec:
/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --deploy=resnet10.prototxt --model=resnet10.caffemodel --saveEngine=resnet10.dla.engine --fp16 --output='conv2d_bbox' --useDLACore=0 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --deploy=resnet10.prototxt --model=resnet10.caffemodel --saveEngine=resnet10.dla.engine --fp16 --output=conv2d_bbox --useDLACore=0 --allowGPUFallback
[10/27/2022-10:48:53] [I] === Model Options ===
[10/27/2022-10:48:53] [I] Format: Caffe
[10/27/2022-10:48:53] [I] Model: resnet10.caffemodel
[10/27/2022-10:48:53] [I] Prototxt: resnet10.prototxt
[10/27/2022-10:48:53] [I] Output: conv2d_bbox
[10/27/2022-10:48:53] [I] === Build Options ===
[10/27/2022-10:48:53] [I] Max batch: 1
[10/27/2022-10:48:53] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-10:48:53] [I] minTiming: 1
[10/27/2022-10:48:53] [I] avgTiming: 8
[10/27/2022-10:48:53] [I] Precision: FP32+FP16
[10/27/2022-10:48:53] [I] LayerPrecisions:
[10/27/2022-10:48:53] [I] Calibration:
[10/27/2022-10:48:53] [I] Refit: Disabled
[10/27/2022-10:48:53] [I] Sparsity: Disabled
[10/27/2022-10:48:53] [I] Safe mode: Disabled
[10/27/2022-10:48:53] [I] DirectIO mode: Disabled
[10/27/2022-10:48:53] [I] Restricted mode: Disabled
[10/27/2022-10:48:53] [I] Build only: Disabled
[10/27/2022-10:48:53] [I] Save engine: resnet10.dla.engine
[10/27/2022-10:48:53] [I] Load engine:
[10/27/2022-10:48:53] [I] Profiling verbosity: 0
[10/27/2022-10:48:53] [I] Tactic sources: Using default tactic sources
[10/27/2022-10:48:53] [I] timingCacheMode: local
[10/27/2022-10:48:53] [I] timingCacheFile:
[10/27/2022-10:48:53] [I] Input(s)s format: fp32:CHW
[10/27/2022-10:48:53] [I] Output(s)s format: fp32:CHW
[10/27/2022-10:48:53] [I] Input build shapes: model
[10/27/2022-10:48:53] [I] Input calibration shapes: model
[10/27/2022-10:48:53] [I] === System Options ===
[10/27/2022-10:48:53] [I] Device: 0
[10/27/2022-10:48:53] [I] DLACore: 0(With GPU fallback)
[10/27/2022-10:48:53] [I] Plugins:
[10/27/2022-10:48:53] [I] === Inference Options ===
[10/27/2022-10:48:53] [I] Batch: 1
[10/27/2022-10:48:53] [I] Input inference shapes: model
[10/27/2022-10:48:53] [I] Iterations: 10
[10/27/2022-10:48:53] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-10:48:53] [I] Sleep time: 0ms
[10/27/2022-10:48:53] [I] Idle time: 0ms
[10/27/2022-10:48:53] [I] Streams: 1
[10/27/2022-10:48:53] [I] ExposeDMA: Disabled
[10/27/2022-10:48:53] [I] Data transfers: Enabled
[10/27/2022-10:48:53] [I] Spin-wait: Disabled
[10/27/2022-10:48:53] [I] Multithreading: Disabled
[10/27/2022-10:48:53] [I] CUDA Graph: Disabled
[10/27/2022-10:48:53] [I] Separate profiling: Disabled
[10/27/2022-10:48:53] [I] Time Deserialize: Disabled
[10/27/2022-10:48:53] [I] Time Refit: Disabled
[10/27/2022-10:48:53] [I] Inputs:
[10/27/2022-10:48:53] [I] === Reporting Options ===
[10/27/2022-10:48:53] [I] Verbose: Disabled
[10/27/2022-10:48:53] [I] Averages: 10 inferences
[10/27/2022-10:48:53] [I] Percentile: 99
[10/27/2022-10:48:53] [I] Dump refittable layers:Disabled
[10/27/2022-10:48:53] [I] Dump output: Disabled
[10/27/2022-10:48:53] [I] Profile: Disabled
[10/27/2022-10:48:53] [I] Export timing to JSON file:
[10/27/2022-10:48:53] [I] Export output to JSON file:
[10/27/2022-10:48:53] [I] Export profile to JSON file:
[10/27/2022-10:48:53] [I]
[10/27/2022-10:48:53] [I] === Device Information ===
[10/27/2022-10:48:53] [I] Selected Device: Orin
[10/27/2022-10:48:53] [I] Compute Capability: 8.7
[10/27/2022-10:48:53] [I] SMs: 16
[10/27/2022-10:48:53] [I] Compute Clock Rate: 1.3 GHz
[10/27/2022-10:48:53] [I] Device Global Memory: 30535 MiB
[10/27/2022-10:48:53] [I] Shared Memory per SM: 164 KiB
[10/27/2022-10:48:53] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/27/2022-10:48:53] [I] Memory Clock Rate: 1.3 GHz
[10/27/2022-10:48:53] [I]
[10/27/2022-10:48:53] [I] TensorRT version: 8.4.1
[10/27/2022-10:48:53] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 242, GPU 8269 (MiB)
[10/27/2022-10:48:56] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +351, GPU +330, now: CPU 612, GPU 8616 (MiB)
[10/27/2022-10:48:56] [W] [TRT] The implicit batch dimension mode has been deprecated. Please create the network with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag whenever possible.
[10/27/2022-10:48:56] [I] Start parsing network model
[10/27/2022-10:48:56] [I] Finish parsing network model
[10/27/2022-10:48:57] [I] [TRT] ---------- Layers Running on DLA ----------
[10/27/2022-10:48:57] [I] [TRT] [DlaLayer] {ForeignNode[conv1 + bn_conv1...conv2d_bbox]}
[10/27/2022-10:48:57] [I] [TRT] ---------- Layers Running on GPU ----------
[10/27/2022-10:48:58] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +662, now: CPU 1160, GPU 9309 (MiB)
[10/27/2022-10:48:58] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +86, GPU +143, now: CPU 1246, GPU 9452 (MiB)
[10/27/2022-10:48:58] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/27/2022-10:49:00] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[10/27/2022-10:49:00] [I] [TRT] Total Host Persistent Memory: 848
[10/27/2022-10:49:00] [I] [TRT] Total Device Persistent Memory: 0
[10/27/2022-10:49:00] [I] [TRT] Total Scratch Memory: 0
[10/27/2022-10:49:00] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 9 MiB
[10/27/2022-10:49:00] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.014944ms to assign 2 blocks to 2 nodes requiring 8949760 bytes.
[10/27/2022-10:49:00] [I] [TRT] Total Activation Memory: 8949760
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-10:49:00] [I] Engine built in 7.04876 sec.
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 886, GPU 9495 (MiB)
[10/27/2022-10:49:00] [I] [TRT] Loaded engine size: 3 MiB
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-10:49:00] [I] Engine deserialized in 0.00217923 sec.
[10/27/2022-10:49:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8, now: CPU 3, GPU 8 (MiB)
[10/27/2022-10:49:00] [I] Using random values for input input_1
[10/27/2022-10:49:00] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-10:49:00] [I] Using random values for output conv2d_bbox
[10/27/2022-10:49:00] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-10:49:00] [I] Starting inference
[10/27/2022-10:49:03] [I] Warmup completed 9 queries over 200 ms
[10/27/2022-10:49:03] [I] Timing trace has 133 queries over 3.0752 s
[10/27/2022-10:49:03] [I]
[10/27/2022-10:49:03] [I] === Trace details ===
[10/27/2022-10:49:03] [I] Trace averages of 10 runs:
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9712 ms - Host latency: 23.1162 ms (enqueue 22.8476 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 23.0688 ms - Host latency: 23.2141 ms (enqueue 23.0029 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 23.0611 ms - Host latency: 23.2024 ms (enqueue 23.0085 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9105 ms - Host latency: 23.0498 ms (enqueue 22.9018 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.8784 ms - Host latency: 23.0175 ms (enqueue 22.8028 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9854 ms - Host latency: 23.1289 ms (enqueue 22.8964 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.997 ms - Host latency: 23.143 ms (enqueue 22.9572 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9697 ms - Host latency: 23.1113 ms (enqueue 22.8876 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.8642 ms - Host latency: 23.0036 ms (enqueue 22.8416 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9444 ms - Host latency: 23.0885 ms (enqueue 22.8449 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.9925 ms - Host latency: 23.1364 ms (enqueue 22.9485 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.8987 ms - Host latency: 23.0356 ms (enqueue 22.8807 ms)
[10/27/2022-10:49:03] [I] Average on 10 runs - GPU latency: 22.7963 ms - Host latency: 22.9571 ms (enqueue 22.7223 ms)
[10/27/2022-10:49:03] [I]
[10/27/2022-10:49:03] [I] === Performance summary ===
[10/27/2022-10:49:03] [I] Throughput: 43.2493 qps
[10/27/2022-10:49:03] [I] Latency: min = 22.8521 ms, max = 23.998 ms, mean = 23.0956 ms, median = 22.9065 ms, percentile(99%) = 23.8329 ms
[10/27/2022-10:49:03] [I] Enqueue Time: min = 22.3805 ms, max = 23.8909 ms, mean = 22.8901 ms, median = 22.8617 ms, percentile(99%) = 23.592 ms
[10/27/2022-10:49:03] [I] H2D Latency: min = 0.120605 ms, max = 0.1875 ms, mean = 0.133684 ms, median = 0.125244 ms, percentile(99%) = 0.182373 ms
[10/27/2022-10:49:03] [I] GPU Compute Time: min = 22.7177 ms, max = 23.8459 ms, mean = 22.9511 ms, median = 22.7659 ms, percentile(99%) = 23.6998 ms
[10/27/2022-10:49:03] [I] D2H Latency: min = 0.00708008 ms, max = 0.013916 ms, mean = 0.0108492 ms, median = 0.0109253 ms, percentile(99%) = 0.0134277 ms
[10/27/2022-10:49:03] [I] Total Host Walltime: 3.0752 s
[10/27/2022-10:49:03] [I] Total GPU Compute Time: 3.05249 s
[10/27/2022-10:49:03] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[10/27/2022-10:49:03] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[10/27/2022-10:49:03] [W] * GPU compute time is unstable, with coefficient of variance = 1.26437%.
[10/27/2022-10:49:03] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/27/2022-10:49:03] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/27/2022-10:49:03] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --deploy=resnet10.prototxt --model=resnet10.caffemodel --saveEngine=resnet10.dla.engine --fp16 --output=conv2d_bbox --useDLACore=0 --allowGPUFallback
Then I measured the inference time of the new engine:
/opt/nvidia/deepstream/deepstream-6.1/samples/models/Primary_Detector$ trtexec --loadEngine=resnet10.dla.engine --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.dla.engine --dumpProfile
[10/27/2022-10:49:38] [I] === Model Options ===
[10/27/2022-10:49:38] [I] Format: *
[10/27/2022-10:49:38] [I] Model:
[10/27/2022-10:49:38] [I] Output:
[10/27/2022-10:49:38] [I] === Build Options ===
[10/27/2022-10:49:38] [I] Max batch: 1
[10/27/2022-10:49:38] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/27/2022-10:49:38] [I] minTiming: 1
[10/27/2022-10:49:38] [I] avgTiming: 8
[10/27/2022-10:49:38] [I] Precision: FP32
[10/27/2022-10:49:38] [I] LayerPrecisions:
[10/27/2022-10:49:38] [I] Calibration:
[10/27/2022-10:49:38] [I] Refit: Disabled
[10/27/2022-10:49:38] [I] Sparsity: Disabled
[10/27/2022-10:49:38] [I] Safe mode: Disabled
[10/27/2022-10:49:38] [I] DirectIO mode: Disabled
[10/27/2022-10:49:38] [I] Restricted mode: Disabled
[10/27/2022-10:49:38] [I] Build only: Disabled
[10/27/2022-10:49:38] [I] Save engine:
[10/27/2022-10:49:38] [I] Load engine: resnet10.dla.engine
[10/27/2022-10:49:38] [I] Profiling verbosity: 0
[10/27/2022-10:49:38] [I] Tactic sources: Using default tactic sources
[10/27/2022-10:49:38] [I] timingCacheMode: local
[10/27/2022-10:49:38] [I] timingCacheFile:
[10/27/2022-10:49:38] [I] Input(s)s format: fp32:CHW
[10/27/2022-10:49:38] [I] Output(s)s format: fp32:CHW
[10/27/2022-10:49:38] [I] Input build shapes: model
[10/27/2022-10:49:38] [I] Input calibration shapes: model
[10/27/2022-10:49:38] [I] === System Options ===
[10/27/2022-10:49:38] [I] Device: 0
[10/27/2022-10:49:38] [I] DLACore:
[10/27/2022-10:49:38] [I] Plugins:
[10/27/2022-10:49:38] [I] === Inference Options ===
[10/27/2022-10:49:38] [I] Batch: 1
[10/27/2022-10:49:38] [I] Input inference shapes: model
[10/27/2022-10:49:38] [I] Iterations: 10
[10/27/2022-10:49:38] [I] Duration: 3s (+ 200ms warm up)
[10/27/2022-10:49:38] [I] Sleep time: 0ms
[10/27/2022-10:49:38] [I] Idle time: 0ms
[10/27/2022-10:49:38] [I] Streams: 1
[10/27/2022-10:49:38] [I] ExposeDMA: Disabled
[10/27/2022-10:49:38] [I] Data transfers: Enabled
[10/27/2022-10:49:38] [I] Spin-wait: Disabled
[10/27/2022-10:49:38] [I] Multithreading: Disabled
[10/27/2022-10:49:38] [I] CUDA Graph: Disabled
[10/27/2022-10:49:38] [I] Separate profiling: Disabled
[10/27/2022-10:49:38] [I] Time Deserialize: Disabled
[10/27/2022-10:49:38] [I] Time Refit: Disabled
[10/27/2022-10:49:38] [I] Inputs:
[10/27/2022-10:49:38] [I] === Reporting Options ===
[10/27/2022-10:49:38] [I] Verbose: Disabled
[10/27/2022-10:49:38] [I] Averages: 10 inferences
[10/27/2022-10:49:38] [I] Percentile: 99
[10/27/2022-10:49:38] [I] Dump refittable layers:Disabled
[10/27/2022-10:49:38] [I] Dump output: Disabled
[10/27/2022-10:49:38] [I] Profile: Enabled
[10/27/2022-10:49:38] [I] Export timing to JSON file:
[10/27/2022-10:49:38] [I] Export output to JSON file:
[10/27/2022-10:49:38] [I] Export profile to JSON file:
[10/27/2022-10:49:38] [I]
[10/27/2022-10:49:39] [I] === Device Information ===
[10/27/2022-10:49:39] [I] Selected Device: Orin
[10/27/2022-10:49:39] [I] Compute Capability: 8.7
[10/27/2022-10:49:39] [I] SMs: 16
[10/27/2022-10:49:39] [I] Compute Clock Rate: 1.3 GHz
[10/27/2022-10:49:39] [I] Device Global Memory: 30535 MiB
[10/27/2022-10:49:39] [I] Shared Memory per SM: 164 KiB
[10/27/2022-10:49:39] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/27/2022-10:49:39] [I] Memory Clock Rate: 1.3 GHz
[10/27/2022-10:49:39] [I]
[10/27/2022-10:49:39] [I] TensorRT version: 8.4.1
[10/27/2022-10:49:39] [I] Engine loaded in 0.003854 sec.
[10/27/2022-10:49:39] [I] [TRT] [MemUsageChange] Init CUDA: CPU +218, GPU +0, now: CPU 245, GPU 8271 (MiB)
[10/27/2022-10:49:39] [I] [TRT] Loaded engine size: 3 MiB
[10/27/2022-10:49:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +3, GPU +0, now: CPU 3, GPU 0 (MiB)
[10/27/2022-10:49:39] [I] Engine deserialized in 0.525183 sec.
[10/27/2022-10:49:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8, now: CPU 3, GPU 8 (MiB)
[10/27/2022-10:49:39] [I] Using random values for input input_1
[10/27/2022-10:49:39] [I] Created input binding for input_1 with dimensions 3x368x640
[10/27/2022-10:49:39] [I] Using random values for output conv2d_bbox
[10/27/2022-10:49:39] [I] Created output binding for conv2d_bbox with dimensions 16x23x40
[10/27/2022-10:49:39] [I] Starting inference
[10/27/2022-10:49:42] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[10/27/2022-10:49:42] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[10/27/2022-10:49:42] [I]
[10/27/2022-10:49:42] [I] === Profile (139 iterations ) ===
[10/27/2022-10:49:42] [I] Layer Time (ms) Avg. Time (ms) Median Time (ms) Time %
[10/27/2022-10:49:42] [I] input_1 to nvm 12.62 0.0908 0.0903 0.4
[10/27/2022-10:49:42] [I] Reformatting CopyNode for Input Tensor 0 to {ForeignNode[conv1 + bn_conv1...conv2d_bbox]} 54.65 0.3932 0.3930 1.7
[10/27/2022-10:49:42] [I] {ForeignNode[conv1 + bn_conv1...conv2d_bbox]} 4.31 0.0310 0.0307 0.1
[10/27/2022-10:49:42] [I] conv2d_bbox from nvm 3132.38 22.5351 22.5350 97.7
[10/27/2022-10:49:42] [I] Reformatted Input Tensor 0 to {ForeignNode[conv1 + bn_conv1...conv2d_bbox]} finish 0.59 0.0043 0.0042 0.0
[10/27/2022-10:49:42] [I] conv2d_bbox copy finish 0.90 0.0065 0.0064 0.0
[10/27/2022-10:49:42] [I] Total 3205.45 23.0608 23.0599 100.0
[10/27/2022-10:49:42] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet10.dla.engine --dumpProfile
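The profiler run above withheld the end-to-end timing and suggested `--separateProfileRun`. Following that hint, the layer profile and the e2e timing can be collected in one invocation; this is a sketch of the suggested command (composed and printed only, since it needs the Jetson to run):

```shell
# Sketch: profile per-layer timing in a separate pass so trtexec also
# reports end-to-end network timing (per the hint in its own log).
CMD="trtexec --loadEngine=resnet10.dla.engine --dumpProfile --separateProfileRun"
echo "${CMD}"
```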