&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet50_sim_mod_DLA_fp16.trt --useDLACore=0 --fp16 --dumpProfile
[05/17/2022-15:33:17] [I] === Model Options ===
[05/17/2022-15:33:17] [I] Format: *
[05/17/2022-15:33:17] [I] Model:
[05/17/2022-15:33:17] [I] Output:
[05/17/2022-15:33:17] [I] === Build Options ===
[05/17/2022-15:33:17] [I] Max batch: 1
[05/17/2022-15:33:17] [I] Workspace: 16 MiB
[05/17/2022-15:33:17] [I] minTiming: 1
[05/17/2022-15:33:17] [I] avgTiming: 8
[05/17/2022-15:33:17] [I] Precision: FP32+FP16
[05/17/2022-15:33:17] [I] Calibration:
[05/17/2022-15:33:17] [I] Refit: Disabled
[05/17/2022-15:33:17] [I] Sparsity: Disabled
[05/17/2022-15:33:17] [I] Safe mode: Disabled
[05/17/2022-15:33:17] [I] DirectIO mode: Disabled
[05/17/2022-15:33:17] [I] Restricted mode: Disabled
[05/17/2022-15:33:17] [I] Save engine:
[05/17/2022-15:33:17] [I] Load engine: resnet50_sim_mod_DLA_fp16.trt
[05/17/2022-15:33:17] [I] Profiling verbosity: 0
[05/17/2022-15:33:17] [I] Tactic sources: Using default tactic sources
[05/17/2022-15:33:17] [I] timingCacheMode: local
[05/17/2022-15:33:17] [I] timingCacheFile:
[05/17/2022-15:33:17] [I] Input(s)s format: fp32:CHW
[05/17/2022-15:33:17] [I] Output(s)s format: fp32:CHW
[05/17/2022-15:33:17] [I] Input build shapes: model
[05/17/2022-15:33:17] [I] Input calibration shapes: model
[05/17/2022-15:33:17] [I] === System Options ===
[05/17/2022-15:33:17] [I] Device: 0
[05/17/2022-15:33:17] [I] DLACore: 0
[05/17/2022-15:33:17] [I] Plugins:
[05/17/2022-15:33:17] [I] === Inference Options ===
[05/17/2022-15:33:17] [I] Batch: 1
[05/17/2022-15:33:17] [I] Input inference shapes: model
[05/17/2022-15:33:17] [I] Iterations: 10
[05/17/2022-15:33:17] [I] Duration: 3s (+ 200ms warm up)
[05/17/2022-15:33:17] [I] Sleep time: 0ms
[05/17/2022-15:33:17] [I] Idle time: 0ms
[05/17/2022-15:33:17] [I] Streams: 1
[05/17/2022-15:33:17] [I] ExposeDMA: Disabled
[05/17/2022-15:33:17] [I] Data transfers: Enabled
[05/17/2022-15:33:17] [I] Spin-wait: Disabled
[05/17/2022-15:33:17] [I] Multithreading: Disabled
[05/17/2022-15:33:17] [I] CUDA Graph: Disabled
[05/17/2022-15:33:17] [I] Separate profiling: Disabled
[05/17/2022-15:33:17] [I] Time Deserialize: Disabled
[05/17/2022-15:33:17] [I] Time Refit: Disabled
[05/17/2022-15:33:17] [I] Skip inference: Disabled
[05/17/2022-15:33:17] [I] Inputs:
[05/17/2022-15:33:17] [I] === Reporting Options ===
[05/17/2022-15:33:17] [I] Verbose: Disabled
[05/17/2022-15:33:17] [I] Averages: 10 inferences
[05/17/2022-15:33:17] [I] Percentile: 99
[05/17/2022-15:33:17] [I] Dump refittable layers:Disabled
[05/17/2022-15:33:17] [I] Dump output: Disabled
[05/17/2022-15:33:17] [I] Profile: Enabled
[05/17/2022-15:33:17] [I] Export timing to JSON file:
[05/17/2022-15:33:17] [I] Export output to JSON file:
[05/17/2022-15:33:17] [I] Export profile to JSON file:
[05/17/2022-15:33:17] [I]
[05/17/2022-15:33:17] [I] === Device Information ===
[05/17/2022-15:33:17] [I] Selected Device: Xavier
[05/17/2022-15:33:17] [I] Compute Capability: 7.2
[05/17/2022-15:33:17] [I] SMs: 8
[05/17/2022-15:33:17] [I] Compute Clock Rate: 1.377 GHz
[05/17/2022-15:33:17] [I] Device Global Memory: 15824 MiB
[05/17/2022-15:33:17] [I] Shared Memory per SM: 96 KiB
[05/17/2022-15:33:17] [I] Memory Bus Width: 256 bits (ECC disabled)
[05/17/2022-15:33:17] [I] Memory Clock Rate: 1.377 GHz
[05/17/2022-15:33:17] [I]
[05/17/2022-15:33:17] [I] TensorRT version: 8.2.1
[05/17/2022-15:33:18] [I] [TRT] [MemUsageChange] Init CUDA: CPU +362, GPU +0, now: CPU 438, GPU 2775 (MiB)
[05/17/2022-15:33:18] [I] [TRT] Loaded engine size: 57 MiB
[05/17/2022-15:33:19] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +381, now: CPU 723, GPU 3218 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +510, now: CPU 1030, GPU 3728 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +57, GPU +0, now: CPU 57, GPU 0 (MiB)
[05/17/2022-15:33:21] [I] Engine loaded in 3.96407 sec.
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 973, GPU 3672 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 973, GPU 3672 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 57, GPU 0 (MiB)
[05/17/2022-15:33:21] [I] Using random values for input input
[05/17/2022-15:33:21] [I] Created input binding for input with dimensions 1x3x224x224
[05/17/2022-15:33:21] [I] Using random values for output output
[05/17/2022-15:33:21] [I] Created output binding for output with dimensions 1x2048x7x7
[05/17/2022-15:33:21] [I] Starting inference
[05/17/2022-15:33:24] [W] The network timing report will not be accurate due to extra synchronizations when profiler is enabled.
[05/17/2022-15:33:24] [W] Add --separateProfileRun to profile layer timing in a separate run.
[05/17/2022-15:33:24] [I] Warmup completed 32 queries over 200 ms
[05/17/2022-15:33:24] [I] Timing trace has 480 queries over 3.00963 s
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I] === Trace details ===
[05/17/2022-15:33:24] [I] Trace averages of 10 runs:
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16177 ms - Host latency: 6.22369 ms (end to end 6.24003 ms, enqueue 6.14611 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16862 ms - Host latency: 6.23669 ms (end to end 6.26102 ms, enqueue 6.16033 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.13758 ms - Host latency: 6.19655 ms (end to end 6.21532 ms, enqueue 6.12043 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10946 ms - Host latency: 6.17657 ms (end to end 6.19315 ms, enqueue 6.10427 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16763 ms - Host latency: 6.23172 ms (end to end 6.24612 ms, enqueue 6.16299 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14086 ms - Host latency: 6.20319 ms (end to end 6.22084 ms, enqueue 6.13682 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.17618 ms - Host latency: 6.23749 ms (end to end 6.26259 ms, enqueue 6.16601 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16353 ms - Host latency: 6.23046 ms (end to end 6.25458 ms, enqueue 6.15712 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15438 ms - Host latency: 6.21342 ms (end to end 6.2404 ms, enqueue 6.14744 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16199 ms - Host latency: 6.22346 ms (end to end 6.24208 ms, enqueue 6.1566 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15498 ms - Host latency: 6.22508 ms (end to end 6.24077 ms, enqueue 6.15267 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.26512 ms - Host latency: 6.34066 ms (end to end 6.36868 ms, enqueue 6.25579 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10889 ms - Host latency: 6.16764 ms (end to end 6.18762 ms, enqueue 6.10363 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.25175 ms - Host latency: 6.33154 ms (end to end 6.35705 ms, enqueue 6.24551 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.18284 ms - Host latency: 6.2514 ms (end to end 6.27094 ms, enqueue 6.17926 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.102 ms - Host latency: 6.16346 ms (end to end 6.17585 ms, enqueue 6.09996 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.20061 ms - Host latency: 6.27068 ms (end to end 6.28948 ms, enqueue 6.19811 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14451 ms - Host latency: 6.21211 ms (end to end 6.2316 ms, enqueue 6.14548 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.21713 ms - Host latency: 6.28958 ms (end to end 6.30756 ms, enqueue 6.21417 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.09615 ms - Host latency: 6.15735 ms (end to end 6.16981 ms, enqueue 6.09597 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.19288 ms - Host latency: 6.26667 ms (end to end 6.28788 ms, enqueue 6.18798 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.21851 ms - Host latency: 6.28514 ms (end to end 6.30105 ms, enqueue 6.20518 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.19402 ms - Host latency: 6.26719 ms (end to end 6.28646 ms, enqueue 6.19226 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16362 ms - Host latency: 6.22866 ms (end to end 6.24519 ms, enqueue 6.16112 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15455 ms - Host latency: 6.21718 ms (end to end 6.23481 ms, enqueue 6.14338 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15861 ms - Host latency: 6.2287 ms (end to end 6.24449 ms, enqueue 6.15287 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.18752 ms - Host latency: 6.25522 ms (end to end 6.27189 ms, enqueue 6.18057 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14492 ms - Host latency: 6.21086 ms (end to end 6.22372 ms, enqueue 6.14467 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.13633 ms - Host latency: 6.2015 ms (end to end 6.21901 ms, enqueue 6.13101 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.07747 ms - Host latency: 6.13405 ms (end to end 6.15089 ms, enqueue 6.08212 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10186 ms - Host latency: 6.16541 ms (end to end 6.18066 ms, enqueue 6.10496 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.1332 ms - Host latency: 6.20254 ms (end to end 6.21599 ms, enqueue 6.13643 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.04216 ms - Host latency: 6.10061 ms (end to end 6.11697 ms, enqueue 6.05093 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.23162 ms - Host latency: 6.30044 ms (end to end 6.3147 ms, enqueue 6.2271 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.11357 ms - Host latency: 6.17 ms (end to end 6.18445 ms, enqueue 6.1085 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.24709 ms - Host latency: 6.32913 ms (end to end 6.34402 ms, enqueue 6.24473 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.12112 ms - Host latency: 6.18643 ms (end to end 6.20518 ms, enqueue 6.12134 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14211 ms - Host latency: 6.21328 ms (end to end 6.22866 ms, enqueue 6.14666 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.11426 ms - Host latency: 6.17644 ms (end to end 6.19192 ms, enqueue 6.11653 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.08694 ms - Host latency: 6.15312 ms (end to end 6.16914 ms, enqueue 6.08892 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.18696 ms - Host latency: 6.25774 ms (end to end 6.27104 ms, enqueue 6.18555 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14526 ms - Host latency: 6.2137 ms (end to end 6.2281 ms, enqueue 6.14365 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10569 ms - Host latency: 6.1689 ms (end to end 6.18518 ms, enqueue 6.11211 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.12554 ms - Host latency: 6.18892 ms (end to end 6.20344 ms, enqueue 6.12258 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.09829 ms - Host latency: 6.15979 ms (end to end 6.17576 ms, enqueue 6.10256 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14817 ms - Host latency: 6.21365 ms (end to end 6.2332 ms, enqueue 6.15239 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.08428 ms - Host latency: 6.15459 ms (end to end 6.1688 ms, enqueue 6.08865 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.1167 ms - Host latency: 6.17871 ms (end to end 6.19155 ms, enqueue 6.11992 ms)
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I] === Performance summary ===
[05/17/2022-15:33:24] [I] Throughput: 159.488 qps
[05/17/2022-15:33:24] [I] Latency: min = 5.91675 ms, max = 6.77173 ms, mean = 6.21691 ms, median = 6.20105 ms, percentile(99%) = 6.46057 ms
[05/17/2022-15:33:24] [I] End-to-End Host Latency: min = 5.92798 ms, max = 6.77637 ms, mean = 6.23437 ms, median = 6.21474 ms, percentile(99%) = 6.49341 ms
[05/17/2022-15:33:24] [I] Enqueue Time: min = 6.00415 ms, max = 6.45044 ms, mean = 6.14799 ms, median = 6.13251 ms, percentile(99%) = 6.37109 ms
[05/17/2022-15:33:24] [I] H2D Latency: min = 0.019043 ms, max = 0.0725098 ms, mean = 0.0314027 ms, median = 0.0292969 ms, percentile(99%) = 0.0529785 ms
[05/17/2022-15:33:24] [I] GPU Compute Time: min = 5.86108 ms, max = 6.7251 ms, mean = 6.15082 ms, median = 6.13348 ms, percentile(99%) = 6.37897 ms
[05/17/2022-15:33:24] [I] D2H Latency: min = 0.013092 ms, max = 0.0981445 ms, mean = 0.0346815 ms, median = 0.0299072 ms, percentile(99%) = 0.0751953 ms
[05/17/2022-15:33:24] [I] Total Host Walltime: 3.00963 s
[05/17/2022-15:33:24] [I] Total GPU Compute Time: 2.95239 s
[05/17/2022-15:33:24] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/17/2022-15:33:24] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/17/2022-15:33:24] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I] === Profile (512 iterations ) ===
[05/17/2022-15:33:24] [I]                               Layer   Time (ms)   Avg. Time (ms)   Time %
[05/17/2022-15:33:24] [I]                        input to nvm       52.35           0.1022      1.7
[05/17/2022-15:33:24] [I]    {ForeignNode[Conv_0...Relu_118]}      137.26           0.2681      4.5
[05/17/2022-15:33:24] [I]                     output from nvm     2886.75           5.6382     93.8
[05/17/2022-15:33:24] [I]                   input copy finish        1.39           0.0027      0.0
[05/17/2022-15:33:24] [I]                  output copy finish        1.26           0.0025      0.0
[05/17/2022-15:33:24] [I]                               Total     3079.00           6.0137    100.0
[05/17/2022-15:33:24] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet50_sim_mod_DLA_fp16.trt --useDLACore=0 --fp16 --dumpProfile
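The log itself flags two follow-ups: the `[W]` lines warn that per-layer profiling adds synchronizations that skew the timing report (suggesting `--separateProfileRun`), and that throughput may be enqueue-bound (suggesting `--useCudaGraph`). A sketch of the rerun the warnings recommend, reusing the exact engine and flags from the command above; whether CUDA graphs help on this DLA engine is not confirmed by this log:

```shell
# Rerun suggested by the two [W] lines above (a sketch, not verified on this board):
#  --separateProfileRun : collect layer timings in a second pass so the main
#                         timing trace is free of profiler synchronizations
#  --useCudaGraph       : capture enqueue work in a CUDA graph, which may help
#                         if throughput is enqueue-bound as the warning suggests
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=resnet50_sim_mod_DLA_fp16.trt \
    --useDLACore=0 --fp16 --dumpProfile \
    --separateProfileRun --useCudaGraph
```

Note that with profiling enabled here, the dominant entry is `output from nvm` at 93.8% of profiled time, so the layer table reflects DLA-to-memory output traffic more than compute; a separate profile run gives a cleaner picture.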