&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet50_sim_mod_DLA_fp16.trt --useDLACore=0 --fp16 --dumpProfile
[05/17/2022-15:33:17] [I] === Model Options ===
[05/17/2022-15:33:17] [I] Format: *
[05/17/2022-15:33:17] [I] Model:
[05/17/2022-15:33:17] [I] Output:
[05/17/2022-15:33:17] [I] === Build Options ===
[05/17/2022-15:33:17] [I] Max batch: 1
[05/17/2022-15:33:17] [I] Workspace: 16 MiB
[05/17/2022-15:33:17] [I] minTiming: 1
[05/17/2022-15:33:17] [I] avgTiming: 8
[05/17/2022-15:33:17] [I] Precision: FP32+FP16
[05/17/2022-15:33:17] [I] Calibration:
[05/17/2022-15:33:17] [I] Refit: Disabled
[05/17/2022-15:33:17] [I] Sparsity: Disabled
[05/17/2022-15:33:17] [I] Safe mode: Disabled
[05/17/2022-15:33:17] [I] DirectIO mode: Disabled
[05/17/2022-15:33:17] [I] Restricted mode: Disabled
[05/17/2022-15:33:17] [I] Save engine:
[05/17/2022-15:33:17] [I] Load engine: resnet50_sim_mod_DLA_fp16.trt
[05/17/2022-15:33:17] [I] Profiling verbosity: 0
[05/17/2022-15:33:17] [I] Tactic sources: Using default tactic sources
[05/17/2022-15:33:17] [I] timingCacheMode: local
[05/17/2022-15:33:17] [I] timingCacheFile:
[05/17/2022-15:33:17] [I] Input(s)s format: fp32:CHW
[05/17/2022-15:33:17] [I] Output(s)s format: fp32:CHW
[05/17/2022-15:33:17] [I] Input build shapes: model
[05/17/2022-15:33:17] [I] Input calibration shapes: model
[05/17/2022-15:33:17] [I] === System Options ===
[05/17/2022-15:33:17] [I] Device: 0
[05/17/2022-15:33:17] [I] DLACore: 0
[05/17/2022-15:33:17] [I] Plugins:
[05/17/2022-15:33:17] [I] === Inference Options ===
[05/17/2022-15:33:17] [I] Batch: 1
[05/17/2022-15:33:17] [I] Input inference shapes: model
[05/17/2022-15:33:17] [I] Iterations: 10
[05/17/2022-15:33:17] [I] Duration: 3s (+ 200ms warm up)
[05/17/2022-15:33:17] [I] Sleep time: 0ms
[05/17/2022-15:33:17] [I] Idle time: 0ms
[05/17/2022-15:33:17] [I] Streams: 1
[05/17/2022-15:33:17] [I] ExposeDMA: Disabled
[05/17/2022-15:33:17] [I] Data transfers: Enabled
[05/17/2022-15:33:17] [I] Spin-wait: Disabled
[05/17/2022-15:33:17] [I] Multithreading: Disabled
[05/17/2022-15:33:17] [I] CUDA Graph: Disabled
[05/17/2022-15:33:17] [I] Separate profiling: Disabled
[05/17/2022-15:33:17] [I] Time Deserialize: Disabled
[05/17/2022-15:33:17] [I] Time Refit: Disabled
[05/17/2022-15:33:17] [I] Skip inference: Disabled
[05/17/2022-15:33:17] [I] Inputs:
[05/17/2022-15:33:17] [I] === Reporting Options ===
[05/17/2022-15:33:17] [I] Verbose: Disabled
[05/17/2022-15:33:17] [I] Averages: 10 inferences
[05/17/2022-15:33:17] [I] Percentile: 99
[05/17/2022-15:33:17] [I] Dump refittable layers:Disabled
[05/17/2022-15:33:17] [I] Dump output: Disabled
[05/17/2022-15:33:17] [I] Profile: Enabled
[05/17/2022-15:33:17] [I] Export timing to JSON file:
[05/17/2022-15:33:17] [I] Export output to JSON file:
[05/17/2022-15:33:17] [I] Export profile to JSON file:
[05/17/2022-15:33:17] [I]
[05/17/2022-15:33:17] [I] === Device Information ===
[05/17/2022-15:33:17] [I] Selected Device: Xavier
[05/17/2022-15:33:17] [I] Compute Capability: 7.2
[05/17/2022-15:33:17] [I] SMs: 8
[05/17/2022-15:33:17] [I] Compute Clock Rate: 1.377 GHz
[05/17/2022-15:33:17] [I] Device Global Memory: 15824 MiB
[05/17/2022-15:33:17] [I] Shared Memory per SM: 96 KiB
[05/17/2022-15:33:17] [I] Memory Bus Width: 256 bits (ECC disabled)
[05/17/2022-15:33:17] [I] Memory Clock Rate: 1.377 GHz
[05/17/2022-15:33:17] [I]
[05/17/2022-15:33:17] [I] TensorRT version: 8.2.1
[05/17/2022-15:33:18] [I] [TRT] [MemUsageChange] Init CUDA: CPU +362, GPU +0, now: CPU 438, GPU 2775 (MiB)
[05/17/2022-15:33:18] [I] [TRT] Loaded engine size: 57 MiB
[05/17/2022-15:33:19] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +381, now: CPU 723, GPU 3218 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +510, now: CPU 1030, GPU 3728 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +57, GPU +0, now: CPU 57, GPU 0 (MiB)
[05/17/2022-15:33:21] [I] Engine loaded in 3.96407 sec.
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 973, GPU 3672 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 973, GPU 3672 (MiB)
[05/17/2022-15:33:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 57, GPU 0 (MiB)
[05/17/2022-15:33:21] [I] Using random values for input input
[05/17/2022-15:33:21] [I] Created input binding for input with dimensions 1x3x224x224
[05/17/2022-15:33:21] [I] Using random values for output output
[05/17/2022-15:33:21] [I] Created output binding for output with dimensions 1x2048x7x7
[05/17/2022-15:33:21] [I] Starting inference
[05/17/2022-15:33:24] [W] The network timing report will not be accurate due to extra synchronizations when profiler is enabled.
[05/17/2022-15:33:24] [W] Add --separateProfileRun to profile layer timing in a separate run.
[05/17/2022-15:33:24] [I] Warmup completed 32 queries over 200 ms
[05/17/2022-15:33:24] [I] Timing trace has 480 queries over 3.00963 s
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I] === Trace details ===
[05/17/2022-15:33:24] [I] Trace averages of 10 runs:
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16177 ms - Host latency: 6.22369 ms (end to end 6.24003 ms, enqueue 6.14611 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16862 ms - Host latency: 6.23669 ms (end to end 6.26102 ms, enqueue 6.16033 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.13758 ms - Host latency: 6.19655 ms (end to end 6.21532 ms, enqueue 6.12043 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10946 ms - Host latency: 6.17657 ms (end to end 6.19315 ms, enqueue 6.10427 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16763 ms - Host latency: 6.23172 ms (end to end 6.24612 ms, enqueue 6.16299 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14086 ms - Host latency: 6.20319 ms (end to end 6.22084 ms, enqueue 6.13682 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.17618 ms - Host latency: 6.23749 ms (end to end 6.26259 ms, enqueue 6.16601 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16353 ms - Host latency: 6.23046 ms (end to end 6.25458 ms, enqueue 6.15712 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15438 ms - Host latency: 6.21342 ms (end to end 6.2404 ms, enqueue 6.14744 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16199 ms - Host latency: 6.22346 ms (end to end 6.24208 ms, enqueue 6.1566 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15498 ms - Host latency: 6.22508 ms (end to end 6.24077 ms, enqueue 6.15267 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.26512 ms - Host latency: 6.34066 ms (end to end 6.36868 ms, enqueue 6.25579 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10889 ms - Host latency: 6.16764 ms (end to end 6.18762 ms, enqueue 6.10363 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.25175 ms - Host latency: 6.33154 ms (end to end 6.35705 ms, enqueue 6.24551 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.18284 ms - Host latency: 6.2514 ms (end to end 6.27094 ms, enqueue 6.17926 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.102 ms - Host latency: 6.16346 ms (end to end 6.17585 ms, enqueue 6.09996 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.20061 ms - Host latency: 6.27068 ms (end to end 6.28948 ms, enqueue 6.19811 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14451 ms - Host latency: 6.21211 ms (end to end 6.2316 ms, enqueue 6.14548 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.21713 ms - Host latency: 6.28958 ms (end to end 6.30756 ms, enqueue 6.21417 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.09615 ms - Host latency: 6.15735 ms (end to end 6.16981 ms, enqueue 6.09597 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.19288 ms - Host latency: 6.26667 ms (end to end 6.28788 ms, enqueue 6.18798 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.21851 ms - Host latency: 6.28514 ms (end to end 6.30105 ms, enqueue 6.20518 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.19402 ms - Host latency: 6.26719 ms (end to end 6.28646 ms, enqueue 6.19226 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.16362 ms - Host latency: 6.22866 ms (end to end 6.24519 ms, enqueue 6.16112 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15455 ms - Host latency: 6.21718 ms (end to end 6.23481 ms, enqueue 6.14338 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.15861 ms - Host latency: 6.2287 ms (end to end 6.24449 ms, enqueue 6.15287 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.18752 ms - Host latency: 6.25522 ms (end to end 6.27189 ms, enqueue 6.18057 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14492 ms - Host latency: 6.21086 ms (end to end 6.22372 ms, enqueue 6.14467 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.13633 ms - Host latency: 6.2015 ms (end to end 6.21901 ms, enqueue 6.13101 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.07747 ms - Host latency: 6.13405 ms (end to end 6.15089 ms, enqueue 6.08212 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10186 ms - Host latency: 6.16541 ms (end to end 6.18066 ms, enqueue 6.10496 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.1332 ms - Host latency: 6.20254 ms (end to end 6.21599 ms, enqueue 6.13643 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.04216 ms - Host latency: 6.10061 ms (end to end 6.11697 ms, enqueue 6.05093 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.23162 ms - Host latency: 6.30044 ms (end to end 6.3147 ms, enqueue 6.2271 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.11357 ms - Host latency: 6.17 ms (end to end 6.18445 ms, enqueue 6.1085 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.24709 ms - Host latency: 6.32913 ms (end to end 6.34402 ms, enqueue 6.24473 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.12112 ms - Host latency: 6.18643 ms (end to end 6.20518 ms, enqueue 6.12134 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14211 ms - Host latency: 6.21328 ms (end to end 6.22866 ms, enqueue 6.14666 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.11426 ms - Host latency: 6.17644 ms (end to end 6.19192 ms, enqueue 6.11653 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.08694 ms - Host latency: 6.15312 ms (end to end 6.16914 ms, enqueue 6.08892 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.18696 ms - Host latency: 6.25774 ms (end to end 6.27104 ms, enqueue 6.18555 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14526 ms - Host latency: 6.2137 ms (end to end 6.2281 ms, enqueue 6.14365 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.10569 ms - Host latency: 6.1689 ms (end to end 6.18518 ms, enqueue 6.11211 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.12554 ms - Host latency: 6.18892 ms (end to end 6.20344 ms, enqueue 6.12258 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.09829 ms - Host latency: 6.15979 ms (end to end 6.17576 ms, enqueue 6.10256 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.14817 ms - Host latency: 6.21365 ms (end to end 6.2332 ms, enqueue 6.15239 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.08428 ms - Host latency: 6.15459 ms (end to end 6.1688 ms, enqueue 6.08865 ms)
[05/17/2022-15:33:24] [I] Average on 10 runs - GPU latency: 6.1167 ms - Host latency: 6.17871 ms (end to end 6.19155 ms, enqueue 6.11992 ms)
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I] === Performance summary ===
[05/17/2022-15:33:24] [I] Throughput: 159.488 qps
[05/17/2022-15:33:24] [I] Latency: min = 5.91675 ms, max = 6.77173 ms, mean = 6.21691 ms, median = 6.20105 ms, percentile(99%) = 6.46057 ms
[05/17/2022-15:33:24] [I] End-to-End Host Latency: min = 5.92798 ms, max = 6.77637 ms, mean = 6.23437 ms, median = 6.21474 ms, percentile(99%) = 6.49341 ms
[05/17/2022-15:33:24] [I] Enqueue Time: min = 6.00415 ms, max = 6.45044 ms, mean = 6.14799 ms, median = 6.13251 ms, percentile(99%) = 6.37109 ms
[05/17/2022-15:33:24] [I] H2D Latency: min = 0.019043 ms, max = 0.0725098 ms, mean = 0.0314027 ms, median = 0.0292969 ms, percentile(99%) = 0.0529785 ms
[05/17/2022-15:33:24] [I] GPU Compute Time: min = 5.86108 ms, max = 6.7251 ms, mean = 6.15082 ms, median = 6.13348 ms, percentile(99%) = 6.37897 ms
[05/17/2022-15:33:24] [I] D2H Latency: min = 0.013092 ms, max = 0.0981445 ms, mean = 0.0346815 ms, median = 0.0299072 ms, percentile(99%) = 0.0751953 ms
[05/17/2022-15:33:24] [I] Total Host Walltime: 3.00963 s
[05/17/2022-15:33:24] [I] Total GPU Compute Time: 2.95239 s
[05/17/2022-15:33:24] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/17/2022-15:33:24] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/17/2022-15:33:24] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I]
[05/17/2022-15:33:24] [I] === Profile (512 iterations ) ===
[05/17/2022-15:33:24] [I]                               Layer   Time (ms)   Avg. Time (ms)   Time %
[05/17/2022-15:33:24] [I]                        input to nvm       52.35           0.1022      1.7
[05/17/2022-15:33:24] [I]    {ForeignNode[Conv_0...Relu_118]}      137.26           0.2681      4.5
[05/17/2022-15:33:24] [I]                     output from nvm     2886.75           5.6382     93.8
[05/17/2022-15:33:24] [I]                   input copy finish        1.39           0.0027      0.0
[05/17/2022-15:33:24] [I]                  output copy finish        1.26           0.0025      0.0
[05/17/2022-15:33:24] [I]                               Total     3079.00           6.0137    100.0
[05/17/2022-15:33:24] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --loadEngine=resnet50_sim_mod_DLA_fp16.trt --useDLACore=0 --fp16 --dumpProfile
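The log itself flags two follow-ups: the `[W]` lines warn that per-layer profiling adds synchronizations that skew the timing report (suggesting `--separateProfileRun`), and that throughput may be enqueue-bound (suggesting `--useCudaGraph`). A sketch of the rerun the warnings recommend, reusing the exact engine and flags from the command above; whether CUDA graphs help on this DLA engine is not confirmed by this log:

```shell
# Rerun suggested by the two [W] lines above (a sketch, not verified on this board):
#  --separateProfileRun : collect layer timings in a second pass so the main
#                         timing trace is free of profiler synchronizations
#  --useCudaGraph       : capture enqueue work in a CUDA graph, which may help
#                         if throughput is enqueue-bound as the warning suggests
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=resnet50_sim_mod_DLA_fp16.trt \
    --useDLACore=0 --fp16 --dumpProfile \
    --separateProfileRun --useCudaGraph
```

Note that with profiling enabled here, the dominant entry is `output from nvm` at 93.8% of profiled time, so the layer table reflects DLA-to-memory output traffic more than compute; a separate profile run gives a cleaner picture.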