&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_sim_mod.onnx --saveEngine=resnet50_sim_mod_fp16_dla.trt --workspace=1000000 --useDLACore=0 --dumpProfile --fp16
[05/16/2022-14:04:43] [I] === Model Options ===
[05/16/2022-14:04:43] [I] Format: ONNX
[05/16/2022-14:04:43] [I] Model: resnet50_sim_mod.onnx
[05/16/2022-14:04:43] [I] Output:
[05/16/2022-14:04:43] [I] === Build Options ===
[05/16/2022-14:04:43] [I] Max batch: explicit batch
[05/16/2022-14:04:43] [I] Workspace: 1000000 MiB
[05/16/2022-14:04:43] [I] minTiming: 1
[05/16/2022-14:04:43] [I] avgTiming: 8
[05/16/2022-14:04:43] [I] Precision: FP32+FP16
[05/16/2022-14:04:43] [I] Calibration:
[05/16/2022-14:04:43] [I] Refit: Disabled
[05/16/2022-14:04:43] [I] Sparsity: Disabled
[05/16/2022-14:04:43] [I] Safe mode: Disabled
[05/16/2022-14:04:43] [I] DirectIO mode: Disabled
[05/16/2022-14:04:43] [I] Restricted mode: Disabled
[05/16/2022-14:04:43] [I] Save engine: resnet50_sim_mod_fp16_dla.trt
[05/16/2022-14:04:43] [I] Load engine:
[05/16/2022-14:04:43] [I] Profiling verbosity: 0
[05/16/2022-14:04:43] [I] Tactic sources: Using default tactic sources
[05/16/2022-14:04:43] [I] timingCacheMode: local
[05/16/2022-14:04:43] [I] timingCacheFile:
[05/16/2022-14:04:43] [I] Input(s)s format: fp32:CHW
[05/16/2022-14:04:43] [I] Output(s)s format: fp32:CHW
[05/16/2022-14:04:43] [I] Input build shapes: model
[05/16/2022-14:04:43] [I] Input calibration shapes: model
[05/16/2022-14:04:43] [I] === System Options ===
[05/16/2022-14:04:43] [I] Device: 0
[05/16/2022-14:04:43] [I] DLACore: 0
[05/16/2022-14:04:43] [I] Plugins:
[05/16/2022-14:04:43] [I] === Inference Options ===
[05/16/2022-14:04:43] [I] Batch: Explicit
[05/16/2022-14:04:43] [I] Input inference shapes: model
[05/16/2022-14:04:43] [I] Iterations: 10
[05/16/2022-14:04:43] [I] Duration: 3s (+ 200ms warm up)
[05/16/2022-14:04:43] [I] Sleep time: 0ms
[05/16/2022-14:04:43] [I] Idle time: 0ms
[05/16/2022-14:04:43] [I] Streams: 1
[05/16/2022-14:04:43] [I] ExposeDMA: Disabled
[05/16/2022-14:04:43] [I] Data transfers: Enabled
[05/16/2022-14:04:43] [I] Spin-wait: Disabled
[05/16/2022-14:04:43] [I] Multithreading: Disabled
[05/16/2022-14:04:43] [I] CUDA Graph: Disabled
[05/16/2022-14:04:43] [I] Separate profiling: Disabled
[05/16/2022-14:04:43] [I] Time Deserialize: Disabled
[05/16/2022-14:04:43] [I] Time Refit: Disabled
[05/16/2022-14:04:43] [I] Skip inference: Disabled
[05/16/2022-14:04:43] [I] Inputs:
[05/16/2022-14:04:43] [I] === Reporting Options ===
[05/16/2022-14:04:43] [I] Verbose: Disabled
[05/16/2022-14:04:43] [I] Averages: 10 inferences
[05/16/2022-14:04:43] [I] Percentile: 99
[05/16/2022-14:04:43] [I] Dump refittable layers:Disabled
[05/16/2022-14:04:43] [I] Dump output: Disabled
[05/16/2022-14:04:43] [I] Profile: Enabled
[05/16/2022-14:04:43] [I] Export timing to JSON file:
[05/16/2022-14:04:43] [I] Export output to JSON file:
[05/16/2022-14:04:43] [I] Export profile to JSON file:
[05/16/2022-14:04:43] [I]
[05/16/2022-14:04:43] [I] === Device Information ===
[05/16/2022-14:04:43] [I] Selected Device: Xavier
[05/16/2022-14:04:43] [I] Compute Capability: 7.2
[05/16/2022-14:04:43] [I] SMs: 8
[05/16/2022-14:04:43] [I] Compute Clock Rate: 1.377 GHz
[05/16/2022-14:04:43] [I] Device Global Memory: 15824 MiB
[05/16/2022-14:04:43] [I] Shared Memory per SM: 96 KiB
[05/16/2022-14:04:43] [I] Memory Bus Width: 256 bits (ECC disabled)
[05/16/2022-14:04:43] [I] Memory Clock Rate: 1.377 GHz
[05/16/2022-14:04:43] [I]
[05/16/2022-14:04:43] [I] TensorRT version: 8.2.1
[05/16/2022-14:04:44] [I] [TRT] [MemUsageChange] Init CUDA: CPU +362, GPU +0, now: CPU 381, GPU 7384 (MiB)
[05/16/2022-14:04:44] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 381 MiB, GPU 7384 MiB
[05/16/2022-14:04:44] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 486 MiB, GPU 7506 MiB
[05/16/2022-14:04:44] [I] Start parsing network model
[05/16/2022-14:04:44] [I] [TRT] ----------------------------------------------------------------
[05/16/2022-14:04:44] [I] [TRT] Input filename:   resnet50_sim_mod.onnx
[05/16/2022-14:04:44] [I] [TRT] ONNX IR version:  0.0.7
[05/16/2022-14:04:44] [I] [TRT] Opset version:    12
[05/16/2022-14:04:44] [I] [TRT] Producer name:
[05/16/2022-14:04:44] [I] [TRT] Producer version:
[05/16/2022-14:04:44] [I] [TRT] Domain:
[05/16/2022-14:04:44] [I] [TRT] Model version:    0
[05/16/2022-14:04:44] [I] [TRT] Doc string:
[05/16/2022-14:04:44] [I] [TRT] ----------------------------------------------------------------
[05/16/2022-14:04:44] [I] Finish parsing network model
[05/16/2022-14:04:49] [I] [TRT] ---------- Layers Running on DLA ----------
[05/16/2022-14:04:49] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Relu_118]}
[05/16/2022-14:04:49] [I] [TRT] ---------- Layers Running on GPU ----------
[05/16/2022-14:04:49] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +218, GPU +193, now: CPU 804, GPU 7895 (MiB)
[05/16/2022-14:04:50] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +308, GPU +397, now: CPU 1112, GPU 8292 (MiB)
[05/16/2022-14:04:50] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/16/2022-14:05:00] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[05/16/2022-14:05:01] [I] [TRT] Total Host Persistent Memory: 848
[05/16/2022-14:05:01] [I] [TRT] Total Device Persistent Memory: 0
[05/16/2022-14:05:01] [I] [TRT] Total Scratch Memory: 0
[05/16/2022-14:05:01] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 57 MiB, GPU 3 MiB
[05/16/2022-14:05:01] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.023745ms to assign 1 blocks to 1 nodes requiring 401408 bytes.
[05/16/2022-14:05:01] [I] [TRT] Total Activation Memory: 401408
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1202, GPU 8526 (MiB)
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1202, GPU 8526 (MiB)
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +57, GPU +0, now: CPU 57, GPU 0 (MiB)
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1201, GPU 8506 (MiB)
[05/16/2022-14:05:01] [I] [TRT] Loaded engine size: 57 MiB
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1259, GPU 8563 (MiB)
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1259, GPU 8563 (MiB)
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +57, GPU +0, now: CPU 57, GPU 0 (MiB)
[05/16/2022-14:05:01] [I] Engine built in 18.3995 sec.
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1007, GPU 8493 (MiB)
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1007, GPU 8493 (MiB)
[05/16/2022-14:05:01] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 57, GPU 0 (MiB)
[05/16/2022-14:05:01] [I] Using random values for input input
[05/16/2022-14:05:01] [I] Created input binding for input with dimensions 1x3x224x224
[05/16/2022-14:05:01] [I] Using random values for output output
[05/16/2022-14:05:01] [I] Created output binding for output with dimensions 1x2048x7x7
[05/16/2022-14:05:01] [I] Starting inference
[05/16/2022-14:05:04] [W] The network timing report will not be accurate due to extra synchronizations when profiler is enabled.
[05/16/2022-14:05:04] [W] Add --separateProfileRun to profile layer timing in a separate run.
[05/16/2022-14:05:04] [I] Warmup completed 28 queries over 200 ms
[05/16/2022-14:05:04] [I] Timing trace has 443 queries over 3.0113 s
[05/16/2022-14:05:04] [I]
[05/16/2022-14:05:04] [I] === Trace details ===
[05/16/2022-14:05:04] [I] Trace averages of 10 runs:
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.60095 ms - Host latency: 6.7933 ms (end to end 6.81645 ms, enqueue 6.65292 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.61253 ms - Host latency: 6.7887 ms (end to end 6.81218 ms, enqueue 6.6587 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.60471 ms - Host latency: 6.78934 ms (end to end 6.81396 ms, enqueue 6.65316 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.65162 ms - Host latency: 6.82603 ms (end to end 6.85786 ms, enqueue 6.70028 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.69004 ms - Host latency: 6.85826 ms (end to end 6.88353 ms, enqueue 6.73832 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.68021 ms - Host latency: 6.84971 ms (end to end 6.87103 ms, enqueue 6.72835 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.5172 ms - Host latency: 6.67108 ms (end to end 6.69028 ms, enqueue 6.57316 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.67014 ms - Host latency: 6.83824 ms (end to end 6.85875 ms, enqueue 6.70672 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.54563 ms - Host latency: 6.70219 ms (end to end 6.72347 ms, enqueue 6.60182 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.57349 ms - Host latency: 6.73435 ms (end to end 6.75483 ms, enqueue 6.62506 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.52805 ms - Host latency: 6.68517 ms (end to end 6.70396 ms, enqueue 6.58111 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.6538 ms - Host latency: 6.81616 ms (end to end 6.83609 ms, enqueue 6.69761 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.6546 ms - Host latency: 6.82239 ms (end to end 6.84305 ms, enqueue 6.69933 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.58823 ms - Host latency: 6.75159 ms (end to end 6.77368 ms, enqueue 6.6349 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.65253 ms - Host latency: 6.81373 ms (end to end 6.83363 ms, enqueue 6.69818 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.56354 ms - Host latency: 6.7248 ms (end to end 6.74287 ms, enqueue 6.62063 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.60122 ms - Host latency: 6.76779 ms (end to end 6.78663 ms, enqueue 6.6528 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.55004 ms - Host latency: 6.70066 ms (end to end 6.72362 ms, enqueue 6.60266 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.63007 ms - Host latency: 6.79489 ms (end to end 6.81493 ms, enqueue 6.68368 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.58262 ms - Host latency: 6.73569 ms (end to end 6.75347 ms, enqueue 6.63296 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.62531 ms - Host latency: 6.78098 ms (end to end 6.80038 ms, enqueue 6.67594 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.55659 ms - Host latency: 6.71296 ms (end to end 6.72992 ms, enqueue 6.60928 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.71295 ms - Host latency: 6.86343 ms (end to end 6.88108 ms, enqueue 6.76086 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.56323 ms - Host latency: 6.71904 ms (end to end 6.73936 ms, enqueue 6.61829 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.57831 ms - Host latency: 6.73324 ms (end to end 6.75173 ms, enqueue 6.62368 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.53754 ms - Host latency: 6.6918 ms (end to end 6.71027 ms, enqueue 6.5881 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.54022 ms - Host latency: 6.69572 ms (end to end 6.71373 ms, enqueue 6.59377 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.49011 ms - Host latency: 6.63799 ms (end to end 6.65461 ms, enqueue 6.54751 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.48867 ms - Host latency: 6.63464 ms (end to end 6.64978 ms, enqueue 6.54744 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.5269 ms - Host latency: 6.67793 ms (end to end 6.69492 ms, enqueue 6.58223 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.6469 ms - Host latency: 6.80781 ms (end to end 6.82666 ms, enqueue 6.69368 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.55908 ms - Host latency: 6.7177 ms (end to end 6.73652 ms, enqueue 6.60911 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.57036 ms - Host latency: 6.72583 ms (end to end 6.74297 ms, enqueue 6.61794 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.51675 ms - Host latency: 6.66685 ms (end to end 6.68665 ms, enqueue 6.57625 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.54316 ms - Host latency: 6.69363 ms (end to end 6.71135 ms, enqueue 6.59893 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.55393 ms - Host latency: 6.70659 ms (end to end 6.72437 ms, enqueue 6.61045 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.59653 ms - Host latency: 6.7509 ms (end to end 6.77141 ms, enqueue 6.64719 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.5853 ms - Host latency: 6.74182 ms (end to end 6.76113 ms, enqueue 6.64111 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.67732 ms - Host latency: 6.84563 ms (end to end 6.8647 ms, enqueue 6.72644 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.50918 ms - Host latency: 6.66367 ms (end to end 6.68355 ms, enqueue 6.56799 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.7 ms - Host latency: 6.84795 ms (end to end 6.8666 ms, enqueue 6.74104 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.61873 ms - Host latency: 6.77366 ms (end to end 6.79204 ms, enqueue 6.67278 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.54226 ms - Host latency: 6.70085 ms (end to end 6.71858 ms, enqueue 6.59861 ms)
[05/16/2022-14:05:04] [I] Average on 10 runs - GPU latency: 6.52415 ms - Host latency: 6.674 ms (end to end 6.69038 ms, enqueue 6.57092 ms)
[05/16/2022-14:05:04] [I]
[05/16/2022-14:05:04] [I] === Performance summary ===
[05/16/2022-14:05:04] [I] Throughput: 147.112 qps
[05/16/2022-14:05:04] [I] Latency: min = 6.47925 ms, max = 8.1394 ms, mean = 6.74782 ms, median = 6.72162 ms, percentile(99%) = 7.05273 ms
[05/16/2022-14:05:04] [I] End-to-End Host Latency: min = 6.49268 ms, max = 8.15857 ms, mean = 6.76754 ms, median = 6.74536 ms, percentile(99%) = 7.07922 ms
[05/16/2022-14:05:04] [I] Enqueue Time: min = 6.45605 ms, max = 8.02771 ms, mean = 6.63959 ms, median = 6.61646 ms, percentile(99%) = 6.93604 ms
[05/16/2022-14:05:04] [I] H2D Latency: min = 0.0771484 ms, max = 0.137695 ms, mean = 0.0874968 ms, median = 0.0848389 ms, percentile(99%) = 0.128052 ms
[05/16/2022-14:05:04] [I] GPU Compute Time: min = 6.33447 ms, max = 7.98218 ms, mean = 6.58845 ms, median = 6.56589 ms, percentile(99%) = 6.88953 ms
[05/16/2022-14:05:04] [I] D2H Latency: min = 0.0515137 ms, max = 0.175903 ms, mean = 0.0718778 ms, median = 0.0684814 ms, percentile(99%) = 0.12439 ms
[05/16/2022-14:05:04] [I] Total Host Walltime: 3.0113 s
[05/16/2022-14:05:04] [I] Total GPU Compute Time: 2.91868 s
[05/16/2022-14:05:04] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/16/2022-14:05:04] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/16/2022-14:05:04] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/16/2022-14:05:04] [I]
[05/16/2022-14:05:04] [I]
[05/16/2022-14:05:04] [I] === Profile (471 iterations ) ===
[05/16/2022-14:05:04] [I] Layer                              Time (ms)   Avg. Time (ms)   Time %
[05/16/2022-14:05:04] [I] input to nvm                           54.03           0.1147      1.8
[05/16/2022-14:05:04] [I] {ForeignNode[Conv_0...Relu_118]}      135.44           0.2876      4.5
[05/16/2022-14:05:04] [I] output from nvm                      2789.26           5.9220     93.4
[05/16/2022-14:05:04] [I] input copy finish                       4.55           0.0097      0.2
[05/16/2022-14:05:04] [I] output copy finish                      4.30           0.0091      0.1
[05/16/2022-14:05:04] [I] Total                                2987.58           6.3431    100.0
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=resnet50_sim_mod.onnx --saveEngine=resnet50_sim_mod_fp16_dla.trt --workspace=1000000 --useDLACore=0 --dumpProfile --fp16
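The log itself flags two issues with this run: the per-layer profile is inflated by extra synchronizations (the `[W] Add --separateProfileRun` warning), and throughput may be enqueue-bound (the `--useCudaGraph` hint in the performance summary). A follow-up invocation that acts on both warnings might look like the sketch below; it reuses the file names from the command at the top of the log, and only adds the two flags the log recommends.

```shell
# Sketch of a re-run based on the warnings in the log above.
# --separateProfileRun: collect layer timings in a second pass so the
#   main timing report is not skewed by profiler synchronizations.
# --useCudaGraph: capture inference in a CUDA graph to cut enqueue overhead.
/usr/src/tensorrt/bin/trtexec \
  --loadEngine=resnet50_sim_mod_fp16_dla.trt \
  --useDLACore=0 \
  --fp16 \
  --dumpProfile \
  --separateProfileRun \
  --useCudaGraph
```

Loading the already-saved engine with `--loadEngine` skips the ~18 s build step; the profile table should then show whether `output from nvm` (the DLA-to-GPU output copy, 93.4 % of time here) still dominates once profiling overhead is isolated.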