&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=./onnx_models/decoder_lm_xsum_0129.onnx --saveEngine=./onnx_models/trt/decoder_xsum_0129.trt --minShapes=input_ids:1x1,encoder_hidden_states:1x1x1024 --optShapes=input_ids:1x256,encoder_hidden_states:1x256x1024 --maxShapes=input_ids:1x1024,encoder_hidden_states:1x1024x1024 --workspace=4000
[01/30/2022-19:17:12] [I] === Model Options ===
[01/30/2022-19:17:12] [I] Format: ONNX
[01/30/2022-19:17:12] [I] Model: ./onnx_models/decoder_lm_xsum_0129.onnx
[01/30/2022-19:17:12] [I] Output:
[01/30/2022-19:17:12] [I] === Build Options ===
[01/30/2022-19:17:12] [I] Max batch: explicit batch
[01/30/2022-19:17:12] [I] Workspace: 4000 MiB
[01/30/2022-19:17:12] [I] minTiming: 1
[01/30/2022-19:17:12] [I] avgTiming: 8
[01/30/2022-19:17:12] [I] Precision: FP32
[01/30/2022-19:17:12] [I] Calibration:
[01/30/2022-19:17:12] [I] Refit: Disabled
[01/30/2022-19:17:12] [I] Sparsity: Disabled
[01/30/2022-19:17:12] [I] Safe mode: Disabled
[01/30/2022-19:17:12] [I] DirectIO mode: Disabled
[01/30/2022-19:17:12] [I] Restricted mode: Disabled
[01/30/2022-19:17:12] [I] Save engine: ./onnx_models/trt/decoder_xsum_0129.trt
[01/30/2022-19:17:12] [I] Load engine:
[01/30/2022-19:17:12] [I] Profiling verbosity: 0
[01/30/2022-19:17:12] [I] Tactic sources: Using default tactic sources
[01/30/2022-19:17:12] [I] timingCacheMode: local
[01/30/2022-19:17:12] [I] timingCacheFile:
[01/30/2022-19:17:12] [I] Input(s)s format: fp32:CHW
[01/30/2022-19:17:12] [I] Output(s)s format: fp32:CHW
[01/30/2022-19:17:12] [I] Input build shape: encoder_hidden_states=1x1x1024+1x256x1024+1x1024x1024
[01/30/2022-19:17:12] [I] Input build shape: input_ids=1x1+1x256+1x1024
[01/30/2022-19:17:12] [I] Input calibration shapes: model
[01/30/2022-19:17:12] [I] === System Options ===
[01/30/2022-19:17:12] [I] Device: 0
[01/30/2022-19:17:12] [I] DLACore:
[01/30/2022-19:17:12] [I] Plugins:
[01/30/2022-19:17:12] [I] === Inference Options ===
[01/30/2022-19:17:12] [I] Batch: Explicit
[01/30/2022-19:17:12] [I] Input inference shape: input_ids=1x256
[01/30/2022-19:17:12] [I] Input inference shape: encoder_hidden_states=1x256x1024
[01/30/2022-19:17:12] [I] Iterations: 10
[01/30/2022-19:17:12] [I] Duration: 3s (+ 200ms warm up)
[01/30/2022-19:17:12] [I] Sleep time: 0ms
[01/30/2022-19:17:12] [I] Idle time: 0ms
[01/30/2022-19:17:12] [I] Streams: 1
[01/30/2022-19:17:12] [I] ExposeDMA: Disabled
[01/30/2022-19:17:12] [I] Data transfers: Enabled
[01/30/2022-19:17:12] [I] Spin-wait: Disabled
[01/30/2022-19:17:12] [I] Multithreading: Disabled
[01/30/2022-19:17:12] [I] CUDA Graph: Disabled
[01/30/2022-19:17:12] [I] Separate profiling: Disabled
[01/30/2022-19:17:12] [I] Time Deserialize: Disabled
[01/30/2022-19:17:12] [I] Time Refit: Disabled
[01/30/2022-19:17:12] [I] Skip inference: Disabled
[01/30/2022-19:17:12] [I] Inputs:
[01/30/2022-19:17:12] [I] === Reporting Options ===
[01/30/2022-19:17:12] [I] Verbose: Disabled
[01/30/2022-19:17:12] [I] Averages: 10 inferences
[01/30/2022-19:17:12] [I] Percentile: 99
[01/30/2022-19:17:12] [I] Dump refittable layers: Disabled
[01/30/2022-19:17:12] [I] Dump output: Disabled
[01/30/2022-19:17:12] [I] Profile: Disabled
[01/30/2022-19:17:12] [I] Export timing to JSON file:
[01/30/2022-19:17:12] [I] Export output to JSON file:
[01/30/2022-19:17:12] [I] Export profile to JSON file:
[01/30/2022-19:17:12] [I]
[01/30/2022-19:17:13] [I] === Device Information ===
[01/30/2022-19:17:13] [I] Selected Device: Tesla V100-SXM2-16GB
[01/30/2022-19:17:13] [I] Compute Capability: 7.0
[01/30/2022-19:17:13] [I] SMs: 80
[01/30/2022-19:17:13] [I] Compute Clock Rate: 1.53 GHz
[01/30/2022-19:17:13] [I] Device Global Memory: 16160 MiB
[01/30/2022-19:17:13] [I] Shared Memory per SM: 96 KiB
[01/30/2022-19:17:13] [I] Memory Bus Width: 4096 bits (ECC enabled)
[01/30/2022-19:17:13] [I] Memory Clock Rate: 0.877 GHz
[01/30/2022-19:17:13] [I]
[01/30/2022-19:17:13] [I] TensorRT version: 8.2.1
[01/30/2022-19:17:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +260, GPU +0, now: CPU 272, GPU 491 (MiB)
[01/30/2022-19:17:15] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 272 MiB, GPU 491 MiB
[01/30/2022-19:17:15] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 383 MiB, GPU 515 MiB
[01/30/2022-19:17:15] [I] Start parsing network model
[libprotobuf WARNING /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/build/third_party.protobuf/src/third_party.protobuf/src/google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/build/third_party.protobuf/src/third_party.protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1864920177
[01/30/2022-19:17:17] [I] [TRT] ----------------------------------------------------------------
[01/30/2022-19:17:17] [I] [TRT] Input filename: ./onnx_models/decoder_lm_xsum_0129.onnx
[01/30/2022-19:17:17] [I] [TRT] ONNX IR version: 0.0.7
[01/30/2022-19:17:17] [I] [TRT] Opset version: 12
[01/30/2022-19:17:17] [I] [TRT] Producer name: pytorch
[01/30/2022-19:17:17] [I] [TRT] Producer version: 1.10
[01/30/2022-19:17:17] [I] [TRT] Domain:
[01/30/2022-19:17:17] [I] [TRT] Model version: 0
[01/30/2022-19:17:17] [I] [TRT] Doc string:
[01/30/2022-19:17:17] [I] [TRT] ----------------------------------------------------------------
[libprotobuf WARNING /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/build/third_party.protobuf/src/third_party.protobuf/src/google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/build/third_party.protobuf/src/third_party.protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1864920177
[01/30/2022-19:17:21] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
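The INT64 warning above (and the clamp warnings that follow in the log) comes from PyTorch exporting index tensors and constants as INT64, which TensorRT casts down to INT32 at parse time. If you want to see which tensors are affected before building, a minimal sketch using the onnx Python package (assuming it is installed; the model path is taken from the trtexec command above):

import numpy as np
import onnx
from onnx import numpy_helper

# Load the exported decoder model (same path passed to --onnx above).
model = onnx.load("./onnx_models/decoder_lm_xsum_0129.onnx")

i32 = np.iinfo(np.int32)
for init in model.graph.initializer:
    if init.data_type == onnx.TensorProto.INT64:
        w = numpy_helper.to_array(init)
        # Values outside [-2**31, 2**31 - 1] are what trtexec reports as clamped.
        clamped = int(np.count_nonzero((w < i32.min) | (w > i32.max)))
        print(f"{init.name}: dims={list(init.dims)}, out-of-range values={clamped}")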
[01/30/2022-19:17:21] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[01/30/2022-19:17:21] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[01/30/2022-19:25:13] [W] [TRT] Output type must be INT32 for shape outputs
[01/30/2022-19:25:13] [W] [TRT] Output type must be INT32 for shape outputs
[01/30/2022-19:25:13] [I] Finish parsing network model
[01/30/2022-19:25:20] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +376, GPU +176, now: CPU 2566, GPU 697 (MiB)
[01/30/2022-19:25:20] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +116, GPU +52, now: CPU 2682, GPU 749 (MiB)
[01/30/2022-19:25:20] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/30/2022-19:25:20] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[01/30/2022-19:25:20] [W] [TRT]  (# 1 (SHAPE encoder_hidden_states))
[01/30/2022-19:25:20] [W] [TRT]  (# 1 (SHAPE input_ids))
[01/30/2022-19:49:31] [I] [TRT] Detected 2 inputs and 2 output network tensors.
[01/30/2022-19:49:31] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[01/30/2022-19:49:31] [W] [TRT]  (# 1 (SHAPE encoder_hidden_states))
[01/30/2022-19:49:31] [W] [TRT]  (# 1 (SHAPE input_ids))
[01/30/2022-19:50:11] [I] [TRT] Total Host Persistent Memory: 304
[01/30/2022-19:50:11] [I] [TRT] Total Device Persistent Memory: 0
[01/30/2022-19:50:11] [I] [TRT] Total Scratch Memory: 695640064
[01/30/2022-19:50:11] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 2450 MiB
[01/30/2022-19:50:11] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.074444ms to assign 6 blocks to 7 nodes requiring 696033792 bytes.
[01/30/2022-19:50:11] [I] [TRT] Total Activation Memory: 696033792
[01/30/2022-19:50:11] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4469, GPU 2817 (MiB)
[01/30/2022-19:50:11] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4469, GPU 2825 (MiB)
[01/30/2022-19:50:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +2048, now: CPU 0, GPU 2048 (MiB)
[01/30/2022-19:50:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 6244, GPU 743 (MiB)
[01/30/2022-19:50:14] [I] [TRT] Loaded engine size: 3564 MiB
[01/30/2022-19:50:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 8030, GPU 2531 (MiB)
[01/30/2022-19:50:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 8031, GPU 2539 (MiB)
[01/30/2022-19:50:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1777, now: CPU 0, GPU 1777 (MiB)
[01/30/2022-19:50:36] [I] Engine built in 2002.49 sec.
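The engine is now serialized to ./onnx_models/trt/decoder_xsum_0129.trt, so later runs can skip the ~33-minute build and drive it with any shape inside the --minShapes/--maxShapes profile. Before trtexec's own timing pass below, here is a minimal deserialization-and-inference sketch, assuming TensorRT 8.2's Python bindings plus numpy and pycuda are installed (binding names and shapes are taken from the log; real inputs would be copied into the buffers before execution):

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine that trtexec just saved.
with open("./onnx_models/trt/decoder_xsum_0129.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Pick concrete shapes inside the min/max profile; these match --optShapes.
context.set_binding_shape(engine.get_binding_index("input_ids"), (1, 256))
context.set_binding_shape(engine.get_binding_index("encoder_hidden_states"), (1, 256, 1024))

# Allocate one device buffer per binding (2 inputs + 2 outputs in this engine).
bindings = []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    size = trt.volume(context.get_binding_shape(i)) * np.dtype(dtype).itemsize
    bindings.append(cuda.mem_alloc(size))

# Inputs would normally be copied in with cuda.memcpy_htod before this call.
context.execute_v2([int(b) for b in bindings])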
[01/30/2022-19:50:36] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2566, GPU 2521 (MiB)
[01/30/2022-19:50:36] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2567, GPU 2529 (MiB)
[01/30/2022-19:50:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +664, now: CPU 0, GPU 2441 (MiB)
[01/30/2022-19:50:46] [I] Using random values for input input_ids
[01/30/2022-19:50:46] [I] Created input binding for input_ids with dimensions 1x256
[01/30/2022-19:50:46] [I] Using random values for input encoder_hidden_states
[01/30/2022-19:50:46] [I] Created input binding for encoder_hidden_states with dimensions 1x256x1024
[01/30/2022-19:50:46] [I] Using random values for output log_softmax
[01/30/2022-19:50:46] [I] Created output binding for log_softmax with dimensions 1x5
[01/30/2022-19:50:46] [I] Using random values for output indices
[01/30/2022-19:50:46] [I] Created output binding for indices with dimensions 1x5
[01/30/2022-19:50:46] [I] Starting inference
[01/30/2022-19:50:49] [I] Warmup completed 11 queries over 200 ms
[01/30/2022-19:50:49] [I] Timing trace has 166 queries over 3.04255 s
[01/30/2022-19:50:49] [I]
[01/30/2022-19:50:49] [I] === Trace details ===
[01/30/2022-19:50:49] [I] Trace averages of 10 runs:
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.0577 ms - Host latency: 18.1715 ms (end to end 18.1814 ms, enqueue 18.1352 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.113 ms - Host latency: 18.2267 ms (end to end 18.2377 ms, enqueue 18.188 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.2543 ms - Host latency: 18.3677 ms (end to end 18.3772 ms, enqueue 18.329 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.1894 ms - Host latency: 18.306 ms (end to end 18.3164 ms, enqueue 18.2637 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.0986 ms - Host latency: 18.2125 ms (end to end 18.2238 ms, enqueue 18.176 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.1058 ms - Host latency: 18.2197 ms (end to end 18.2294 ms, enqueue 18.1842 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.102 ms - Host latency: 18.2153 ms (end to end 18.2243 ms, enqueue 18.1766 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.2453 ms - Host latency: 18.3612 ms (end to end 18.3709 ms, enqueue 18.3231 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.3531 ms - Host latency: 18.4681 ms (end to end 18.4782 ms, enqueue 18.4298 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.1252 ms - Host latency: 18.2395 ms (end to end 18.2493 ms, enqueue 18.2017 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.2229 ms - Host latency: 18.3372 ms (end to end 18.347 ms, enqueue 18.3002 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.1404 ms - Host latency: 18.2548 ms (end to end 18.2649 ms, enqueue 18.2169 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.0748 ms - Host latency: 18.1887 ms (end to end 18.1983 ms, enqueue 18.1512 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.1891 ms - Host latency: 18.3029 ms (end to end 18.3137 ms, enqueue 18.2645 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.3001 ms - Host latency: 18.4143 ms (end to end 18.4241 ms, enqueue 18.376 ms)
[01/30/2022-19:50:49] [I] Average on 10 runs - GPU latency: 18.3034 ms - Host latency: 18.4172 ms (end to end 18.427 ms, enqueue 18.3782 ms)
[01/30/2022-19:50:49] [I]
[01/30/2022-19:50:49] [I] === Performance summary ===
[01/30/2022-19:50:49] [I] Throughput: 54.5595 qps
[01/30/2022-19:50:49] [I] Latency: min = 18.0961 ms, max = 18.5746 ms, mean = 18.3004 ms, median = 18.2987 ms, percentile(99%) = 18.5564 ms
[01/30/2022-19:50:49] [I] End-to-End Host Latency: min = 18.1052 ms, max = 18.5831 ms, mean = 18.3104 ms, median = 18.3102 ms, percentile(99%) = 18.5717 ms
[01/30/2022-19:50:49] [I] Enqueue Time: min = 18.0611 ms, max = 18.5327 ms, mean = 18.2623 ms, median = 18.2607 ms, percentile(99%) = 18.5215 ms
[01/30/2022-19:50:49] [I] H2D Latency: min = 0.105103 ms, max = 0.134399 ms, mean = 0.106851 ms, median = 0.106628 ms, percentile(99%) = 0.114502 ms
[01/30/2022-19:50:49] [I] GPU Compute Time: min = 17.9825 ms, max = 18.4597 ms, mean = 18.1861 ms, median = 18.1837 ms, percentile(99%) = 18.4423 ms
[01/30/2022-19:50:49] [I] D2H Latency: min = 0.0065918 ms, max = 0.0241699 ms, mean = 0.00739068 ms, median = 0.00732422 ms, percentile(99%) = 0.0115356 ms
[01/30/2022-19:50:49] [I] Total Host Walltime: 3.04255 s
[01/30/2022-19:50:49] [I] Total GPU Compute Time: 3.01889 s
[01/30/2022-19:50:49] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[01/30/2022-19:50:49] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[01/30/2022-19:50:49] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/30/2022-19:50:49] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=./onnx_models/decoder_lm_xsum_0129.onnx --saveEngine=./onnx_models/trt/decoder_xsum_0129.trt --minShapes=input_ids:1x1,encoder_hidden_states:1x1x1024 --optShapes=input_ids:1x256,encoder_hidden_states:1x256x1024 --maxShapes=input_ids:1x1024,encoder_hidden_states:1x1024x1024 --workspace=4000
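The closing warning is worth acting on: mean Enqueue Time (18.2623 ms) is nearly identical to mean GPU Compute Time (18.1861 ms), so throughput is likely bound by host-side enqueue rather than by the GPU. One way to test trtexec's own suggestion is to re-run just the timing pass against the saved engine with CUDA graphs enabled (untested here; --loadEngine and --shapes avoid the ~33-minute rebuild):

trtexec --loadEngine=./onnx_models/trt/decoder_xsum_0129.trt \
        --shapes=input_ids:1x256,encoder_hidden_states:1x256x1024 \
        --useCudaGraph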