srun: job 7619818 queued and waiting for resources
srun: job 7619818 has been allocated resources
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /home/storage08/yanzhao/tools/TensorRT-8.0.1.6/bin/trtexec --verbose --loadEngine=init.trt --shapes=input:4x19x80,input_lens:4,subsampling_cache:4x40x384,elayers_output_cache:12x4x40x384,conformer_cnn_cache:12x4x384x14,masks_cache:4x1x40,offset:4x1 --loadInputs=input:input.bin,input_lens:input_lens.bin,subsampling_cache:subsampling_cache.bin,elayers_output_cache:elayers_output_cache.bin,conformer_cnn_cache:conformer_cnn_cache.bin,masks_cache:masks_cache.bin,offset:offset.bin --exportOutput=trtinfer-init.trt.json
[02/28/2022-12:02:18] [I] === Model Options ===
[02/28/2022-12:02:18] [I] Format: *
[02/28/2022-12:02:18] [I] Model:
[02/28/2022-12:02:18] [I] Output:
[02/28/2022-12:02:18] [I] === Build Options ===
[02/28/2022-12:02:18] [I] Max batch: explicit
[02/28/2022-12:02:18] [I] Workspace: 16 MiB
[02/28/2022-12:02:18] [I] minTiming: 1
[02/28/2022-12:02:18] [I] avgTiming: 8
[02/28/2022-12:02:18] [I] Precision: FP32
[02/28/2022-12:02:18] [I] Calibration:
[02/28/2022-12:02:18] [I] Refit: Disabled
[02/28/2022-12:02:18] [I] Sparsity: Disabled
[02/28/2022-12:02:18] [I] Safe mode: Disabled
[02/28/2022-12:02:18] [I] Restricted mode: Disabled
[02/28/2022-12:02:18] [I] Save engine:
[02/28/2022-12:02:18] [I] Load engine: init.trt
[02/28/2022-12:02:18] [I] NVTX verbosity: 0
[02/28/2022-12:02:18] [I] Tactic sources: Using default tactic sources
[02/28/2022-12:02:18] [I] timingCacheMode: local
[02/28/2022-12:02:18] [I] timingCacheFile:
[02/28/2022-12:02:18] [I] Input(s)s format: fp32:CHW
[02/28/2022-12:02:18] [I] Output(s)s format: fp32:CHW
[02/28/2022-12:02:18] [I] Input build shape: subsampling_cache=4x40x384+4x40x384+4x40x384
[02/28/2022-12:02:18] [I] Input build shape: input=4x19x80+4x19x80+4x19x80
[02/28/2022-12:02:18] [I] Input build shape: elayers_output_cache=12x4x40x384+12x4x40x384+12x4x40x384
[02/28/2022-12:02:18] [I] Input build shape: offset=4x1+4x1+4x1
[02/28/2022-12:02:18] [I] Input build shape: conformer_cnn_cache=12x4x384x14+12x4x384x14+12x4x384x14
[02/28/2022-12:02:18] [I] Input build shape: masks_cache=4x1x40+4x1x40+4x1x40
[02/28/2022-12:02:18] [I] Input build shape: input_lens=4+4+4
[02/28/2022-12:02:18] [I] Input calibration shapes: model
[02/28/2022-12:02:18] [I] === System Options ===
[02/28/2022-12:02:18] [I] Device: 0
[02/28/2022-12:02:18] [I] DLACore:
[02/28/2022-12:02:18] [I] Plugins:
[02/28/2022-12:02:18] [I] === Inference Options ===
[02/28/2022-12:02:18] [I] Batch: Explicit
[02/28/2022-12:02:18] [I] Input inference shape: offset=4x1
[02/28/2022-12:02:18] [I] Input inference shape: masks_cache=4x1x40
[02/28/2022-12:02:18] [I] Input inference shape: conformer_cnn_cache=12x4x384x14
[02/28/2022-12:02:18] [I] Input inference shape: input_lens=4
[02/28/2022-12:02:18] [I] Input inference shape: elayers_output_cache=12x4x40x384
[02/28/2022-12:02:18] [I] Input inference shape: input=4x19x80
[02/28/2022-12:02:18] [I] Input inference shape: subsampling_cache=4x40x384
[02/28/2022-12:02:18] [I] Iterations: 10
[02/28/2022-12:02:18] [I] Duration: 3s (+ 200ms warm up)
[02/28/2022-12:02:18] [I] Sleep time: 0ms
[02/28/2022-12:02:18] [I] Streams: 1
[02/28/2022-12:02:18] [I] ExposeDMA: Disabled
[02/28/2022-12:02:18] [I] Data transfers: Enabled
[02/28/2022-12:02:18] [I] Spin-wait: Disabled
[02/28/2022-12:02:18] [I] Multithreading: Disabled
[02/28/2022-12:02:18] [I] CUDA Graph: Disabled
[02/28/2022-12:02:18] [I] Separate profiling: Disabled
[02/28/2022-12:02:18] [I] Time Deserialize: Disabled
[02/28/2022-12:02:18] [I] Time Refit: Disabled
[02/28/2022-12:02:18] [I] Skip inference: Disabled
[02/28/2022-12:02:18] [I] Inputs:
[02/28/2022-12:02:18] [I] offset<-offset.bin
[02/28/2022-12:02:18] [I] masks_cache<-masks_cache.bin
[02/28/2022-12:02:18] [I] conformer_cnn_cache<-conformer_cnn_cache.bin
[02/28/2022-12:02:18] [I] input_lens<-input_lens.bin
[02/28/2022-12:02:18] [I] elayers_output_cache<-elayers_output_cache.bin
[02/28/2022-12:02:18] [I] input<-input.bin
[02/28/2022-12:02:18] [I] subsampling_cache<-subsampling_cache.bin
[02/28/2022-12:02:18] [I] === Reporting Options ===
[02/28/2022-12:02:18] [I] Verbose: Enabled
[02/28/2022-12:02:18] [I] Averages: 10 inferences
[02/28/2022-12:02:18] [I] Percentile: 99
[02/28/2022-12:02:18] [I] Dump refittable layers:Disabled
[02/28/2022-12:02:18] [I] Dump output: Disabled
[02/28/2022-12:02:18] [I] Profile: Disabled
[02/28/2022-12:02:18] [I] Export timing to JSON file:
[02/28/2022-12:02:18] [I] Export output to JSON file: trtinfer-init.trt.json
[02/28/2022-12:02:18] [I] Export profile to JSON file:
[02/28/2022-12:02:18] [I]
[02/28/2022-12:02:18] [I] === Device Information ===
[02/28/2022-12:02:18] [I] Selected Device: Tesla V100-PCIE-32GB
[02/28/2022-12:02:18] [I] Compute Capability: 7.0
[02/28/2022-12:02:18] [I] SMs: 80
[02/28/2022-12:02:18] [I] Compute Clock Rate: 1.38 GHz
[02/28/2022-12:02:18] [I] Device Global Memory: 32510 MiB
[02/28/2022-12:02:18] [I] Shared Memory per SM: 96 KiB
[02/28/2022-12:02:18] [I] Memory Bus Width: 4096 bits (ECC enabled)
[02/28/2022-12:02:18] [I] Memory Clock Rate: 0.877 GHz
[02/28/2022-12:02:18] [I]
[02/28/2022-12:02:18] [I] TensorRT version: 8001
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::Proposal version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::Split version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[02/28/2022-12:02:18] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[02/28/2022-12:02:20] [I] [TRT] [MemUsageChange] Init CUDA: CPU +276, GPU +0, now: CPU 527, GPU 507 (MiB)
[02/28/2022-12:02:20] [I] [TRT] Loaded engine size: 265 MB
[02/28/2022-12:02:20] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 527 MiB, GPU 507 MiB
[02/28/2022-12:02:22] [V] [TRT] Using cublasLt a tactic source
[02/28/2022-12:02:22] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +355, GPU +160, now: CPU 912, GPU 905 (MiB)
[02/28/2022-12:02:22] [V] [TRT] Using cuDNN as a tactic source
[02/28/2022-12:02:23] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +271, GPU +132, now: CPU 1183, GPU 1037 (MiB)
[02/28/2022-12:02:23] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
[02/28/2022-12:02:23] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1182, GPU 1019 (MiB)
[02/28/2022-12:02:23] [V] [TRT] Deserialization required 2814495 microseconds.
[02/28/2022-12:02:23] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1184 MiB, GPU 1019 MiB
[02/28/2022-12:02:23] [I] Engine loaded in 5.01878 sec.
[02/28/2022-12:02:23] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 885 MiB, GPU 1019 MiB
[02/28/2022-12:02:23] [V] [TRT] Using cublasLt a tactic source
[02/28/2022-12:02:23] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 918, GPU 1029 (MiB)
[02/28/2022-12:02:23] [V] [TRT] Using cuDNN as a tactic source
[02/28/2022-12:02:23] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 918, GPU 1037 (MiB)
[02/28/2022-12:02:23] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
[02/28/2022-12:02:23] [V] [TRT] Total per-runner device memory is 19527168
[02/28/2022-12:02:23] [V] [TRT] Total per-runner host memory is 131920
[02/28/2022-12:02:23] [V] [TRT] Allocated activation device memory of size 12649472
[02/28/2022-12:02:23] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 304 bytes at 0x7f70ceabd700.
[02/28/2022-12:02:23] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:23] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:23] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU (data-constants) 8 bytes at 0x7f70ceabd900.
[02/28/2022-12:02:24] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1106 MiB, GPU 1147 MiB
[02/28/2022-12:02:24] [I] Created input binding for input with dimensions 4x19x80
[02/28/2022-12:02:24] [I] Created input binding for input_lens with dimensions 4
[02/28/2022-12:02:24] [I] Created input binding for subsampling_cache with dimensions 4x40x384
[02/28/2022-12:02:24] [I] Created input binding for elayers_output_cache with dimensions 12x4x40x384
[02/28/2022-12:02:24] [I] Created input binding for conformer_cnn_cache with dimensions 12x4x384x14
[02/28/2022-12:02:24] [I] Created input binding for masks_cache with dimensions 4x1x40
[02/28/2022-12:02:24] [I] Created input binding for offset with dimensions 4x1
[02/28/2022-12:02:24] [I] Created output binding for r_subsampling_cache with dimensions 4x40x384
[02/28/2022-12:02:24] [I] Created output binding for r_mask_cache with dimensions 4x1x40
[02/28/2022-12:02:24] [I] Created output binding for r_elayers_output_cache with dimensions 12x4x40x384
[02/28/2022-12:02:24] [I] Created output binding for r_conformer_cnn_cache with dimensions 12x4x384x14
[02/28/2022-12:02:24] [I] Created output binding for topk_idx with dimensions 4x4x7373
[02/28/2022-12:02:24] [I] Starting inference
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:24] [V] [TRT] myelinAllocCb allocated GPU 1188864 bytes at 0x7f70fc000000.
[02/28/2022-12:02:27] [I] Warmup completed 1 queries over 200 ms
[02/28/2022-12:02:27] [I] Timing trace has 302 queries over 2.15169 s
[02/28/2022-12:02:27] [I]
[02/28/2022-12:02:27] [I] === Trace details ===
[02/28/2022-12:02:27] [I] Trace averages of 10 runs:
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 7.30028 ms - Host latency: 8.15863 ms (end to end 8.17998 ms, enqueue 7.56901 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 7.2183 ms - Host latency: 8.06896 ms (end to end 8.08929 ms, enqueue 7.45851 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.91166 ms - Host latency: 7.76505 ms (end to end 7.78427 ms, enqueue 7.17786 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.75851 ms - Host latency: 7.61512 ms (end to end 7.63713 ms, enqueue 7.0468 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.74225 ms - Host latency: 7.6 ms (end to end 7.61842 ms, enqueue 7.03091 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.6733 ms - Host latency: 7.5288 ms (end to end 7.54813 ms, enqueue 6.95182 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.90155 ms - Host latency: 7.75646 ms (end to end 7.77716 ms, enqueue 7.18082 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.77377 ms - Host latency: 7.6324 ms (end to end 7.65105 ms, enqueue 7.07151 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.64618 ms - Host latency: 7.50522 ms (end to end 7.52563 ms, enqueue 6.92916 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.8139 ms - Host latency: 7.67201 ms (end to end 7.69208 ms, enqueue 7.10083 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.73319 ms - Host latency: 7.59465 ms (end to end 7.61615 ms, enqueue 7.04513 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.76937 ms - Host latency: 7.62935 ms (end to end 7.64938 ms, enqueue 7.06959 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.70219 ms - Host latency: 7.56111 ms (end to end 7.58243 ms, enqueue 7.00225 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.71171 ms - Host latency: 7.57076 ms (end to end 7.59119 ms, enqueue 7.00294 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.92336 ms - Host latency: 7.78335 ms (end to end 7.80222 ms, enqueue 7.21394 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.64031 ms - Host latency: 7.50054 ms (end to end 7.52 ms, enqueue 6.91873 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.65552 ms - Host latency: 7.51357 ms (end to end 7.53464 ms, enqueue 6.93804 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.93208 ms - Host latency: 7.78987 ms (end to end 7.8106 ms, enqueue 7.21743 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.62034 ms - Host latency: 7.47761 ms (end to end 7.49661 ms, enqueue 6.90161 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.6426 ms - Host latency: 7.49976 ms (end to end 7.51746 ms, enqueue 6.92117 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.63574 ms - Host latency: 7.49211 ms (end to end 7.51099 ms, enqueue 6.91501 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.89504 ms - Host latency: 7.7533 ms (end to end 7.77488 ms, enqueue 7.18025 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.63799 ms - Host latency: 7.49519 ms (end to end 7.51658 ms, enqueue 6.91648 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.63477 ms - Host latency: 7.49651 ms (end to end 7.51589 ms, enqueue 6.91948 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.90891 ms - Host latency: 7.77058 ms (end to end 7.79053 ms, enqueue 7.19976 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.63335 ms - Host latency: 7.49631 ms (end to end 7.51633 ms, enqueue 6.91777 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.62571 ms - Host latency: 7.48035 ms (end to end 7.50151 ms, enqueue 6.90601 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.85823 ms - Host latency: 7.71882 ms (end to end 7.73696 ms, enqueue 7.14199 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.63821 ms - Host latency: 7.49695 ms (end to end 7.51633 ms, enqueue 6.92 ms)
[02/28/2022-12:02:27] [I] Average on 10 runs - GPU latency: 6.63862 ms - Host latency: 7.49687 ms (end to end 7.51648 ms, enqueue 6.92056 ms)
[02/28/2022-12:02:27] [I]
[02/28/2022-12:02:27] [I] === Performance summary ===
[02/28/2022-12:02:27] [I] Throughput: 140.354 qps
[02/28/2022-12:02:27] [I] Latency: min = 7.36816 ms, max = 10.5356 ms, mean = 7.62932 ms, median = 7.5097 ms, percentile(99%) = 10.1304 ms
[02/28/2022-12:02:27] [I] End-to-End Host Latency: min = 7.38403 ms, max = 10.5505 ms, mean = 7.64931 ms, median = 7.53174 ms, percentile(99%) = 10.1453 ms
[02/28/2022-12:02:27] [I] Enqueue Time: min = 6.8667 ms, max = 10.0222 ms, mean = 7.05518 ms, median = 6.93567 ms, percentile(99%) = 9.57861 ms
[02/28/2022-12:02:27] [I] H2D Latency: min = 0.391235 ms, max = 0.42627 ms, mean = 0.396532 ms, median = 0.395874 ms, percentile(99%) = 0.407715 ms
[02/28/2022-12:02:27] [I] GPU Compute Time: min = 6.60156 ms, max = 9.67358 ms, mean = 6.77153 ms, median = 6.65088 ms, percentile(99%) = 9.27539 ms
[02/28/2022-12:02:27] [I] D2H Latency: min = 0.370117 ms, max = 0.478027 ms, mean = 0.461261 ms, median = 0.461426 ms, percentile(99%) = 0.474121 ms
[02/28/2022-12:02:27] [I] Total Host Walltime: 2.15169 s
[02/28/2022-12:02:27] [I] Total GPU Compute Time: 2.045 s
[02/28/2022-12:02:27] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[02/28/2022-12:02:27] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[02/28/2022-12:02:27] [I] Explanations of the performance metrics are printed in the verbose logs.
[02/28/2022-12:02:27] [V]
[02/28/2022-12:02:27] [V] === Explanations of the performance metrics ===
[02/28/2022-12:02:27] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[02/28/2022-12:02:27] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[02/28/2022-12:02:27] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[02/28/2022-12:02:27] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[02/28/2022-12:02:27] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[02/28/2022-12:02:27] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[02/28/2022-12:02:27] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[02/28/2022-12:02:27] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[02/28/2022-12:02:27] [V] End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.
[02/28/2022-12:02:27] [I] &&&& PASSED TensorRT.trtexec [TensorRT v8001] # /home/storage08/yanzhao/tools/TensorRT-8.0.1.6/bin/trtexec --verbose --loadEngine=init.trt --shapes=input:4x19x80,input_lens:4,subsampling_cache:4x40x384,elayers_output_cache:12x4x40x384,conformer_cnn_cache:12x4x384x14,masks_cache:4x1x40,offset:4x1 --loadInputs=input:input.bin,input_lens:input_lens.bin,subsampling_cache:subsampling_cache.bin,elayers_output_cache:elayers_output_cache.bin,conformer_cnn_cache:conformer_cnn_cache.bin,masks_cache:masks_cache.bin,offset:offset.bin --exportOutput=trtinfer-init.trt.json
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd700.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70fc000000.
[02/28/2022-12:02:27] [V] [TRT] myelinFreeCb freeing GPU at 0x7f70ceabd900.
[02/28/2022-12:02:27] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU -7, GPU +0, now: CPU 4294963975, GPU 1235 (MiB)
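As a quick sanity check, the formulas stated in the verbose explanations above can be applied to the numbers in the performance summary. The following is a minimal Python sketch (not part of trtexec); all values are copied from this log, and small rounding differences are expected:

```python
# Cross-check the trtexec summary using the formulas from the verbose explanations.
# All numbers are copied from the log above (302 queries, times in ms unless noted).

queries = 302
total_host_walltime_s = 2.15169      # "Total Host Walltime"
mean_h2d_ms = 0.396532               # "H2D Latency" mean
mean_gpu_compute_ms = 6.77153        # "GPU Compute Time" mean
mean_d2h_ms = 0.461261               # "D2H Latency" mean

# Throughput = number of queries / Total Host Walltime
throughput_qps = queries / total_host_walltime_s
print(f"throughput ~ {throughput_qps:.3f} qps   (log reports 140.354 qps)")

# Latency = H2D Latency + GPU Compute Time + D2H Latency (per query)
mean_latency_ms = mean_h2d_ms + mean_gpu_compute_ms + mean_d2h_ms
print(f"mean latency ~ {mean_latency_ms:.5f} ms (log reports 7.62932 ms)")

# Total GPU Compute Time ~ queries * mean GPU Compute Time
total_gpu_compute_s = queries * mean_gpu_compute_ms / 1000.0
print(f"total GPU compute ~ {total_gpu_compute_s:.3f} s (log reports 2.045 s)")
```

Consistent with the "[W] * Throughput may be bound by Enqueue Time" warning, the mean Enqueue Time in the summary (7.05518 ms) is longer than the mean GPU Compute Time (6.77153 ms), which is exactly the condition the verbose explanation describes for an under-utilized GPU.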