Description
I used trtexec to build a TensorRT engine from an ONNX model, and the built-in inference benchmark completed successfully:
export LD_LIBRARY_PATH=/mnt/local-storage/zhuangzhong/TensorRT-10.0.1.6/lib && /mnt/local-storage/zhuangzhong/TensorRT-10.0.1.6/bin/trtexec --onnx=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.onnx --saveEngine=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v100001] # /mnt/local-storage/zhuangzhong/TensorRT-10.0.1.6/bin/trtexec --onnx=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.onnx --saveEngine=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.engine
[05/29/2024-02:54:11] [I] === Model Options ===
[05/29/2024-02:54:11] [I] Format: ONNX
[05/29/2024-02:54:11] [I] Model: /mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.onnx
[05/29/2024-02:54:11] [I] Output:
[05/29/2024-02:54:11] [I] === Build Options ===
[05/29/2024-02:54:11] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[05/29/2024-02:54:11] [I] avgTiming: 8
[05/29/2024-02:54:11] [I] Precision: FP32
[05/29/2024-02:54:11] [I] LayerPrecisions:
[05/29/2024-02:54:11] [I] Layer Device Types:
[05/29/2024-02:54:11] [I] Calibration:
[05/29/2024-02:54:11] [I] Refit: Disabled
[05/29/2024-02:54:11] [I] Strip weights: Disabled
[05/29/2024-02:54:11] [I] Version Compatible: Disabled
[05/29/2024-02:54:11] [I] ONNX Plugin InstanceNorm: Disabled
[05/29/2024-02:54:11] [I] TensorRT runtime: full
[05/29/2024-02:54:11] [I] Lean DLL Path:
[05/29/2024-02:54:11] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[05/29/2024-02:54:11] [I] Exclude Lean Runtime: Disabled
[05/29/2024-02:54:11] [I] Sparsity: Disabled
[05/29/2024-02:54:11] [I] Safe mode: Disabled
[05/29/2024-02:54:11] [I] Build DLA standalone loadable: Disabled
[05/29/2024-02:54:11] [I] Allow GPU fallback for DLA: Disabled
[05/29/2024-02:54:11] [I] DirectIO mode: Disabled
[05/29/2024-02:54:11] [I] Restricted mode: Disabled
[05/29/2024-02:54:11] [I] Skip inference: Disabled
[05/29/2024-02:54:11] [I] Save engine: /mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.engine
[05/29/2024-02:54:11] [I] Load engine:
[05/29/2024-02:54:11] [I] Profiling verbosity: 0
[05/29/2024-02:54:11] [I] Tactic sources: Using default tactic sources
[05/29/2024-02:54:11] [I] timingCacheMode: local
[05/29/2024-02:54:11] [I] timingCacheFile:
[05/29/2024-02:54:11] [I] Enable Compilation Cache: Enabled
[05/29/2024-02:54:11] [I] errorOnTimingCacheMiss: Disabled
[05/29/2024-02:54:11] [I] Preview Features: Use default preview flags.
[05/29/2024-02:54:11] [I] MaxAuxStreams: -1
[05/29/2024-02:54:11] [I] BuilderOptimizationLevel: -1
[05/29/2024-02:54:11] [I] Calibration Profile Index: 0
[05/29/2024-02:54:11] [I] Weight Streaming: Disabled
[05/29/2024-02:54:11] [I] Debug Tensors:
[05/29/2024-02:54:11] [I] Input(s)s format: fp32:CHW
[05/29/2024-02:54:11] [I] Output(s)s format: fp32:CHW
[05/29/2024-02:54:11] [I] Input build shapes: model
[05/29/2024-02:54:11] [I] Input calibration shapes: model
[05/29/2024-02:54:11] [I] === System Options ===
[05/29/2024-02:54:11] [I] Device: 0
[05/29/2024-02:54:11] [I] DLACore:
[05/29/2024-02:54:11] [I] Plugins:
[05/29/2024-02:54:11] [I] setPluginsToSerialize:
[05/29/2024-02:54:11] [I] dynamicPlugins:
[05/29/2024-02:54:11] [I] ignoreParsedPluginLibs: 0
[05/29/2024-02:54:11] [I]
[05/29/2024-02:54:11] [I] === Inference Options ===
[05/29/2024-02:54:11] [I] Batch: Explicit
[05/29/2024-02:54:11] [I] Input inference shapes: model
[05/29/2024-02:54:11] [I] Iterations: 10
[05/29/2024-02:54:11] [I] Duration: 3s (+ 200ms warm up)
[05/29/2024-02:54:11] [I] Sleep time: 0ms
[05/29/2024-02:54:11] [I] Idle time: 0ms
[05/29/2024-02:54:11] [I] Inference Streams: 1
[05/29/2024-02:54:11] [I] ExposeDMA: Disabled
[05/29/2024-02:54:11] [I] Data transfers: Enabled
[05/29/2024-02:54:11] [I] Spin-wait: Disabled
[05/29/2024-02:54:11] [I] Multithreading: Disabled
[05/29/2024-02:54:11] [I] CUDA Graph: Disabled
[05/29/2024-02:54:11] [I] Separate profiling: Disabled
[05/29/2024-02:54:11] [I] Time Deserialize: Disabled
[05/29/2024-02:54:11] [I] Time Refit: Disabled
[05/29/2024-02:54:11] [I] NVTX verbosity: 0
[05/29/2024-02:54:11] [I] Persistent Cache Ratio: 0
[05/29/2024-02:54:11] [I] Optimization Profile Index: 0
[05/29/2024-02:54:11] [I] Weight Streaming Budget: Disabled
[05/29/2024-02:54:11] [I] Inputs:
[05/29/2024-02:54:11] [I] Debug Tensor Save Destinations:
[05/29/2024-02:54:11] [I] === Reporting Options ===
[05/29/2024-02:54:11] [I] Verbose: Disabled
[05/29/2024-02:54:11] [I] Averages: 10 inferences
[05/29/2024-02:54:11] [I] Percentiles: 90,95,99
[05/29/2024-02:54:11] [I] Dump refittable layers:Disabled
[05/29/2024-02:54:11] [I] Dump output: Disabled
[05/29/2024-02:54:11] [I] Profile: Disabled
[05/29/2024-02:54:11] [I] Export timing to JSON file:
[05/29/2024-02:54:11] [I] Export output to JSON file:
[05/29/2024-02:54:11] [I] Export profile to JSON file:
[05/29/2024-02:54:11] [I]
[05/29/2024-02:54:11] [I] === Device Information ===
[05/29/2024-02:54:11] [I] Available Devices:
[05/29/2024-02:54:11] [I] Device 0: "NVIDIA GeForce RTX 4090" UUID: GPU-8070fe18-319a-8921-09f9-7f443122cd47
[05/29/2024-02:54:11] [I] Device 1: "NVIDIA GeForce RTX 4090" UUID: GPU-ff36ee1c-1cfe-9173-fa43-5cdb12d3a8e3
[05/29/2024-02:54:11] [I] Device 2: "NVIDIA GeForce RTX 4090" UUID: GPU-3765fd86-200c-f0c0-cdd0-e1161facb7c7
[05/29/2024-02:54:11] [I] Device 3: "NVIDIA GeForce RTX 4090" UUID: GPU-0cf62d0d-aca1-50db-577f-bd27025a437c
[05/29/2024-02:54:11] [I] Device 4: "NVIDIA GeForce RTX 4090" UUID: GPU-e62b7988-483b-d0c1-75e3-0bfd55cc9b33
[05/29/2024-02:54:11] [I] Device 5: "NVIDIA GeForce RTX 4090" UUID: GPU-7434de8b-1d0e-646e-7f06-a190a9e252ca
[05/29/2024-02:54:11] [I] Device 6: "NVIDIA GeForce RTX 4090" UUID: GPU-efe5534a-7d6e-544c-3de6-548b7e13d73f
[05/29/2024-02:54:11] [I] Device 7: "NVIDIA GeForce RTX 4090" UUID: GPU-79a8d39e-b9cb-d0f5-6c64-ea49a6dac3a0
[05/29/2024-02:54:11] [I] Selected Device: NVIDIA GeForce RTX 4090
[05/29/2024-02:54:11] [I] Selected Device ID: 0
[05/29/2024-02:54:11] [I] Selected Device UUID: GPU-8070fe18-319a-8921-09f9-7f443122cd47
[05/29/2024-02:54:11] [I] Compute Capability: 8.9
[05/29/2024-02:54:11] [I] SMs: 128
[05/29/2024-02:54:11] [I] Device Global Memory: 24217 MiB
[05/29/2024-02:54:11] [I] Shared Memory per SM: 100 KiB
[05/29/2024-02:54:11] [I] Memory Bus Width: 384 bits (ECC disabled)
[05/29/2024-02:54:11] [I] Application Compute Clock Rate: 2.535 GHz
[05/29/2024-02:54:11] [I] Application Memory Clock Rate: 10.501 GHz
[05/29/2024-02:54:11] [I]
[05/29/2024-02:54:11] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[05/29/2024-02:54:11] [I]
[05/29/2024-02:54:11] [I] TensorRT version: 10.0.1
[05/29/2024-02:54:11] [I] Loading standard plugins
[05/29/2024-02:54:11] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 16, GPU 391 (MiB)
[05/29/2024-02:54:12] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1760, GPU +314, now: CPU 1912, GPU 705 (MiB)
[05/29/2024-02:54:12] [I] Start parsing network model.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 657502289
[05/29/2024-02:54:13] [I] [TRT] ----------------------------------------------------------------
[05/29/2024-02:54:13] [I] [TRT] Input filename: /mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.onnx
[05/29/2024-02:54:13] [I] [TRT] ONNX IR version: 0.0.7
[05/29/2024-02:54:13] [I] [TRT] Opset version: 14
[05/29/2024-02:54:13] [I] [TRT] Producer name: pytorch
[05/29/2024-02:54:13] [I] [TRT] Producer version: 2.3.0
[05/29/2024-02:54:13] [I] [TRT] Domain:
[05/29/2024-02:54:13] [I] [TRT] Model version: 0
[05/29/2024-02:54:13] [I] [TRT] Doc string:
[05/29/2024-02:54:13] [I] [TRT] ----------------------------------------------------------------
[05/29/2024-02:54:13] [I] Finished parsing network model. Parse time: 0.632842
[05/29/2024-02:54:13] [W] Dynamic dimensions required for input: speech_lengths, but no shapes were provided. Automatically overriding shape to: 1
[05/29/2024-02:54:13] [I] Set input shape tensor speech_lengths for optimization profile 0 to: MIN=1 OPT=1 MAX=1
[05/29/2024-02:54:13] [W] [TRT] [RemoveDeadLayers] Input Tensor speech_lengths is unused or used only at compile-time, but is not being removed.
[05/29/2024-02:54:13] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/29/2024-02:54:19] [I] [TRT] [GraphReduction] The approximate region cut reduction algorithm is called.
[05/29/2024-02:54:19] [I] [TRT] Detected 2 inputs and 2 output network tensors.
[05/29/2024-02:54:20] [I] [TRT] Total Host Persistent Memory: 268032
[05/29/2024-02:54:20] [I] [TRT] Total Device Persistent Memory: 0
[05/29/2024-02:54:20] [I] [TRT] Total Scratch Memory: 1242112
[05/29/2024-02:54:20] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 553 steps to complete.
[05/29/2024-02:54:20] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 14.7161ms to assign 14 blocks to 553 nodes requiring 2677248 bytes.
[05/29/2024-02:54:20] [I] [TRT] Total Activation Memory: 2676736
[05/29/2024-02:54:20] [I] [TRT] Total Weights Memory: 632017152
[05/29/2024-02:54:20] [I] [TRT] Engine generation completed in 6.94411 seconds.
[05/29/2024-02:54:20] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 603 MiB
[05/29/2024-02:54:20] [I] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 4363 MiB
[05/29/2024-02:54:20] [I] Engine built in 7.55418 sec.
[05/29/2024-02:54:20] [I] Created engine with size: 630.068 MiB
[05/29/2024-02:54:22] [I] [TRT] Loaded engine size: 630 MiB
[05/29/2024-02:54:22] [I] Engine deserialized in 0.308859 sec.
[05/29/2024-02:54:22] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +3, now: CPU 0, GPU 605 (MiB)
[05/29/2024-02:54:22] [I] Setting persistentCacheLimit to 0 bytes.
[05/29/2024-02:54:22] [W] Values missing for input shape tensor: speech_lengthsAutomatically setting values to: 1
[05/29/2024-02:54:22] [I] Set input shape tensor speech_lengths to: 1
[05/29/2024-02:54:22] [I] Created execution context with device memory size: 2.55273 MiB
[05/29/2024-02:54:22] [I] Using random values for input speech
[05/29/2024-02:54:22] [I] Input binding for speech with dimensions 1x50x560 is created.
[05/29/2024-02:54:22] [I] Using random values for input speech_lengths
[05/29/2024-02:54:22] [I] Input binding for speech_lengths with dimensions 1 is created.
[05/29/2024-02:54:22] [I] Output binding for encoder_out with dimensions 1x50x512 is created.
[05/29/2024-02:54:22] [I] Output binding for token_num with dimensions 1 is created.
[05/29/2024-02:54:22] [I] Starting inference
[05/29/2024-02:54:25] [I] Warmup completed 63 queries over 200 ms
[05/29/2024-02:54:25] [I] Timing trace has 1007 queries over 3.00925 s
[05/29/2024-02:54:25] [I]
[05/29/2024-02:54:25] [I] === Trace details ===
[05/29/2024-02:54:25] [I] Trace averages of 10 runs:
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 3.17921 ms - Host latency: 3.20023 ms (enqueue 1.97693 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 3.17829 ms - Host latency: 3.19929 ms (enqueue 1.98038 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 3.02613 ms - Host latency: 3.04781 ms (enqueue 1.89464 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97871 ms - Host latency: 2.99977 ms (enqueue 1.85199 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9779 ms - Host latency: 2.99878 ms (enqueue 1.85136 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97943 ms - Host latency: 3.0003 ms (enqueue 1.85306 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97943 ms - Host latency: 3.00099 ms (enqueue 1.853 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.98035 ms - Host latency: 3.00204 ms (enqueue 1.81691 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97799 ms - Host latency: 2.99872 ms (enqueue 1.86256 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9781 ms - Host latency: 2.99976 ms (enqueue 1.85029 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9779 ms - Host latency: 2.99912 ms (enqueue 1.84964 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97769 ms - Host latency: 2.99877 ms (enqueue 1.84998 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97881 ms - Host latency: 2.99927 ms (enqueue 1.85559 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97727 ms - Host latency: 2.99816 ms (enqueue 1.84687 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97787 ms - Host latency: 2.99885 ms (enqueue 1.85001 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9782 ms - Host latency: 2.9993 ms (enqueue 1.85407 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9783 ms - Host latency: 2.99943 ms (enqueue 1.85168 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9784 ms - Host latency: 2.99906 ms (enqueue 1.85209 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97799 ms - Host latency: 2.99916 ms (enqueue 1.84691 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97819 ms - Host latency: 2.9991 ms (enqueue 1.84547 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.978 ms - Host latency: 2.99881 ms (enqueue 1.84897 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9778 ms - Host latency: 2.99895 ms (enqueue 1.84828 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97902 ms - Host latency: 3.00034 ms (enqueue 1.85102 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97871 ms - Host latency: 2.99943 ms (enqueue 1.85173 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.98024 ms - Host latency: 3.00153 ms (enqueue 1.8485 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97799 ms - Host latency: 2.99877 ms (enqueue 1.85088 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97778 ms - Host latency: 2.99881 ms (enqueue 1.84941 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97819 ms - Host latency: 2.99902 ms (enqueue 1.85134 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97964 ms - Host latency: 3.00033 ms (enqueue 1.85685 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.98015 ms - Host latency: 3.00155 ms (enqueue 1.84985 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.979 ms - Host latency: 2.9991 ms (enqueue 1.85642 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97819 ms - Host latency: 2.999 ms (enqueue 1.85294 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97887 ms - Host latency: 2.99963 ms (enqueue 1.84971 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97925 ms - Host latency: 3.00031 ms (enqueue 1.8532 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97891 ms - Host latency: 2.99999 ms (enqueue 1.84956 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.98021 ms - Host latency: 3.00105 ms (enqueue 1.85057 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97789 ms - Host latency: 2.99888 ms (enqueue 1.85332 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.978 ms - Host latency: 2.99866 ms (enqueue 1.85165 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9781 ms - Host latency: 2.99882 ms (enqueue 1.85085 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97942 ms - Host latency: 3.00029 ms (enqueue 1.85186 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97972 ms - Host latency: 3.00065 ms (enqueue 1.84847 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97913 ms - Host latency: 3.00031 ms (enqueue 1.84991 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97972 ms - Host latency: 3.00067 ms (enqueue 1.85428 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97882 ms - Host latency: 2.99955 ms (enqueue 1.85068 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.98015 ms - Host latency: 3.00089 ms (enqueue 1.85258 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97761 ms - Host latency: 2.9994 ms (enqueue 1.84553 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97819 ms - Host latency: 2.99889 ms (enqueue 1.96732 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97782 ms - Host latency: 2.99874 ms (enqueue 1.85408 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97935 ms - Host latency: 3.00051 ms (enqueue 1.85073 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97874 ms - Host latency: 2.99933 ms (enqueue 1.85159 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97925 ms - Host latency: 3.00016 ms (enqueue 1.85082 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97798 ms - Host latency: 2.9984 ms (enqueue 1.85258 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97748 ms - Host latency: 2.9985 ms (enqueue 1.84805 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97852 ms - Host latency: 2.99921 ms (enqueue 1.85267 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97954 ms - Host latency: 3.00033 ms (enqueue 1.85421 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97981 ms - Host latency: 3.00092 ms (enqueue 1.85297 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9781 ms - Host latency: 2.99899 ms (enqueue 1.85111 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97925 ms - Host latency: 3.0021 ms (enqueue 1.84869 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97861 ms - Host latency: 2.9994 ms (enqueue 1.84918 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9778 ms - Host latency: 2.99891 ms (enqueue 1.85002 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97782 ms - Host latency: 2.99847 ms (enqueue 1.85129 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.978 ms - Host latency: 2.99889 ms (enqueue 1.85127 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97849 ms - Host latency: 2.99949 ms (enqueue 1.84858 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97771 ms - Host latency: 2.99841 ms (enqueue 1.85659 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97734 ms - Host latency: 2.99805 ms (enqueue 1.85562 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97847 ms - Host latency: 2.99934 ms (enqueue 1.85945 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97769 ms - Host latency: 2.99863 ms (enqueue 1.85483 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97947 ms - Host latency: 3.00093 ms (enqueue 1.85298 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97805 ms - Host latency: 2.9988 ms (enqueue 1.85149 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97888 ms - Host latency: 3.0002 ms (enqueue 1.84775 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97766 ms - Host latency: 2.99817 ms (enqueue 1.85303 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97859 ms - Host latency: 2.99976 ms (enqueue 1.84524 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97944 ms - Host latency: 3.00068 ms (enqueue 1.84783 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97891 ms - Host latency: 3.00039 ms (enqueue 1.85098 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.98022 ms - Host latency: 3.00107 ms (enqueue 1.85093 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9783 ms - Host latency: 2.999 ms (enqueue 1.85256 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97993 ms - Host latency: 3.00205 ms (enqueue 1.84807 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97996 ms - Host latency: 3.00103 ms (enqueue 1.85039 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97891 ms - Host latency: 2.99976 ms (enqueue 1.85471 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97756 ms - Host latency: 2.99866 ms (enqueue 1.85386 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97827 ms - Host latency: 2.99956 ms (enqueue 1.84517 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97798 ms - Host latency: 2.99861 ms (enqueue 1.8543 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97769 ms - Host latency: 2.99868 ms (enqueue 1.84785 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.9791 ms - Host latency: 3.00046 ms (enqueue 1.84868 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97791 ms - Host latency: 2.99829 ms (enqueue 1.85579 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97856 ms - Host latency: 2.99985 ms (enqueue 1.85256 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97815 ms - Host latency: 2.99934 ms (enqueue 1.84851 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97747 ms - Host latency: 2.99839 ms (enqueue 1.84844 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97834 ms - Host latency: 2.99915 ms (enqueue 1.8499 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97988 ms - Host latency: 3.00085 ms (enqueue 1.8541 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97827 ms - Host latency: 2.99966 ms (enqueue 1.84546 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97773 ms - Host latency: 2.9988 ms (enqueue 1.85298 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97891 ms - Host latency: 3.00015 ms (enqueue 1.85032 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97727 ms - Host latency: 2.99861 ms (enqueue 1.84685 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97937 ms - Host latency: 3.00022 ms (enqueue 1.85291 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97893 ms - Host latency: 2.99958 ms (enqueue 1.85239 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97856 ms - Host latency: 2.99954 ms (enqueue 1.84795 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97947 ms - Host latency: 3.00059 ms (enqueue 1.85183 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97712 ms - Host latency: 2.9978 ms (enqueue 1.84915 ms)
[05/29/2024-02:54:25] [I] Average on 10 runs - GPU latency: 2.97861 ms - Host latency: 2.99944 ms (enqueue 1.92268 ms)
[05/29/2024-02:54:25] [I]
[05/29/2024-02:54:25] [I] === Performance summary ===
[05/29/2024-02:54:25] [I] Throughput: 334.634 qps
[05/29/2024-02:54:25] [I] Latency: min = 2.9931 ms, max = 3.20235 ms, mean = 3.00403 ms, median = 2.99963 ms, percentile(90%) = 3.00293 ms, percentile(95%) = 3.00409 ms, percentile(99%) = 3.20033 ms
[05/29/2024-02:54:25] [I] Enqueue Time: min = 1.79886 ms, max = 3.00439 ms, mean = 1.85568 ms, median = 1.84735 ms, percentile(90%) = 1.86761 ms, percentile(95%) = 1.86987 ms, percentile(99%) = 1.97707 ms
[05/29/2024-02:54:25] [I] H2D Latency: min = 0.00976562 ms, max = 0.0216064 ms, mean = 0.0118168 ms, median = 0.012085 ms, percentile(90%) = 0.0126953 ms, percentile(95%) = 0.0129395 ms, percentile(99%) = 0.0134277 ms
[05/29/2024-02:54:25] [I] GPU Compute Time: min = 2.97266 ms, max = 3.18156 ms, mean = 2.98304 ms, median = 2.97876 ms, percentile(90%) = 2.98181 ms, percentile(95%) = 2.98193 ms, percentile(99%) = 3.17952 ms
[05/29/2024-02:54:25] [I] D2H Latency: min = 0.00805664 ms, max = 0.0115356 ms, mean = 0.00918606 ms, median = 0.0090332 ms, percentile(90%) = 0.0101318 ms, percentile(95%) = 0.0106201 ms, percentile(99%) = 0.0112305 ms
[05/29/2024-02:54:25] [I] Total Host Walltime: 3.00925 s
[05/29/2024-02:54:25] [I] Total GPU Compute Time: 3.00392 s
[05/29/2024-02:54:25] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/29/2024-02:54:25] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # /mnt/local-storage/zhuangzhong/TensorRT-10.0.1.6/bin/trtexec --onnx=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.onnx --saveEngine=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.engine
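Incidentally, the build log above warns that speech_lengths has dynamic dimensions and that trtexec automatically pinned it to 1. For the record, the same pinning can presumably be requested explicitly via the shape flags (the values below just mirror what trtexec chose automatically; a real deployment would widen MIN/MAX to the actual length range):

export LD_LIBRARY_PATH=/mnt/local-storage/zhuangzhong/TensorRT-10.0.1.6/lib && /mnt/local-storage/zhuangzhong/TensorRT-10.0.1.6/bin/trtexec --onnx=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.onnx --saveEngine=/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.engine --minShapes=speech_lengths:1 --optShapes=speech_lengths:1 --maxShapes=speech_lengths:1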
However, when I run inference with a Python script, a segmentation fault occurs right after self.context.execute_async_v3(stream_handle=self.stream.handle). Python script:
import os
import sys
import time
import argparse
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
import torch
from collections import OrderedDict


class TensorRTInference:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.INFO)
        self.runtime = trt.Runtime(self.logger)
        self.engine = self.load_engine(engine_path)
        self.context = self.engine.create_execution_context()
        # Allocate buffers
        self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers(
            self.engine
        )

    def load_engine(self, engine_path):
        with open(engine_path, "rb") as f:
            engine = self.runtime.deserialize_cuda_engine(f.read())
        return engine

    class HostDeviceMem:
        def __init__(self, host_mem, device_mem):
            self.host = host_mem
            self.device = device_mem

    def allocate_buffers(self, engine):
        inputs, outputs, bindings = [], [], []
        stream = cuda.Stream()
        for i in range(engine.num_io_tensors):
            tensor_name = engine.get_tensor_name(i)
            size = trt.volume(engine.get_tensor_shape(tensor_name))
            dtype = trt.nptype(engine.get_tensor_dtype(tensor_name))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer address to device bindings
            bindings.append(int(device_mem))
            # Append to the appropriate input/output list
            if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
                inputs.append(self.HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(self.HostDeviceMem(host_mem, device_mem))
        return inputs, outputs, bindings, stream

    def infer(self, input_data):
        # Set tensor addresses
        for i in range(self.engine.num_io_tensors):
            self.context.set_tensor_address(
                self.engine.get_tensor_name(i), self.bindings[i]
            )
        # Transfer input data to device
        for i, inp in enumerate(input_data):
            np.copyto(self.inputs[i].host, inp.ravel())
            cuda.memcpy_htod_async(
                self.inputs[i].device, self.inputs[i].host, self.stream
            )
        assert self.context.all_binding_shapes_specified
        import pdb
        pdb.set_trace()  # breakpoint left in from my debugging
        # Run inference
        self.context.execute_async_v3(stream_handle=self.stream.handle)
        # Transfer predictions back
        cuda.memcpy_dtoh_async(
            self.outputs[0].host, self.outputs[0].device, self.stream
        )
        # Synchronize the stream
        self.stream.synchronize()
        return self.outputs[0].host


if __name__ == "__main__":
    # Load model
    engine_path = "/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.engine"
    trt_engine = TensorRTInference(engine_path)

    feats = np.load("/mnt/local-storage/zhuangzhong/onnx_workspace/feats.npy")
    feats_len = np.load("/mnt/local-storage/zhuangzhong/onnx_workspace/feats_len.npy")
    # feats = torch.from_numpy(feats).cuda()
    # feats_len = torch.from_numpy(feats_len).cuda()
    result = trt_engine.infer((feats, feats_len))
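One detail I am unsure about: trtexec reports speech_lengths as an input shape tensor, and allocate_buffers above hands execute_async_v3 a device address for every I/O tensor, while token_num may have a data-dependent size. A small, untested diagnostic along these lines (standard TensorRT 10 Python API; engine_path as defined above) would show what the engine expects for each tensor; a HOST-located shape tensor being given a device pointer, or an output dimension reported as -1, might be related to the crash:

import tensorrt as trt

# Untested diagnostic sketch: dump what the engine expects for each I/O tensor.
logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open(engine_path, "rb") as f:  # engine_path as defined above
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(
        name,
        engine.get_tensor_mode(name),        # INPUT or OUTPUT
        engine.get_tensor_shape(name),       # -1 marks a dynamic dimension
        engine.get_tensor_dtype(name),
        engine.get_tensor_location(name),    # DEVICE vs HOST (shape tensors are host-located)
        engine.is_shape_inference_io(name),  # True for shape tensors
    )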
The ONNX model itself runs inference correctly. It is too large to upload; if you need it, just tell me.
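For reference, a minimal way to reproduce that check (assuming onnxruntime is available; input names are taken from the trtexec log, and the output order encoder_out, token_num is an assumption):

import numpy as np
import onnxruntime as ort

# Sanity-check the ONNX model directly (paths as above).
sess = ort.InferenceSession(
    "/mnt/local-storage/zhuangzhong/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch-split/model.onnx",
    providers=["CPUExecutionProvider"],
)
feats = np.load("/mnt/local-storage/zhuangzhong/onnx_workspace/feats.npy")
feats_len = np.load("/mnt/local-storage/zhuangzhong/onnx_workspace/feats_len.npy")
encoder_out, token_num = sess.run(None, {"speech": feats, "speech_lengths": feats_len})
print(encoder_out.shape, token_num)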
Thank you.
Environment
TensorRT Version: 10.0.1.6
GPU Type: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 535.154.05
CUDA Version: 12.2
CUDNN Version: 8.9.2 (reported as 8902)
Operating System + Version: Ubuntu 22.04.3 LTS
Python Version (if applicable): 3.8
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.3.0
Baremetal or Container (if container which image + tag): Baremetal