Description
I converted wav2vec2 to ONNX and now I want to convert it to TensorRT (TRT) using the trtexec command. First I tried the NVIDIA TensorRT container (nvcr.io/nvidia/tensorrt:21.11-py3), which works correctly and converts the model successfully.
Then I tried to convert the ONNX model to TRT on my local machine: I installed the CUDA, cuDNN and TensorRT packages from .deb local repos, with the same versions as in the tensorrt:21.11-py3 container.
Environment
TensorRT Version: 8.0.3-1+cuda11.3
GPU Type: NVIDIA GeForce GTX 1650 Ti
Nvidia Driver Version: 495.29.05
CUDA Version: 11.5 (11.3 and 11.4 are also installed)
CUDNN Version: 8.3.1.22-1+cuda11.5
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.8.12
TensorFlow Version (if applicable): 2.7.0
PyTorch Version (if applicable): 1.10.0
Baremetal or Container (if container which image + tag): Baremetal
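To double-check that the local versions really match the container, here is a minimal sketch (assuming the tensorrt and torch Python bindings are installed) that prints them:

import tensorrt as trt
import torch

# Report the versions seen by Python, for comparison with the
# tensorrt:21.11-py3 container
print("TensorRT:", trt.__version__)
print("CUDA (PyTorch build):", torch.version.cuda)
print("cuDNN (PyTorch build):", torch.backends.cudnn.version())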
Relevant Files
Steps To Reproduce
1. Convert PyTorch to ONNX
First, convert wav2vec2 (PyTorch) to ONNX with the following code:
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

device = torch.device("cuda")
model_path = "facebook/wav2vec2-large-960h-lv60-self"

# Load the pretrained processor and model, and move the model to the GPU
processor = Wav2Vec2Processor.from_pretrained(model_path)
model = Wav2Vec2ForCTC.from_pretrained(model_path).to(device)

# Dummy raw-audio input: batch of 1, 3600 samples
dummy_input = torch.rand([1, 3600]).to(device)

input_names = ["input"]
output_names = ["output"]

# Export with the default opset (opset 9 for PyTorch 1.10, as the trtexec log below confirms)
torch.onnx.export(model, dummy_input, "wav2vec2.onnx", verbose=True,
                  input_names=input_names, output_names=output_names)
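Optionally, the export can be sanity-checked with the onnx package before handing it to trtexec. A quick sketch (the checker only validates the graph against the ONNX spec, it does not run it):

import onnx

# Load the exported graph and validate its structure
model_onnx = onnx.load("wav2vec2.onnx")
onnx.checker.check_model(model_onnx)
print("opset:", model_onnx.opset_import[0].version)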
2. Use trtexec to Convert ONNX to TensorRT
Now use the trtexec command to convert the ONNX model to TensorRT:
$ trtexec --onnx=wav2vec2.onnx --saveEngine=test.trt
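For reference, the same build can also be attempted through the TensorRT Python API, which helps isolate whether a failure is specific to the trtexec binary. A minimal sketch (assuming the tensorrt Python bindings from the same 8.0.3 install):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# The wav2vec2 export uses explicit batch, so create an explicit-batch network
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("wav2vec2.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.max_workspace_size = 2 << 30  # 2 GiB; trtexec defaults to only 16 MiB

# build_serialized_network is the TensorRT 8.x API; it returns the engine bytes
engine = builder.build_serialized_network(network, config)
with open("test.trt", "wb") as f:
    f.write(engine)

If this sketch also aborts, the problem lies in the builder libraries rather than in trtexec itself. Note also that trtexec's default 16 MiB workspace can be raised with the --workspace flag (e.g. --workspace=2048).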
3. Traceback
This is the output I get:
$ trtexec --onnx=wav2vec2.onnx --saveEngine=test.trt
&&&& RUNNING TensorRT.trtexec [TensorRT v8003] # trtexec --onnx=wav2vec2.onnx --saveEngine=test.trt
[12/12/2021-12:56:40] [I] === Model Options ===
[12/12/2021-12:56:40] [I] Format: ONNX
[12/12/2021-12:56:40] [I] Model: wav2vec2.onnx
[12/12/2021-12:56:40] [I] Output:
[12/12/2021-12:56:40] [I] === Build Options ===
[12/12/2021-12:56:40] [I] Max batch: explicit
[12/12/2021-12:56:40] [I] Workspace: 16 MiB
[12/12/2021-12:56:40] [I] minTiming: 1
[12/12/2021-12:56:40] [I] avgTiming: 8
[12/12/2021-12:56:40] [I] Precision: FP32
[12/12/2021-12:56:40] [I] Calibration:
[12/12/2021-12:56:40] [I] Refit: Disabled
[12/12/2021-12:56:40] [I] Sparsity: Disabled
[12/12/2021-12:56:40] [I] Safe mode: Disabled
[12/12/2021-12:56:40] [I] Restricted mode: Disabled
[12/12/2021-12:56:40] [I] Save engine: test.trt
[12/12/2021-12:56:40] [I] Load engine:
[12/12/2021-12:56:40] [I] NVTX verbosity: 0
[12/12/2021-12:56:40] [I] Tactic sources: Using default tactic sources
[12/12/2021-12:56:40] [I] timingCacheMode: local
[12/12/2021-12:56:40] [I] timingCacheFile:
[12/12/2021-12:56:40] [I] Input(s)s format: fp32:CHW
[12/12/2021-12:56:40] [I] Output(s)s format: fp32:CHW
[12/12/2021-12:56:40] [I] Input build shapes: model
[12/12/2021-12:56:40] [I] Input calibration shapes: model
[12/12/2021-12:56:40] [I] === System Options ===
[12/12/2021-12:56:40] [I] Device: 0
[12/12/2021-12:56:40] [I] DLACore:
[12/12/2021-12:56:40] [I] Plugins:
[12/12/2021-12:56:40] [I] === Inference Options ===
[12/12/2021-12:56:40] [I] Batch: Explicit
[12/12/2021-12:56:40] [I] Input inference shapes: model
[12/12/2021-12:56:40] [I] Iterations: 10
[12/12/2021-12:56:40] [I] Duration: 3s (+ 200ms warm up)
[12/12/2021-12:56:40] [I] Sleep time: 0ms
[12/12/2021-12:56:40] [I] Streams: 1
[12/12/2021-12:56:40] [I] ExposeDMA: Disabled
[12/12/2021-12:56:40] [I] Data transfers: Enabled
[12/12/2021-12:56:40] [I] Spin-wait: Disabled
[12/12/2021-12:56:40] [I] Multithreading: Disabled
[12/12/2021-12:56:40] [I] CUDA Graph: Disabled
[12/12/2021-12:56:40] [I] Separate profiling: Disabled
[12/12/2021-12:56:40] [I] Time Deserialize: Disabled
[12/12/2021-12:56:40] [I] Time Refit: Disabled
[12/12/2021-12:56:40] [I] Skip inference: Disabled
[12/12/2021-12:56:40] [I] Inputs:
[12/12/2021-12:56:40] [I] === Reporting Options ===
[12/12/2021-12:56:40] [I] Verbose: Disabled
[12/12/2021-12:56:40] [I] Averages: 10 inferences
[12/12/2021-12:56:40] [I] Percentile: 99
[12/12/2021-12:56:40] [I] Dump refittable layers:Disabled
[12/12/2021-12:56:40] [I] Dump output: Disabled
[12/12/2021-12:56:40] [I] Profile: Disabled
[12/12/2021-12:56:40] [I] Export timing to JSON file:
[12/12/2021-12:56:40] [I] Export output to JSON file:
[12/12/2021-12:56:40] [I] Export profile to JSON file:
[12/12/2021-12:56:40] [I]
[12/12/2021-12:56:40] [I] === Device Information ===
[12/12/2021-12:56:40] [I] Selected Device: NVIDIA GeForce GTX 1650 Ti
[12/12/2021-12:56:40] [I] Compute Capability: 7.5
[12/12/2021-12:56:40] [I] SMs: 16
[12/12/2021-12:56:40] [I] Compute Clock Rate: 1.485 GHz
[12/12/2021-12:56:40] [I] Device Global Memory: 3903 MiB
[12/12/2021-12:56:40] [I] Shared Memory per SM: 64 KiB
[12/12/2021-12:56:40] [I] Memory Bus Width: 128 bits (ECC disabled)
[12/12/2021-12:56:40] [I] Memory Clock Rate: 6.001 GHz
[12/12/2021-12:56:40] [I]
[12/12/2021-12:56:40] [I] TensorRT version: 8003
[12/12/2021-12:56:41] [I] [TRT] [MemUsageChange] Init CUDA: CPU +330, GPU +0, now: CPU 338, GPU 623 (MiB)
[12/12/2021-12:56:41] [I] Start parsing network model
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1264813709
[12/12/2021-12:56:41] [I] [TRT] ----------------------------------------------------------------
[12/12/2021-12:56:41] [I] [TRT] Input filename: wav2vec2.onnx
[12/12/2021-12:56:41] [I] [TRT] ONNX IR version: 0.0.7
[12/12/2021-12:56:41] [I] [TRT] Opset version: 9
[12/12/2021-12:56:41] [I] [TRT] Producer name: pytorch
[12/12/2021-12:56:41] [I] [TRT] Producer version: 1.10
[12/12/2021-12:56:41] [I] [TRT] Domain:
[12/12/2021-12:56:41] [I] [TRT] Model version: 0
[12/12/2021-12:56:41] [I] [TRT] Doc string:
[12/12/2021-12:56:41] [I] [TRT] ----------------------------------------------------------------
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1264813709
[12/12/2021-12:56:42] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/12/2021-12:56:44] [I] Finish parsing network model
[12/12/2021-12:56:44] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1551, GPU 632 (MiB)
[12/12/2021-12:56:44] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 1551 MiB, GPU 632 MiB
[12/12/2021-12:56:48] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +517, GPU +224, now: CPU 2071, GPU 857 (MiB)
[12/12/2021-12:56:48] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +115, GPU +52, now: CPU 2186, GPU 909 (MiB)
[12/12/2021-12:56:48] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[12/12/2021-12:57:19] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
free(): double free detected in tcache 2
Aborted (core dumped)
But in the tensorrt:21.11-py3 container, everything completes successfully. Are there any suggestions for addressing this issue?