Description
When running inference with the TRT model in C++, I got an error like this:
…/rtSafe/cuda/cudaConvolutionRunner.cpp (483) - Cudnn Error in executeConv: 3 (CUDNN_STATUS_BAD_PARAM)
FAILED_EXECUTION: std::exception
Moreover, the error only happens when I use fp32 mode; int8 mode works fine.
I am also sure the input is fp32, so the data type is correct.
I converted the model from pb to ONNX and then converted the ONNX model to a TRT engine.
Can you suggest any solutions? Thanks.
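(For reference, a minimal Python sketch like the one below can dump what the engine actually expects for each binding; `model.trt` is only a placeholder for the real engine file.)
```
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(TRT_LOGGER)

# Deserialize the engine (placeholder path) and print each binding's name,
# data type, and shape so the C++ inference code can be checked against
# what the engine actually expects.
with open('model.trt', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = 'input ' if engine.binding_is_input(i) else 'output'
    print(kind,
          engine.get_binding_name(i),
          engine.get_binding_dtype(i),   # DataType.FLOAT for an FP32 binding
          engine.get_binding_shape(i))   # -1 marks a dynamic dimension
```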
Environment
TensorRT Version : TensorRT-7.2.2.3
GPU Type : 1050 Ti
Nvidia Driver Version : 440.82
CUDA Version : 10.2
CUDNN Version : 8
Operating System + Version : Ubuntu 18.04
Relevant Files
Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
Steps To Reproduce
Please include:
Exact steps/commands to build your repro
Exact steps/commands to run your repro
Full traceback of errors encountered
What's more, the TensorRT version can't be changed because of project requirements, and the model can't be shared because it is confidential.
detailed_logs.zip (104.4 KB)
Here are the detailed logs.
Hi,
Based on the logs, it looks like you were able to generate the TRT engine successfully. Please make sure you're handling the data types correctly in your inference script.
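For example (just a sketch, not taken from your script; `allocate_buffers` is a hypothetical helper), the host and device buffers can be sized from the dtype the engine actually reports instead of hard-coding `np.float32`:
```
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes a default CUDA context

def allocate_buffers(engine, max_batch_size=1):
    # Size every buffer from what the engine reports (dtype and per-sample
    # shape) rather than assuming FP32; `engine` is a deserialized ICudaEngine.
    host_bufs, dev_bufs, bindings = [], [], []
    for i in range(engine.num_bindings):
        dims = tuple(engine.get_binding_shape(i))[1:]    # drop the (dynamic) batch dim
        dtype = trt.nptype(engine.get_binding_dtype(i))  # engine's dtype, not a guess
        size = int(np.prod(dims)) * max_batch_size
        host_mem = cuda.pagelocked_empty(size, dtype)
        dev_mem = cuda.mem_alloc(host_mem.nbytes)
        host_bufs.append(host_mem)
        dev_bufs.append(dev_mem)
        bindings.append(int(dev_mem))
    return host_bufs, dev_bufs, bindings
```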
(Embedded GitHub issue: opened 31 May 2021, closed 25 Jan 2022; labels: Release: 7.x, triaged, Runtime: Error)
## Description
I'm using a dynamic-shape TRT engine generated from trtexec to do serving, but I met this error:
```
[TensorRT] ERROR: ../rtSafe/cuda/cudaConvolutionRunner.cpp (483) - Cudnn Error in executeConv: 3 (CUDNN_STATUS_BAD_PARAM)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
```
## Environment
**TensorRT Version**: 7.2.3.4
**NVIDIA GPU**: 2080 ti
**NVIDIA Driver Version**: 460.73.01
**CUDA Version**: 11.2
**CUDNN Version**: 8
**Operating System**: Ubuntu 18.04
**Python Version (if applicable)**: 3.6.9
**Tensorflow Version (if applicable)**: 2.4.0
**PyTorch Version (if applicable)**:
**Baremetal or Container (if so, version)**:
## Relevant Files
```
import tensorrt as trt
import pycuda.driver as cuda
import threading, numpy as np
import functools, operator

INITED = False
LOCK = threading.Lock()


def init():
    global TRT_LOGGER, runtime, cfx, INITED
    LOCK.acquire()
    if not INITED:
        TRT_LOGGER = trt.Logger(trt.Logger.INFO)
        trt.init_libnvinfer_plugins(TRT_LOGGER, '')
        runtime = trt.Runtime(TRT_LOGGER)
        cfx = cuda.Device(0).make_context()
        INITED = True
    LOCK.release()


class TRTInference:
    def __init__(self, trt_engine_path, max_batch_size=1):
        init()
        stream = cuda.Stream()
        # deserialize engine
        with open(trt_engine_path, 'rb') as f:
            buf = f.read()
            engine = runtime.deserialize_cuda_engine(buf)
        context = engine.create_execution_context()
        # prepare buffer
        host_inputs = []
        cuda_inputs = []
        host_outputs = []
        cuda_outputs = []
        bindings = []
        self.max_batch_size = max_batch_size
        for binding in engine:
            dim = engine.get_binding_shape(binding)[1:]
            size = functools.reduce(operator.mul, dim) * self.max_batch_size
            host_mem = cuda.pagelocked_empty(size, np.float32)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(cuda_mem))
            if engine.binding_is_input(binding):
                self.input_dim = dim
                self.input_unit_size = functools.reduce(operator.mul, dim)
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:
                self.output_dim = dim
                self.output_unit_size = functools.reduce(operator.mul, dim)
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)
        # store
        self.stream = stream
        self.context = context
        self.engine = engine
        self.host_inputs = host_inputs
        self.cuda_inputs = cuda_inputs
        self.host_outputs = host_outputs
        self.cuda_outputs = cuda_outputs
        self.bindings = bindings

    def infer(self, data):
        threading.Thread.__init__(self)
        batch_size = data.shape[0]
        data = np.ravel(data)
        cfx.push()
        # restore
        stream = self.stream
        context = self.context
        context.set_binding_shape(0, (batch_size, *self.input_dim))
        host_inputs = self.host_inputs
        cuda_inputs = self.cuda_inputs
        host_outputs = self.host_outputs
        cuda_outputs = self.cuda_outputs
        bindings = self.bindings
        # read image
        np.copyto(host_inputs[0][:batch_size * self.input_unit_size], data)
        # inference
        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
        context.execute_async(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
        stream.synchronize()
        cfx.pop()
        return np.reshape(host_outputs[0][:batch_size * self.output_unit_size],
                          (batch_size, *self.output_dim))

    def destroy(self):
        cfx.pop()
```
Engine model: https://drive.google.com/file/d/1es2HktMGl4_murFvmRjE6OG1ki_Tm_dl/view?usp=sharing
## Steps To Reproduce
1. Use the model I provided
2. Use the `TRTInference` class above
3. Run the following:
```
for i in range(1, 65):
model.infer(np.ones((i, 260, 260, 15), dtype=np.float32))
```
You will see that some batch sizes work well but others do not, which is very weird. For the engine I generated, the min, opt, and max batch sizes are 1, 4, and 64 respectively. I also tried different combinations of batch sizes, but the behavior seems random: there is always some batch size that fails. Very strange!
I also tried some other model structures, but got the same results.
However, if you use `trtexec --loadEngine` and set the shape to a fixed one, all batch sizes work.
One side note: this model is from tensorflow -> onnx -> trt
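One thing worth double-checking (an assumption on my part, not something confirmed above): the `infer` method calls `execute_async`, but a dynamic-shape, explicit-batch engine is driven through `execute_async_v2` once every input binding shape has been set. A minimal sketch of that call pattern, reusing buffers prepared as in `TRTInference` (`infer_dynamic` is just an illustrative name, and the output is assumed to be binding 1):
```
import numpy as np
import pycuda.driver as cuda

def infer_dynamic(context, bindings, host_in, dev_in, host_out, dev_out,
                  stream, data):
    # Resolve the dynamic batch dimension before launching the kernels.
    batch = data.shape[0]
    context.set_binding_shape(0, (batch, *data.shape[1:]))  # input is binding 0
    assert context.all_binding_shapes_specified

    np.copyto(host_in[:data.size], data.ravel())
    cuda.memcpy_htod_async(dev_in, host_in, stream)
    # execute_async_v2 is the explicit-batch call; plain execute_async takes a
    # batch_size argument that only applies to implicit-batch engines.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_out, dev_out, stream)
    stream.synchronize()

    out_shape = tuple(context.get_binding_shape(1))  # shape resolved for this batch
    return np.reshape(host_out[:int(np.prod(out_shape))], out_shape)
```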
For your reference, please refer to the following samples.
Thank you.