[E] 2: [ltWrapper.cpp::setupHeuristic::349] Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCESS failed. )


I applied the two patches that NVIDIA provides on its official website for CUDA 10.2, but they only fix converting the model from ONNX to TRT; the error still occurs when evaluating with TensorRT. Even NVIDIA's own Python sample (network_api_pytorch_mnist) hits this error after two epochs. The C++ program works, but the Python one does not.


TensorRT Version:
GPU Type: Tesla V100
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: 8.4.1
Operating System + Version: 18.04
Python Version (if applicable): 3.7.10
PyTorch Version (if applicable): 1.8.1
Baremetal or Container (if container which image + tag): Docker 20.10.7

Relevant Files

Program here:
import torch
import torchvision.models as models
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import time

USE_FP16 = True
BATCH_SIZE = 32  # not defined in the original post; assumed from the 32 passed to trtexec below
resnext50 = models.resnext50_32x4d(num_classes=10)
dummy_input = torch.randn([BATCH_SIZE, 3, 224, 224], dtype=torch.float16)
# .half() casts the weights to FP16 so they match the FP16 dummy input during export
resnext50, dummy_input = resnext50.cuda().half(), dummy_input.cuda()
torch.onnx.export(resnext50, dummy_input, 'resnext50.onnx', verbose=False)
os.system(r'./trtexec --onnx=resnext50.onnx --saveEngine=resnext50.trt --explicitBatch=32 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16')

target_dtype = np.float16 if USE_FP16 else np.float32
f = open("resnext50.trt", "rb")
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]

stream = cuda.Stream()
preprocessed_inputs = np.array([img.transpose([2, 0, 1]) for img in input_batch])  # (BATCH_SIZE, 224, 224, 3) -> (BATCH_SIZE, 3, 224, 224)

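for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # Explicit-batch engines (as built from ONNX) run with execute_async_v2;
    # the implicit-batch call is kept commented out for reference.
    context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()  # wait for the async copies and inference before reading the timer
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')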


Enabling the cuBLAS tactic source may help.
Please refer,
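
Separately from the reference above, here is a minimal sketch of one way the cuBLAS tactic sources might be enabled when rebuilding the engine. It assumes your trtexec build supports the --tacticSources flag (this depends on the TensorRT version, which the post does not state). build_trtexec_cmd is a hypothetical helper name, not part of TensorRT.

```python
import subprocess


def build_trtexec_cmd(onnx_path, engine_path, fp16=True,
                      tactic_sources=("+CUBLAS", "+CUBLAS_LT")):
    """Return a trtexec argument list (hypothetical helper, not a TensorRT API).

    Assumes the installed trtexec accepts --tacticSources; a leading "+"
    adds a library to the default tactic sources, "-" removes it.
    """
    cmd = ["./trtexec", "--onnx=" + onnx_path, "--saveEngine=" + engine_path]
    if fp16:
        cmd.append("--fp16")
    if tactic_sources:
        cmd.append("--tacticSources=" + ",".join(tactic_sources))
    return cmd


def rebuild_engine(onnx_path, engine_path):
    """Run trtexec and raise CalledProcessError if the build fails."""
    subprocess.run(build_trtexec_cmd(onnx_path, engine_path), check=True)


# Show the command without running it.
print(" ".join(build_trtexec_cmd("resnext50.onnx", "resnext50.trt")))
```

Passing the arguments as a list to subprocess.run, instead of a single string to os.system, avoids shell quoting problems and surfaces a failed build as an exception.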



Also, it is better to avoid using PyTorch on the GPU and PyCUDA together. Instead of allocating device memory with PyCUDA, you can pass torch tensors directly to TensorRT: the data_ptr() method returns the tensor's device memory address (see torch.Tensor.data_ptr in the PyTorch 1.12 documentation).
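
A rough sketch of what that could look like, assuming an explicit-batch engine with one input and one output binding (as in the program above). make_bindings and infer are hypothetical helper names, and `context` is assumed to be the execution context created from the deserialized engine.

```python
import torch


def make_bindings(tensors):
    """Return raw memory addresses, in binding order, for contiguous tensors."""
    for t in tensors:
        if not t.is_contiguous():
            raise ValueError("TensorRT bindings need contiguous buffers")
    return [int(t.data_ptr()) for t in tensors]


def infer(context, d_input, d_output):
    """Run one inference on the current torch CUDA stream.

    Assumption: `context` is a TensorRT IExecutionContext whose bindings are
    (input, output) and whose shapes and dtypes match the tensors passed in.
    """
    stream = torch.cuda.current_stream()
    context.execute_async_v2(make_bindings([d_input, d_output]), stream.cuda_stream)
    stream.synchronize()  # make sure d_output is fully written before use
    return d_output


# Usage sketch (needs a GPU and the `context` from the program above):
# d_input = torch.randn(32, 3, 224, 224, dtype=torch.float16, device="cuda")
# d_output = torch.empty(32, 10, dtype=torch.float16, device="cuda")
# result = infer(context, d_input, d_output).cpu().numpy()
```

Because PyTorch owns both the CUDA context and the buffers here, pycuda.autoinit and cuda.mem_alloc are no longer needed.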

Thank you.