Description
I applied the two patches provided on the official NVIDIA website for CUDA 10.2, but they only help when converting the model from ONNX to TensorRT; the issue still occurs when evaluating with TensorRT. Even when I run TensorRT's own Python sample (network_api_pytorch_mnist), the issue occurs after two epochs. The C++ programs work, but the Python ones do not.
Environment
TensorRT Version: 8.4.1.5
GPU Type: Tesla V100
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: 8.4.1
Operating System + Version: 18.04
Python Version (if applicable): 3.7.10
PyTorch Version (if applicable): 1.8.1
Baremetal or Container (if container which image + tag): Docker 20.10.7
Relevant Files
Program here:
import torch
import torchvision.models as models
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import time
BATCH_SIZE = 32
USE_FP16 = True
resnext50 = models.resnext50_32x4d(num_classes=10)
dummy_input = torch.randn([BATCH_SIZE, 3, 224, 224], dtype=torch.float16)
resnext50.half()
resnext50, dummy_input = resnext50.cuda(), dummy_input.cuda()
torch.onnx.export(resnext50, dummy_input, 'resnext50.onnx', verbose=False)
os.system(r'./trtexec --onnx=resnext50.onnx --saveEngine=resnext50.trt --explicitBatch=32 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16')
target_dtype = np.float16 if USE_FP16 else np.float32
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
# Read the serialized engine and close the file once it is deserialized
with open("resnext50.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()
preprocessed_inputs = np.array([img.transpose([2, 0, 1]) for img in input_batch])  # (BATCH_SIZE, 224, 224, 3) -> (BATCH_SIZE, 3, 224, 224)
for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    context.execute_v2(bindings)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')
print(output[0])
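As a side note on the preprocessing step in the script, the per-image NHWC to NCHW conversion can probably be written as one vectorized transpose. This is only a minimal sketch using NumPy on random data, not a fix for the issue itself; the names `per_image` and `nchw` are made up for illustration. The `np.ascontiguousarray` call is there on the assumption that a raw host-to-device copy expects a C-contiguous host buffer.

```python
import numpy as np

BATCH_SIZE = 32  # same value as in the script above

# Random NHWC batch standing in for real images (illustrative data only).
input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(np.float16)

# Per-image transpose, equivalent to the list comprehension in the script.
per_image = np.array([img.transpose([2, 0, 1]) for img in input_batch])

# Single vectorized transpose over the whole batch: NHWC -> NCHW.
# transpose() alone returns a strided view, so ascontiguousarray makes
# an explicit C-contiguous copy of it.
nchw = np.ascontiguousarray(input_batch.transpose(0, 3, 1, 2))

print(nchw.shape, nchw.dtype, nchw.flags["C_CONTIGUOUS"])
```

Both arrays hold the same values in the same layout, so `nchw` could be passed wherever `preprocessed_inputs` is used.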