Description
Hi all, I'm running into a problem using TensorRT to optimize YOLACT++. TensorRT has no built-in support for DCNv2, so I found a DCNv2 TensorRT plugin on GitHub and converted my YOLACT++ model to a TRT engine successfully. However, when I run the engine for inference, an error occurs and I don't know what to do. Maybe something is wrong in the .cu file? I'm new to CUDA programming. Can you give me a hand?
The only change I made to the plugin was adding noexcept to its method signatures so it builds against TensorRT 8.0; everything else is unchanged.
DCNv2 Plugin
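For reference, this is roughly how the compiled plugin library gets loaded on the Python side before the engine is deserialized, plus a check that the DCNv2 creator is actually registered (a minimal sketch; the .so filename is from my build and may differ):

import ctypes
import tensorrt as trt

# Load the plugin shared library so its creator self-registers with TensorRT
# (the filename is specific to my build of the GitHub plugin).
ctypes.CDLL("./build/libdcnv2_plugin.so")

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

# The registry should now list a creator named "DCNv2" (see the log below).
names = [c.name for c in trt.get_plugin_registry().plugin_creator_list]
print("DCNv2" in names)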
Error Information
[TensorRT] VERBOSE: Registered plugin creator - ::BatchTilePlugin_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::BatchedNMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CoordConvAC version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CropAndResize version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CropAndResizeDynamic version 1
[TensorRT] VERBOSE: Registered plugin creator - ::DetectionLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::EfficientNMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::FlattenConcat_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::GenerateDetection_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::GridAnchor_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::GridAnchorRect_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::LReLU_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::NMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::NMSDynamic_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Normalize_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PriorBox_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ProposalLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Proposal version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ProposalDynamic version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Region_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Reorg_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ResizeNearest_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::RPROI_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ScatterND version 1
[TensorRT] VERBOSE: Registered plugin creator - ::SpecialSlice_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Split version 1
[TensorRT] VERBOSE: Registered plugin creator - ::DCNv2 version 1
Reading engine from file yolact.trt
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +300, GPU +0, now: CPU 361, GPU 5136 (MiB)
[TensorRT] INFO: Loaded engine size: 408 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 361 MiB, GPU 5136 MiB
[TensorRT] VERBOSE: Using cublasLt a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 791, GPU 5592 (MiB)
[TensorRT] VERBOSE: Using cuDNN as a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +439, GPU +172, now: CPU 1230, GPU 5764 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3110, GPU 7010 (MiB)
[TensorRT] VERBOSE: Deserialization required 16166066 microseconds.
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 3110 MiB, GPU 7010 MiB
[TensorRT] VERBOSE: Using cublasLt a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3110, GPU 7058 (MiB)
[TensorRT] VERBOSE: Using cuDNN as a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 3111, GPU 7066 (MiB)
[TensorRT] VERBOSE: Total per-runner device memory is 187544576
[TensorRT] VERBOSE: Total per-runner host memory is 167648
[TensorRT] VERBOSE: Allocated activation device memory of size 93570048
[TensorRT] ERROR: 1: [scaleRunner.cpp::execute::139] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
Traceback (most recent call last):
File "/home/ubuntu/.pycharm_helpers/pydev/pydevd.py", line 1668, in <module>
main()
File "/home/ubuntu/.pycharm_helpers/pydev/pydevd.py", line 1662, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/home/ubuntu/.pycharm_helpers/pydev/pydevd.py", line 1072, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/ubuntu/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/data1/xuduo/optimize/yolact_dir_0804/inference.py", line 128, in <module>
trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
File "/data1/xuduo/optimize/yolact_dir_0804/common.py", line 161, in do_inference_v2
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
File "/data1/xuduo/optimize/yolact_dir_0804/common.py", line 161, in <listcomp>
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
pycuda._driver.LogicError: cuMemcpyDtoHAsync failed: an illegal memory access was encountered
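One thing I have not ruled out is a mismatch between my host buffers and the engine's bindings, which I understand can fault at the dtoh copy exactly like this. Below is a standalone check I can run (a sketch; the binding names and shapes depend on the exported ONNX, and the plugin library must be loaded first as above):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
with open("yolact.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print every binding so host buffer sizes can be compared against the
# volumes TensorRT expects for each input and output.
for i in range(engine.num_bindings):
    print(i, engine.get_binding_name(i), engine.get_binding_shape(i),
          engine.get_binding_dtype(i), engine.binding_is_input(i))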
Environment
- Ubuntu 18.04
- GeForce RTX 2080 Ti
- Driver Version: 450.51.06
- NVIDIA-SMI 450.51.06
- CUDA Version: 11.0
- Python 3.6.8
- CMake 3.13.0
- CUDA Toolkit 11.0.221
- cuDNN 8.0.5
- TensorRT 8.0 EA (Early Access)
- ONNX 1.6
- onnx-tensorrt 8.0 EA
Relevant Files
inference.py
from PIL import Image
import numpy as np
import time
import os
import sys
import cv2
import glob
import torch
import tensorrt as trt
import config as cfg
import torch.nn.functional as F
sys.path.insert(1, os.path.join(sys.path[0], ".."))
import common
print("sys.path[0]", sys.path[0])
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
def GiB(val):
    return val * 1 << 30
def preprocess_image(path):
    # img [h, w, c]
    image = cv2.imread(path)  # was cv2.imread(image_name): use the parameter, not the global
    img_raw_data = cv2.imencode('.jpg', image)[1].tobytes()
    img_data = cv2.imdecode(np.asarray(bytearray(img_raw_data), dtype=np.uint8),
                            cv2.IMREAD_COLOR)
    frame = torch.from_numpy(img_data).cuda().float()
    # print(frame.size)
    batch = FastBaseTransform()(frame.unsqueeze(0))
    return batch
class FastBaseTransform(torch.nn.Module):
    """
    Transform that does all operations on the GPU for super speed.
    This doesn't support a lot of config settings and should only be used for production.
    Maintain this as necessary.
    """
    def __init__(self):
        super().__init__()
        self.mean = torch.Tensor(cfg.MEANS).float().cuda()[None, :, None, None]
        self.std = torch.Tensor(cfg.STD).float().cuda()[None, :, None, None]
        self.transform = cfg.resnet_transform

    def forward(self, img):
        self.mean = self.mean.to(img.device)
        self.std = self.std.to(img.device)
        # img assumed to be a pytorch BGR image with channel order [n, h, w, c]
        img_size = (cfg.max_size, cfg.max_size)
        # Convert the image to [n, c, h, w] layout
        img = img.permute(0, 3, 1, 2).contiguous()
        img = F.interpolate(img, img_size, mode='bilinear', align_corners=False)
        if self.transform.normalize:
            img = (img - self.mean) / self.std
        elif self.transform.subtract_means:
            img = (img - self.mean)
        elif self.transform.to_float:
            img = img / 255
        if self.transform.channel_order != 'RGB':
            raise NotImplementedError
        img = img[:, (2, 1, 0), :, :].contiguous()
        # Return value is in channel order [n, c, h, w] and RGB
        return img
# TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
if __name__ == "__main__":
    onnx_file_path = 'yolact.onnx'
    engine_file_path = "yolact.trt"
    # threshold = 0.5
    image_name = "/data1/xuduo/optimize/yolact_dir_0804/img/material_WholeBookQuestionData_7H1110B44850N_QuestionBookImage20210713083208_164_586_7H1110B44850N.jpg"
    if not os.path.exists(engine_file_path):
        print("no engine file")
        # conver_engine(onnx_file_path, engine_file_path)
        sys.exit(1)  # bail out instead of trying to read a missing engine
    print(f"Reading engine from file {engine_file_path}")
    # preprocess_time = 0
    # process_time = 0
    with open(engine_file_path, "rb") as f:
        runtime = trt.Runtime(TRT_LOGGER)
        engine = runtime.deserialize_cuda_engine(f.read())
    # Allocate buffers and create a CUDA stream.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    # Contexts are used to perform inference.
    context = engine.create_execution_context()
    batch = preprocess_image(image_name)
    np.copyto(inputs[0].host, batch.cpu().numpy().ravel())
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    print("Prediction results:", trt_outputs)