pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered

Description

Hi guys, I am having a problem when using TensorRT to optimize YOLACT++. As you know, TensorRT does not support DCNv2, so I found a DCNv2 TensorRT plugin on GitHub and converted my YOLACT++ model to a TRT engine successfully. But when I run the TRT engine for inference, the error below occurs and I don't know what to do. Maybe something is wrong in the .cu file? I am new to CUDA programming. Can you give me a hand?

The only change I made to this plugin was adding noexcept to its methods so it builds against TensorRT 8.0; nothing else was modified.
DCNv2 Plugin
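
If it helps, the registration of the custom creator can be double-checked from Python before deserializing the engine. This is only a minimal sketch: the ctypes load is needed only if the plugin was built as its own shared library (rather than compiled into libnvinfer_plugin), and the .so path is just a placeholder.

    import ctypes
    import tensorrt as trt

    # Placeholder path: wherever the compiled DCNv2 plugin library ended up.
    ctypes.CDLL("./libdcnv2_plugin.so")

    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
    trt.init_libnvinfer_plugins(TRT_LOGGER, '')

    # List every registered creator to confirm ::DCNv2 version 1 shows up,
    # matching the "Registered plugin creator - ::DCNv2 version 1" line in the log below.
    for creator in trt.get_plugin_registry().plugin_creator_list:
        print(creator.name, creator.plugin_version)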

Error information

[TensorRT] VERBOSE: Registered plugin creator - ::BatchTilePlugin_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::BatchedNMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CoordConvAC version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CropAndResize version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CropAndResizeDynamic version 1
[TensorRT] VERBOSE: Registered plugin creator - ::DetectionLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::EfficientNMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::FlattenConcat_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::GenerateDetection_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::GridAnchor_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::GridAnchorRect_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::LReLU_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::NMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::NMSDynamic_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Normalize_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PriorBox_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ProposalLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Proposal version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ProposalDynamic version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Region_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Reorg_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ResizeNearest_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::RPROI_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ScatterND version 1
[TensorRT] VERBOSE: Registered plugin creator - ::SpecialSlice_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Split version 1
[TensorRT] VERBOSE: Registered plugin creator - ::DCNv2 version 1
Reading engine from file yolact.trt
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +300, GPU +0, now: CPU 361, GPU 5136 (MiB)
[TensorRT] INFO: Loaded engine size: 408 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 361 MiB, GPU 5136 MiB
[TensorRT] VERBOSE: Using cublasLt a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 791, GPU 5592 (MiB)
[TensorRT] VERBOSE: Using cuDNN as a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +439, GPU +172, now: CPU 1230, GPU 5764 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3110, GPU 7010 (MiB)
[TensorRT] VERBOSE: Deserialization required 16166066 microseconds.
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 3110 MiB, GPU 7010 MiB
[TensorRT] VERBOSE: Using cublasLt a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3110, GPU 7058 (MiB)
[TensorRT] VERBOSE: Using cuDNN as a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 3111, GPU 7066 (MiB)
[TensorRT] VERBOSE: Total per-runner device memory is 187544576
[TensorRT] VERBOSE: Total per-runner host memory is 167648
[TensorRT] VERBOSE: Allocated activation device memory of size 93570048
[TensorRT] ERROR: 1: [scaleRunner.cpp::execute::139] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
Traceback (most recent call last):
  File "/home/ubuntu/.pycharm_helpers/pydev/pydevd.py", line 1668, in <module>
    main()
  File "/home/ubuntu/.pycharm_helpers/pydev/pydevd.py", line 1662, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/ubuntu/.pycharm_helpers/pydev/pydevd.py", line 1072, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/ubuntu/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/data1/xuduo/optimize/yolact_dir_0804/inference.py", line 128, in <module>
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
  File "/data1/xuduo/optimize/yolact_dir_0804/common.py", line 161, in do_inference_v2
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
  File "/data1/xuduo/optimize/yolact_dir_0804/common.py", line 161, in <listcomp>
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
pycuda._driver.LogicError: cuMemcpyDtoHAsync failed: an illegal memory access was encountered
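
The failure at cuMemcpyDtoHAsync is most likely just where the asynchronous error gets reported; the illegal access itself happens earlier on the stream (probably inside the plugin's enqueue). One way to narrow that down is to synchronize the stream right after the execute call, before any device-to-host copy. A sketch of that check, reusing the context/inputs/outputs/bindings/stream objects created in inference.py below:

    import pycuda.driver as cuda

    # Reuses context, inputs, outputs, bindings, stream from inference.py.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    ok = context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    print("execute_async_v2 returned:", ok)

    # If the illegal access comes from the DCNv2 kernel, this synchronize
    # fails before any output copy is even issued.
    stream.synchronize()

    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    stream.synchronize()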

Environment

  • Ubuntu 18.04
  • GeForce RTX 2080 Ti
  • Driver Version: 450.51.06
  • NVIDIA-SMI 450.51.06
  • CUDA Version: 11.0
  • Python 3.6.8
  • CMake 3.13.0
  • CUDA Toolkit 11.0.221
  • cuDNN 8.0.5
  • TensorRT 8.0 EA (Early Access)
  • ONNX 1.6
  • onnx-tensorrt 8.0 EA

Relevant Files

inference.py

    from PIL import Image
    import numpy as np
    import os
    import time
    import cv2
    import glob
    import config as cfg
    import torch
    import torch.nn.functional as F
    import tensorrt as trt
    import sys

    sys.path.insert(1, os.path.join(sys.path[0], ".."))
    import common

    print("sys.path[0]", sys.path[0])

    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
    trt.init_libnvinfer_plugins(TRT_LOGGER, '')
    EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)


    def GiB(val):
        return val * 1 << 30


    def preprocess_image(path):
        # Read the image as [h, w, c] BGR, round-tripping through JPEG encode/decode.
        image = cv2.imread(path)
        img_raw_data = cv2.imencode('.jpg', image)[1].tobytes()
        img_data = cv2.imdecode(np.asarray(bytearray(img_raw_data), dtype=np.uint8),
                                cv2.IMREAD_COLOR)
        frame = torch.from_numpy(img_data).cuda().float()
        # print(frame.size)
        batch = FastBaseTransform()(frame.unsqueeze(0))
        return batch


    class FastBaseTransform(torch.nn.Module):
        """
        Transform that does all operations on the GPU for super speed.
        This doesn't support a lot of config settings and should only be used for production.
        Maintain this as necessary.
        """

        def __init__(self):
            super().__init__()

            self.mean = torch.Tensor(cfg.MEANS).float().cuda()[None, :, None, None]
            self.std = torch.Tensor(cfg.STD).float().cuda()[None, :, None, None]
            self.transform = cfg.resnet_transform

        def forward(self, img):
            self.mean = self.mean.to(img.device)
            self.std = self.std.to(img.device)

            # img assumed to be a pytorch BGR image with channel order [n, h, w, c]

            img_size = (cfg.max_size, cfg.max_size)
            # Convert the image to [n, c, h, w] format
            img = img.permute(0, 3, 1, 2).contiguous()
            img = F.interpolate(img, img_size, mode='bilinear', align_corners=False)

            if self.transform.normalize:
                img = (img - self.mean) / self.std
            elif self.transform.subtract_means:
                img = (img - self.mean)
            elif self.transform.to_float:
                img = img / 255

            if self.transform.channel_order != 'RGB':
                raise NotImplementedError

            img = img[:, (2, 1, 0), :, :].contiguous()

            # Return value is in channel order [n, c, h, w] and RGB
            return img


    # TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
    if __name__ == "__main__":
        onnx_file_path = 'yolact.onnx'
        engine_file_path = "yolact.trt"

        # threshold = 0.5
        image_name = "/data1/xuduo/optimize/yolact_dir_0804/img/material_WholeBookQuestionData_7H1110B44850N_QuestionBookImage20210713083208_164_586_7H1110B44850N.jpg"
        if not os.path.exists(engine_file_path):
            print("no engine file")
            # conver_engine(onnx_file_path, engine_file_path)
        print(f"Reading engine from file {engine_file_path}")
        # preprocess_time = 0
        # process_time = 0
        f = open(engine_file_path, "rb")
        runtime = trt.Runtime(TRT_LOGGER)
        engine = runtime.deserialize_cuda_engine(f.read())

        # Allocate buffers and create a CUDA stream.
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        # Contexts are used to perform inference.
        context = engine.create_execution_context()
        batch = preprocess_image(image_name)
        np.copyto(inputs[0].host, batch.cpu().numpy().ravel())
        trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        print("预测结果:", trt_outputs)

Hi @970321535
It seems to be a memory-access-related issue.
Could you please share the repro steps so we can help better?

Thanks