Cuda Runtime (invalid resource handle) when using TensorRT and PyTorch (on GPU) simultaneously

Description

Hi,
I’m using TensorRT to run object detection (SSD). The engine runs well, but when I move trt_output to the GPU, I get Cuda Runtime (invalid resource handle).

Environment

TensorRT Version: 8.0.0.3
GPU Type: T4
Nvidia Driver Version: 450
CUDA Version: 11.0
CUDNN Version: 8.2.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.7
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.8
Baremetal or Container (if container which image + tag):

Steps To Reproduce

Here is my code.

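# Imports and helpers assumed from the TensorRT Python samples (not shown in the original post).
import pycuda.autoinit  # creates a CUDA context on import and makes it current
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)


class HostDeviceMem:
    """Pairs a page-locked host buffer with its device allocation."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem


class TRTInference: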
    def __init__(self, engine_file_path):
        self.engine_file_path = engine_file_path
        self.engine = self.get_engine()
        self.context = self.get_context()
        self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers()

    def get_engine(self):
        trt.init_libnvinfer_plugins(None, '')
        with open(self.engine_file_path, "rb") as f:
            return runtime.deserialize_cuda_engine(f.read())

    def get_context(self):
        return self.engine.create_execution_context()

    def allocate_buffers(self):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in self.engine:
            # input_shape() and output_shape() are helper methods not shown in this post.
            if self.engine.binding_is_input(binding):
                size = self.input_shape()
            else:
                size = self.output_shape(binding)
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if self.engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        return inputs, outputs, bindings, stream

    def do_inference(self, img, img_h, img_w, batch_size):
        self.inputs[0].host = img
        self.context.active_optimization_profile = 0
        self.context.set_binding_shape(0, (batch_size, img_h, img_w, 3))
        # Transfer data from CPU to the GPU.
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp.device, inp.host, self.stream)
        # Run inference on the same stream as the copies. The engine uses dynamic
        # shapes (explicit batch), so the batch size comes from set_binding_shape.
        self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
        # Transfer predictions back from the GPU.
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, self.stream)
        # Wait for the asynchronous work to finish before reading the host buffers.
        self.stream.synchronize()
        # Return only the host outputs.
        return [out.host for out in self.outputs]

At this point, trt_outputs is a list of NumPy arrays.
When I then run trt_outputs[0] = torch.from_numpy(trt_outputs[0]).cuda(), the error appears.
I believe the problem is related to the CUDA context. When I call .cuda(), PyTorch initializes a new CUDA context, and when TensorRT next runs inference it uses the wrong context. How can I deal with this issue?
Traceback:

[TensorRT] ERROR: 1: [convolutionRunner.cpp::checkCaskExecError<false>::440] Error Code 1: Cask (Cask Convolution execution)
[TensorRT] ERROR: 1: [apiCheck.cpp::apiCatchCudaError::17] Error Code 1: Cuda Runtime (invalid resource handle)
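
If the cause is a context mismatch as suspected above, one workaround often suggested when mixing PyCUDA with another CUDA-using framework is to make the PyCUDA context current explicitly around each TensorRT call and restore the previous one afterwards. A minimal sketch, assuming the context object exposes push() and pop() as pycuda.driver.Context does; the helper name active_context is hypothetical and not part of PyCUDA or TensorRT:

```python
from contextlib import contextmanager


@contextmanager
def active_context(ctx):
    """Make `ctx` the current CUDA context for the duration of the block.

    `ctx` is assumed to behave like pycuda.driver.Context, i.e. it has
    push() and pop() methods. (Hypothetical helper, for illustration only.)
    """
    ctx.push()
    try:
        yield ctx
    finally:
        # Always restore the previous context, even if inference raises.
        ctx.pop()
```

If the context was created by pycuda.autoinit, it is available as pycuda.autoinit.context, so the inference call could be wrapped as `with active_context(pycuda.autoinit.context): ...` while the torch .cuda() call stays outside the with block. Whether this removes the error on this particular setup has not been verified.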

@751180903,

Could you please give more details? Are you using PyTorch and PyCUDA together in the same module?

I have a similar problem with TensorFlow. I run a TensorRT model on the GPU and then want to process some of the results with a TensorFlow convolution. The first call to context.execute_async_v2 runs correctly, after which I run the convolution. Then, for the second frame, I call context.execute_async_v2 again and it returns the same two errors as in the original post.

This problem does not occur when I do not use the convolution.

I thought that running the convolution on the CPU would fix it, but then it returns a different error:

[02/21/2022-14:48:07] [TRT] [E] 1: [context.cpp::setStream::121] Error Code 1: Cudnn (CUDNN_STATUS_MAPPING_ERROR)
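
Another option sometimes suggested for both the PyTorch and the TensorFlow case is to avoid a separate PyCUDA context altogether and have PyCUDA use the device's primary context, which is the one the CUDA runtime API (and therefore these frameworks) normally uses. A sketch, assuming a PyCUDA version that provides Device.retain_primary_context(); the function name init_shared_context is hypothetical, and the driver module is passed in as a parameter only to keep the snippet self-contained:

```python
def init_shared_context(cuda_driver, device_index=0):
    """Make the device's primary CUDA context current and return it.

    `cuda_driver` is assumed to be the pycuda.driver module (or an object
    with the same init() / Device interface). Hypothetical helper.
    """
    # Initialise the CUDA driver API (pycuda.autoinit normally does this).
    cuda_driver.init()
    device = cuda_driver.Device(device_index)
    # Reuse the primary context instead of creating a new one.
    ctx = device.retain_primary_context()
    # Make it current on this thread so later PyCUDA calls use it.
    ctx.push()
    return ctx
```

This would take the place of `import pycuda.autoinit`, for example `import pycuda.driver as cuda` followed by `ctx = init_shared_context(cuda)` before the engine is deserialized and the buffers are allocated.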