[TensorRT] ERROR: 1: [resize.cu::performLinearKernelLaunch::457] Error Code 1: Cuda Runtime (invalid argument)

Description

Edit, 3 hours later: I found that the problem is caused by stream.use(). Commenting it out solves the problem.

Original problem:
I am trying to use CuPy to process the data and set the bindings to the CuPy data pointers, but when I run TensorRT inference I get the following error:
[TensorRT] ERROR: 1: [resize.cu::performLinearKernelLaunch::457] Error Code 1: Cuda Runtime (invalid argument).
While debugging, I also found NaN values in the outputs.

I am confused about how to use CuPy with TensorRT. Could you give an example?

My code is as follows:

def infer(self, input):
    # use pycuda context
    self.cfx.push()
    stream = cp.cuda.Stream(non_blocking=False)
    stream.use()
    bindings = []
    # the input is a list [cp.array(), cp.array()]
    for index, data in enumerate(input):
        self.context.set_binding_shape(index, data.shape)
        # bindings append the cupy data ptr
        bindings.append(int(data.data))
     
    # set output
    outputs = []
    for binding in self.engine:
        if not self.engine.binding_is_input(binding):
            size = trt.volume(
                self.context.get_binding_shape(self.engine.get_binding_index(binding))
            ) * self.engine.max_batch_size * 2
            device_mem = cp.cuda.alloc(size)
            bindings.append(int(device_mem))
            outputs.append(device_mem)
 
    self.context.execute_async(bindings=bindings, stream_handle=stream.ptr)
    results = []
    for idx, (output, shape) in enumerate(zip(outputs, self.output_shape)):
        if idx == 3:
            cpu = np.zeros([1, 1080, 1920], dtype=np.float16)
            output.copy_to_host(cpu.ctypes.data, 1080 * 1920 * 2)
            print(cpu)
    stream.synchronize()
    self.cfx.pop()

Environment

TensorRT Version: 8.0
GPU Type: RTX 2080 Ti
Nvidia Driver Version: 470
CUDA Version: 11.3
CUDNN Version: 8.2
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.9
PyTorch Version (if applicable): 1.11

Hi,

Could you please let us know the reason for using stream.use()? I believe stream.synchronize() should be enough.
For your reference:
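A minimal sketch of the difference, assuming CuPy's documented Stream semantics (use() or the context manager only select the stream that CuPy launches its own kernels on, while TensorRT only uses the handle you pass in explicitly):

import cupy as cp

stream = cp.cuda.Stream(non_blocking=False)

# Entering the stream as a context manager has the same effect as stream.use(),
# but only inside the block, so it does not leak into later calls.
with stream:
    x = cp.arange(8, dtype=cp.float32) * 2.0   # launched on `stream`

# TensorRT ignores CuPy's "current" stream; it only uses the raw handle, e.g.
#   self.context.execute_async(bindings=bindings, stream_handle=stream.ptr)
stream.synchronize()   # block the host until all work queued on `stream` finishes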

Thank you.

Sorry, I forgot to show the full solution. Actually, this problem was caused by PyCUDA.

  1. If I just delete stream.use(), the first inference returns the right result, but the second inference raises an error because the stream is not in use.
  2. I also found a size error in the output allocation and fixed it (see the sketch after this list).
  3. I finally solved the problem by deleting the cuda.init() and cuda.make_context() calls. I don’t know whether CuPy has any conflict with PyCUDA.
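A sketch of the size fix, assuming the standard TensorRT 8 Python API (get_binding_dtype / trt.nptype): with an explicit-batch engine the binding shape already contains the batch dimension, so the element count and dtype should come from the binding itself rather than a hard-coded * 2 (which assumes fp16):

import cupy as cp
import tensorrt as trt

# Hypothetical helper: allocate an output buffer whose element type matches the
# binding, so the size is volume * itemsize instead of a guessed byte count.
def alloc_output(engine, context, binding_name):
    idx = engine.get_binding_index(binding_name)
    shape = tuple(context.get_binding_shape(idx))
    dtype = trt.nptype(engine.get_binding_dtype(idx))   # e.g. np.float16, np.float32
    return cp.zeros(shape, dtype=dtype)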

Does CuPy have a default context? If you have any thoughts about this, please let me know. Thank you very much!
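My current understanding is that CuPy goes through the CUDA runtime API, which attaches to the device's primary context, while pycuda's make_context() pushes a brand-new context, so device pointers can end up belonging to different contexts. If I still needed PyCUDA, a minimal sketch of sharing the primary context instead (assuming PyCUDA's retain_primary_context() API) would be:

import pycuda.driver as cuda
import cupy as cp

cuda.init()
# Retain the device's primary context -- the one the CUDA runtime (and therefore
# CuPy) uses -- instead of creating a second context with make_context().
ctx = cuda.Device(0).retain_primary_context()
ctx.push()
try:
    x = cp.arange(4, dtype=cp.float32)   # this allocation lives in the primary context
    # ... run the TensorRT engine here; all device pointers share one context ...
finally:
    ctx.pop()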

My fixed code is as follows, and it finally gives the right result:

def infer(self, input):
    with cp.cuda.Stream(non_blocking=False) as stream:
        # bind the device pointers of the CuPy input arrays directly
        bindings = []
        for index, data in enumerate(input):
            self.context.set_binding_shape(index, data.shape)
            bindings.append(int(data.data))

        # allocate the outputs as CuPy arrays and bind their pointers as well
        outputs = []
        for binding in self.engine:
            if not self.engine.binding_is_input(binding):
                device_mem = cp.zeros(
                    self.context.get_binding_shape(self.engine.get_binding_index(binding)),
                    dtype=cp.float32)

                bindings.append(int(device_mem.data))
                outputs.append(device_mem)

        self.context.execute_async(bindings=bindings, stream_handle=stream.ptr)
        stream.synchronize()
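For completeness, pulling a result back to the host afterwards is just a copy from the bound CuPy array. This is only an illustration (the names are made up, and it assumes infer() is extended to return outputs):

import cupy as cp

outputs = runner.infer([frame0, frame1])   # hypothetical runner and inputs
result = cp.asnumpy(outputs[3])            # device -> host copy only for the binding you need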

By the way, CuPy works really well. I use CuPy for pre-processing and post-processing and got a 260% speed-up! Thanks for NVIDIA's work!
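As an illustration of what that pre-processing looks like (a made-up example, not my actual pipeline), everything stays on the GPU so the binding can point straight at the result:

import cupy as cp

# Illustrative pre-processing step: normalize and reorder on the GPU so there is
# no host round trip before binding the pointer to TensorRT.
def preprocess(frame_u8: cp.ndarray) -> cp.ndarray:
    x = frame_u8.astype(cp.float32) / 255.0             # uint8 -> float32 in [0, 1]
    x = cp.transpose(x, (2, 0, 1))[cp.newaxis, ...]     # HWC -> NCHW with batch dim
    return cp.ascontiguousarray(x)                      # TensorRT expects contiguous memory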

Thanks for updating the post with the root cause of the problem.
It really helps the community - thanks.
Good luck with your project.
