RuntimeError: CUDA error: invalid configuration argument

Description

I'm using TensorRT to run a Mask R-CNN model and PyTorch to postprocess the result. When the inference result contains more than 2 bounding boxes and I print the result (a GPU tensor), it raises an error: "RuntimeError: CUDA error: invalid configuration argument". However, I can print the tensor after I move it to the CPU. When the inference result contains fewer than 2 bounding boxes, I can print the tensor on both CPU and GPU.
Can anyone help?

Environment

Environment is a Docker image: nvcr.io/nvidia/tensorrt:21.09-py3
TensorRT Version: 8.0.3.0
GPU Type: T4
Nvidia Driver Version: 450.51.06
CUDA Version: 11.4
CUDNN Version:
Operating System + Version:
Python Version (if applicable): 3.8
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.10.0+cu113
Baremetal or Container (if container which image + tag):

Steps To Reproduce


from PIL import Image
import numpy as np
import torch
import ctypes
import tensorrt as trt

engine_path = 'engine.fp32'
image_path='COCO_val2014_000000002640.jpg'
original_image = np.array(Image.open(image_path).convert('RGB'))[:, :, ::-1]
dtype = torch.float32
outputs_tensor = {
    "scores": torch.zeros(size=(100, 1), dtype=dtype, device=torch.device('cuda')),
    "boxes": torch.zeros(size=(100, 4), dtype=dtype, device=torch.device('cuda')),
    "labels": torch.zeros(size=(100, 1), dtype=dtype, device=torch.device('cuda')),
    "masks": torch.zeros(size=(100, 1, 28, 28), dtype=dtype, device=torch.device('cuda')),
}

PLUGIN_LIBRARY = 'libmyplugins.so'
ctypes.CDLL(PLUGIN_LIBRARY)
print('init_libnvinfer_plugins success: ',
      trt.init_libnvinfer_plugins(None, "")
      )
with trt.Logger() as logger, trt.Runtime(logger) as runtime:
    with open(engine_path, mode='rb') as f:
        engine_bytes = f.read()
    engine = runtime.deserialize_cuda_engine(engine_bytes)
context = engine.create_execution_context()
input_tensor = torch.as_tensor(original_image.astype("float32")).to(torch.device('cuda'))

bindings = [None] * 5

bindings[engine.get_binding_index("images")] = input_tensor.contiguous().data_ptr()
for output_name, output_tensor in outputs_tensor.items():
    idx = engine.get_binding_index(output_name)
    bindings[idx] = output_tensor.data_ptr()

context.execute_async(1, bindings, torch.cuda.current_stream().cuda_stream)

print('shape: ', outputs_tensor['scores'].shape)
print('scores cpu: ', outputs_tensor['scores'].cpu()[:5])  # No Error


# When more than two elements are greater than the threshold, the following error is raised:
#   RuntimeError: CUDA error: invalid configuration argument
# When fewer than two elements are greater than the threshold, there is no error.
print('scores:', outputs_tensor['scores'][:5])
print(torch.randn((2, 4), device='cuda'))

Hi,
We recommend checking the sample links below in case of TF-TRT integration issues.
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#samples
https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#framework-integration
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#integrate-ovr
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#usingtftrt

If the issue persists, we recommend reaching out to the TensorFlow forum.
Thanks!
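
For the error described in the question, one way to narrow down which call is responsible is to run the script step by step and force a synchronization after each step. This matters because CUDA work is queued asynchronously, so an error can be reported by a later, unrelated call (here, possibly the print of a GPU tensor rather than the TensorRT execution itself). The following is only a minimal sketch under that assumption: `locate_failing_step` is a hypothetical helper, not part of PyTorch or TensorRT, and it is not certain that a synchronize call will surface this particular error.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def locate_failing_step(steps: List[Step], sync: Callable[[], None]) -> Optional[str]:
    """Run each named step, call sync() after it, and return the name of the
    first step at which a RuntimeError surfaces (None if all steps pass).

    Hypothetical debugging helper; `sync` is expected to block until queued
    GPU work has finished, e.g. torch.cuda.synchronize.
    """
    for name, fn in steps:
        try:
            fn()    # enqueue the work for this step
            sync()  # wait, so a pending error is attributed to this step
        except RuntimeError as err:
            print(f"error surfaced at step '{name}': {err}")
            return name
    return None


# Possible usage with the names from the script in the question
# (commented out because it needs the engine, plugin library, and a GPU):
#
# stream = torch.cuda.current_stream().cuda_stream
# steps = [
#     ("trt_execute", lambda: context.execute_async(1, bindings, stream)),
#     ("print_gpu", lambda: print(outputs_tensor["scores"][:5])),
# ]
# locate_failing_step(steps, torch.cuda.synchronize)
```

If the error is attributed to the TensorRT step, the custom plugin in libmyplugins.so would be a reasonable place to look next. Running the script with the environment variable CUDA_LAUNCH_BLOCKING=1, as PyTorch's own CUDA error messages suggest, is another way to make the reported location more reliable.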