CUDA cask failure at execution for trt_maxwell_scudnn_128x32_relu_small_nn_v1.

Hi all

I tried to run a TensorRT engine for image inference inside a ROS callback function, but got an error log saying:

[TensorRT] ERROR: CUDA cask failure at execution for trt_maxwell_scudnn_128x32_relu_small_nn_v1.
[TensorRT] ERROR: cuda/caskConvolutionLayer.cpp (355) - Cuda Error in execute: 33 (invalid resource handle)
[TensorRT] ERROR: cuda/caskConvolutionLayer.cpp (355) - Cuda Error in execute: 33 (invalid resource handle)

I initialize the TensorRT engine, create the execution context, and allocate memory in the main thread, then receive images and run inference in the callback function. Inference fails there, although the same code runs inference on local pictures correctly in the main thread. My code looks like this:

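import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # assumed: creates the CUDA context used by the pycuda calls below
import rospy
from cv_bridge import CvBridge
from PIL import Image as pilImage
from sensor_msgs.msg import Image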

def callback(data):
    cv_bridge = CvBridge()
    input_img_np = cv_bridge.imgmsg_to_cv2(data, desired_encoding="rgb8")
    input_img = pilImage.fromarray(input_img_np)
    global engine, inputs, outputs, bindings, stream, context
    np.copyto(inputs[0].host, input_img)  # Copy the image into the page-locked input buffer.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]  # Transfer input data to the GPU.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)  # Run inference. The error occurs here; the call returns False.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]  # Transfer predictions back from the GPU.
    stream.synchronize()  # Wait for the asynchronous copies to finish before reading the host buffers.
    return [out.host for out in outputs]

def loadTrtEngine(engine_file_name, logger):
    with open(engine_file_name, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            # HostDeviceMem (not shown here) pairs a host buffer with its device buffer.
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        return engine, inputs, outputs, bindings, stream

if __name__ == '__main__':
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    trt.init_libnvinfer_plugins(TRT_LOGGER, '')
    engine, inputs, outputs, bindings, stream = loadTrtEngine('./frozen_yolo_model_aw.engine', TRT_LOGGER)
    context = engine.create_execution_context()
    rospy.Subscriber('/pandora_raw0', Image, callback, queue_size=1, buff_size=2 ** 24)

I found that running inference in the main thread avoids this problem, but when implemented in ROS, I have to run inference in the callback function. How can I fix this? Thanks for any advice.
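
Since inference already works in the main thread, one possible workaround is to let the callback only hand the image over and keep every TensorRT call in the main thread. The sketch below shows that hand-off with a standard `queue.Queue`. It is an illustration under assumptions, not the original code: `run_inference` is a hypothetical stand-in for the copy/execute/copy steps, and `image_queue`, `main_loop`, `should_stop`, and `handle_result` are names introduced here.

```python
import queue

# Holds at most one pending image, mirroring queue_size=1 on the subscriber.
image_queue = queue.Queue(maxsize=1)


def callback(data):
    """Runs on the subscriber thread: only enqueue, never touch the GPU."""
    try:
        image_queue.put_nowait(data)
    except queue.Full:
        pass  # The main thread is still busy, so this frame is dropped.


def run_inference(data):
    """Hypothetical stand-in for the TensorRT copy/execute/copy steps."""
    return "result for {}".format(data)


def main_loop(should_stop, handle_result):
    """Runs on the main thread, where the engine and context were created."""
    while not should_stop():
        try:
            data = image_queue.get(timeout=0.1)
        except queue.Empty:
            continue  # No image yet; check the stop condition again.
        handle_result(run_inference(data))
```

In a rospy node, `should_stop` could be `rospy.is_shutdown`, and this loop would take the place of `rospy.spin()`.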

Ubuntu: 16.04 64bit
CUDA: 10.0
CUDNN: 7.5.0
TensorRT: 5.1.5
python: 3.6
ROS: kinetic
GPU: GeForce GTX 1080 with Max-Q
NVIDIA Driver Version: 430.40
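
Error 33 (invalid resource handle) inside a callback is commonly attributed to the CUDA context created in the main thread not being current on the thread that runs the callback. Assuming that is the cause here, a workaround often suggested is to keep a handle to the context (for example, one returned by `cuda.Device(0).make_context()`), push it at the start of the callback, and pop it at the end. The helper below is a hypothetical sketch of that pattern: `active_context` is not part of pycuda and relies only on the `push()` and `pop()` methods that `pycuda.driver.Context` provides.

```python
from contextlib import contextmanager


@contextmanager
def active_context(ctx):
    """Make `ctx` current on the calling thread for the duration of the block."""
    ctx.push()
    try:
        yield ctx
    finally:
        ctx.pop()  # Always undo the push, even if inference raises.


# Assumed usage inside the ROS callback, with cuda_ctx created in the main thread:
#     with active_context(cuda_ctx):
#         context.execute_async(...)
```

Wrapping the push/pop pair in a context manager keeps the two calls balanced when the body raises, which matters because an unbalanced context stack tends to cause follow-up errors at shutdown.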


Please refer to the sample below for deep learning inference nodes for ROS: