TensorRT inference context in ROS callback



I would like to run a TensorRT inference engine in a ROS callback. If CUDA is auto-initialised and the buffers are allocated in the main thread, inference in the callback fails with the errors below:

[TensorRT] ERROR: ../rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception

If I manually initialise pycuda in the callback as below,

def ros_callback(msg):
    device = cuda.Device(0)
    context = device.make_context()


    del context

the context created is not what I want from a deserialized engine:

with engine.create_execution_context() as context:


Could anyone help with my confusion?
Many thanks.


TensorRT Version:
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: Ubuntu 18.04, JetPack 4.4
Python Version (if applicable): 3.6.9
ROS Version: Melodic

Hi @zdai257,
Can you please share your model and script so that I can try to reproduce the issue?

Also, regarding the context: an engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. For example, you can process images in parallel CUDA streams using one engine and one context per stream. Each context will be created on the same GPU as the engine.
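As a rough sketch of the "one engine, one context per stream" idea, assuming `engine` is an already deserialized TensorRT engine and `stream_factory` is a zero-argument callable such as pycuda's `cuda.Stream` (the helper name `make_context_stream_pairs` is hypothetical, not part of TensorRT):

```python
def make_context_stream_pairs(engine, stream_factory, count):
    """Create `count` (execution context, stream) pairs that share one engine.

    Assumptions: `engine` exposes create_execution_context(), as a
    deserialized TensorRT engine does, and `stream_factory` is a
    zero-argument callable such as pycuda.driver.Stream.
    """
    pairs = []
    for _ in range(count):
        context = engine.create_execution_context()  # shares the engine's weights
        stream = stream_factory()                    # independent stream per context
        pairs.append((context, stream))
    return pairs

# Hypothetical usage on a machine with TensorRT and pycuda installed:
#   pairs = make_context_stream_pairs(engine, cuda.Stream, 2)
```

Each pair could then be driven independently, with one inference task per stream.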


Thanks for your reply.
I am afraid I cannot share the model, but the issue can be reproduced with ANY serialized engine. The error occurs when the node subscribes to a ROS Empty ("/empty_topic") message and the callback runs.

import os
import numpy as np
import tensorrt as trt
import rospy
from std_msgs.msg import Empty
import pycuda.driver as cuda
import pycuda.autoinit

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

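def allocate_buffers(eng):
    inputs_list = []
    inputs_rand = []
    outputs_list = []
    bindings_list = []
    stream0 = cuda.Stream()
    for binding in eng:
        size = trt.volume(eng.get_binding_shape(binding)) * eng.max_batch_size
        dtype = np.float32
        # Page-locked host buffer and a device buffer of the same size
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # TensorRT needs the device pointer of every binding, in binding order
        bindings_list.append(int(device_mem))
        if eng.binding_is_input(binding):
            inputs_list.append(HostDeviceMem(host_mem, device_mem))
            # Assumption: inputs_rand holds random test data, one array per input
            inputs_rand.append(np.random.random_sample(size).astype(dtype))
        else:
            outputs_list.append(HostDeviceMem(host_mem, device_mem))
    return inputs_list, inputs_rand, outputs_list, bindings_list, stream0

def do_inference(ctx, bds, inpts, outs, strm, batch_size=1):
    # Copy inputs to the GPU, run inference, and copy outputs back, all on one stream
    [cuda.memcpy_htod_async(inp.device, inp.host, strm) for inp in inpts]
    ctx.execute_async(batch_size=batch_size, bindings=bds, stream_handle=strm.handle)
    [cuda.memcpy_dtoh_async(out.host, out.device, strm) for out in outs]
    # Wait for the asynchronous work to finish before reading the host buffers
    strm.synchronize()
    return [out.host for out in outs]

def ros_callback(msg):
    with engine.create_execution_context() as context:
        # Fill the page-locked input buffers with the test data
        for i in range(len(inputs0)):
            np.copyto(inputs[i].host, inputs0[i].ravel())
        predict = do_inference(context, bindings, inputs, outputs, stream)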

with open('ANY.engine', 'rb') as f, trt.Runtime(trt.Logger(trt.Logger.VERBOSE)) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
inputs, inputs0, outputs, bindings, stream = allocate_buffers(engine)

node_h = rospy.init_node('main_node', anonymous=False)
sub_h = rospy.Subscriber("/empty_topic", Empty, ros_callback)
rospy.spin()  # keep the node alive so that callbacks are delivered

Hi @zdai257,
Can you share the verbose log with us?

Hi @AakankshaS

Sorry for the late reply. The log after the first ROS callback reads:

[TensorRT] VERBOSE: Deserialize required 3228122 microseconds.
Binding image_1 has dimension = (1, 1, 64, 256, 3)
Binding image_2 has dimension = (1, 1, 64, 256, 3)
Binding imu_data has dimension = (1, 10, 6)
Binding delta_pose has dimension = (1, 1, 6)

[TensorRT] VERBOSE: myelinAllocCb allocated GPU 139136 bytes at 0x23919a000
[TensorRT] ERROR: ../rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
[TensorRT] VERBOSE: myelinFreeCb freeing GPU at 0x23919a000


For this specific error, I suggest checking the link below.
The issue might be a driver version compatibility problem.



The program is running on a Jetson AGX Xavier, which doesn’t use the NVIDIA driver but L4T. Also, inference does work if it loops in the main thread. The error only occurs in the ROS callback, so I suspect it’s a context/threading problem.
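One way to check the threading suspicion is to log which thread the callback runs on. A minimal diagnostic sketch using only the standard library (`describe_thread` is a hypothetical helper name); it could be called at the top of the ROS callback and once in the main script:

```python
import threading

def describe_thread():
    """Return (thread name, True if called from the main thread)."""
    current = threading.current_thread()
    return current.name, current is threading.main_thread()
```

If the callback reports a non-main thread while the CUDA context and buffers were created on the main thread, that would be consistent with a context/threading mismatch.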

Thanks anyway.

Have you solved this problem?

Hi, I have the same issue when doing inference in a ROS callback.
The engine is built only on the first callback and saved to a global variable.
I fixed some context issues by using ctx.push() … and then ctx.pop(),
but GPU memory usage keeps growing until an OOM happens.
What should I do?
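For reference, the push/pop pattern mentioned above can be wrapped so that the pop always happens, even if inference raises. This is only a sketch: `active_cuda_context` is a hypothetical helper, and `cuda_ctx` is assumed to be a pycuda context created once at startup (for example with `cuda.Device(0).make_context()`), with the engine, execution context, and buffers also created once rather than on every callback.

```python
from contextlib import contextmanager

@contextmanager
def active_cuda_context(cuda_ctx):
    """Make cuda_ctx current on the calling thread for the duration of the block."""
    cuda_ctx.push()      # bind the context to this (callback) thread
    try:
        yield cuda_ctx
    finally:
        cuda_ctx.pop()   # always unbind, even if the body raised

# Hypothetical usage inside a ROS callback:
#   with active_cuda_context(cuda_ctx):
#       predict = do_inference(context, bindings, inputs, outputs, stream)
```

An unbalanced push without a matching pop leaves the context stack in a bad state, so the try/finally pairing is the main point of the sketch. It does not by itself explain the memory growth described above.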

I’m noticing this error message has been reported elsewhere:

I am getting this error when I have TensorRT and another pycuda program both accessing the GPU (one after the other) in a ROS callback, as documented here

@zdai257 I may not have the same setup as you, but I am successfully running TensorRT inference (and other pycuda calls to the GPU) from a ROS callback, sort of.

Here’s my hacky workaround. I would love for NVIDIA devs to comment, as I’m sure there’s a better way:

  1. Write a main loop, as is commonly done with while not rospy.is_shutdown(). Just before this while loop, set up your TensorRT engine, allocate all memory, create your stream, etc. Then create your execution context from the engine, for example with a with statement: with myengine.create_execution_context() as context:

  2. Create a very simple callback that takes the incoming data and adds it to a thread-safe queue.

  3. Back in your while loop, where your context exists, wait for this queue to populate with data, e.g. mydata.get(block=True, timeout=x). This call blocks, so there is no need to sleep inside your while loop.

  4. With your context, memory allocation, engine, stream, and everything else set up in the same thread before your while loop, once the queue receives a message you simply call the function that runs inference, just like in any regular Python TensorRT program. The secret sauce of this approach is that the multi-threaded ROS callbacks are abstracted away behind the queue.

So the extremely simplified pseudo-code looks like:

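import queue

mydata = queue.Queue()  # thread-safe queue shared by the callback and the main loop

def ros_callback(data):
    mydata.put(data)  # only hand the message over; no CUDA/TensorRT calls here

def main_loop():
    # create ros subscribers, publishers, queue, etc...
    # setup engine, allocate memory, create cuda stream, etc...
    with myengine.create_execution_context() as context:
        while not rospy.is_shutdown():
            try:
                data = mydata.get(block=True, timeout=x)
            except queue.Empty:
                continue  # timed out; re-check rospy.is_shutdown()
            do_inference(context, data)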
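A self-contained version of the same idea, with the ROS and TensorRT parts replaced by stand-ins so it runs anywhere. The names `run_inference_loop`, `infer`, and `should_stop` are hypothetical: in a real node `infer` would wrap the TensorRT call and `should_stop` would be `rospy.is_shutdown`.

```python
import queue

def run_inference_loop(work_queue, infer, should_stop, timeout=0.1):
    """Consume queued items on the calling thread and run `infer` on each.

    Producers (e.g. ROS callbacks) only call work_queue.put(...); all
    inference happens on the thread that calls this function.
    """
    results = []
    while not should_stop():
        try:
            item = work_queue.get(block=True, timeout=timeout)
        except queue.Empty:
            continue  # nothing arrived within the timeout; re-check should_stop()
        results.append(infer(item))
    return results

# Hypothetical usage in a ROS node, after engine/context/buffers are set up:
#   work_queue = queue.Queue()
#   rospy.Subscriber("/empty_topic", Empty, work_queue.put)
#   run_inference_loop(work_queue, my_infer, rospy.is_shutdown)
```

The timeout keeps the loop responsive to shutdown even when no messages arrive.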


I was thinking of a similar approach: leave the TensorRT engine in a main loop and have ROS only write to a buffer. This could be a useful workaround, though less so for timing-critical tasks. I hope there will be more support for this.
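For timing-sensitive use, one possible mitigation is to keep only the newest message so that inference never works through a backlog. This is a sketch under the assumption that dropping stale messages is acceptable; `put_latest` is a hypothetical helper, not a ROS or TensorRT API.

```python
import queue

def put_latest(q, item):
    """Put `item` into a bounded queue, discarding the oldest entry if it is full."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stale entry to make room
            except queue.Empty:
                pass  # the consumer took it first; just retry the put

# Hypothetical usage: the callback writes into a one-slot buffer
#   latest = queue.Queue(maxsize=1)
#   def ros_callback(msg):
#       put_latest(latest, msg)
```

With `maxsize=1` the main loop always sees the most recent message, at the cost of skipping any that arrived while inference was busy.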

I basically dumped TensorRT as it required explicit batch size.

This worked very well for me!