I would like to run a TensorRT inference engine in a ROS callback. If CUDA is auto-initialised and the buffers are allocated in the main thread, it complains during inference as below
Hi @zdai257,
Can you please share your model and script so that I can try reproducing the issue?
Also, to address the context question: an engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. For example, you can process images in parallel CUDA streams using one engine and one context per stream. Each context will be created on the same GPU as the engine.
Thanks!
Thanks for your reply.
I am afraid I cannot share the model, but the issue can be reproduced with ANY serialized engine. The error occurs when the node receives a ROS Empty message on “/empty_topic”.
import os
import numpy as np
import tensorrt as trt
import rospy
from std_msgs.msg import Empty
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context on the importing (main) thread


class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


def allocate_buffers(eng):
    inputs_list = []
    inputs_rand = []
    outputs_list = []
    bindings_list = []
    stream0 = cuda.Stream()
    for binding in eng:
        size = trt.volume(eng.get_binding_shape(binding)) * eng.max_batch_size
        dtype = np.float32
        # Page-locked host buffer and matching device buffer for this binding
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings_list.append(int(device_mem))
        if eng.binding_is_input(binding):
            inputs_list.append(HostDeviceMem(host_mem, device_mem))
            inputs_rand.append(np.ascontiguousarray(np.random.random_sample(eng.get_binding_shape(binding))))
        else:
            outputs_list.append(HostDeviceMem(host_mem, device_mem))
    return inputs_list, inputs_rand, outputs_list, bindings_list, stream0


def do_inference(ctx, bds, inpts, outs, strm, batch_size=1):
    [cuda.memcpy_htod_async(inp.device, inp.host, strm) for inp in inpts]
    ctx.execute_async(batch_size=batch_size, bindings=bds, stream_handle=strm.handle)
    [cuda.memcpy_dtoh_async(out.host, out.device, strm) for out in outs]
    strm.synchronize()
    return [out.host for out in outs]


def ros_callback(msg):
    # Runs on a rospy callback thread, not on the thread that imported pycuda.autoinit
    with engine.create_execution_context() as context:
        for i in range(len(inputs0)):
            np.copyto(inputs[i].host, inputs0[i].ravel())
        predict = do_inference(context, bindings, inputs, outputs, stream)


with open('ANY.engine', 'rb') as f, trt.Runtime(trt.Logger(trt.Logger.VERBOSE)) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
inputs, inputs0, outputs, bindings, stream = allocate_buffers(engine)
node_h = rospy.init_node('main_node', anonymous=False)
sub_h = rospy.Subscriber("/empty_topic", Empty, ros_callback)
rospy.spin()
The program runs on a Jetson AGX Xavier, which uses L4T rather than the standard NVIDIA driver. Inference does work if it loops in the main thread. The error only occurs in the ROS callback, so I suspect it’s a context/threading problem.
Hi, I am having the same issue when doing inference in a ROS callback.
The engine is built only on the first callback and saved to a global variable.
I fixed some context issues by using ctx.push() … and then ctx.pop(),
but the GPU memory keeps growing until OOM happens.
What should I do?
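For the ctx.push() / ctx.pop() approach mentioned above, one thing that may be worth ruling out is an unbalanced push, for example when inference raises before the pop is reached. Below is a minimal sketch, assuming `cuda_ctx` is an object exposing `push()` and `pop()` as in the post. `pushed`, `FakeContext` and `callback` are hypothetical names introduced only so the example runs without a GPU. It does not by itself explain the memory growth; it only guarantees that every push is matched by a pop.

```python
from contextlib import contextmanager


@contextmanager
def pushed(cuda_ctx):
    """Make cuda_ctx current for the duration of the block, then pop it."""
    cuda_ctx.push()
    try:
        yield cuda_ctx
    finally:
        cuda_ctx.pop()  # runs even if the body raises


class FakeContext:
    """Hypothetical stand-in for a CUDA context; it only records calls."""

    def __init__(self):
        self.depth = 0
        self.calls = []

    def push(self):
        self.depth += 1
        self.calls.append("push")

    def pop(self):
        self.depth -= 1
        self.calls.append("pop")


def callback(cuda_ctx, fail=False):
    """Shape of a ROS callback: push, run inference, always pop."""
    with pushed(cuda_ctx):
        if fail:
            raise RuntimeError("inference failed")
        return "ok"  # placeholder for the real inference result
```

With a real context, the body of the `with pushed(...)` block would hold the inference call. It may also be worth checking that buffers and the execution context are created once rather than on every callback, though whether that is the cause here is only a guess.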
I’m noticing this error message has been reported elsewhere:
I am getting this error when I have TensorRT and another pycuda program both accessing the GPU (one after the other) in a ROS callback, as documented here
@zdai257 I may not have the same setup as you, but I am successfully running TensorRT inference (and other pycuda calls to the GPU) from a ROS callback, sort of.
Here’s my hacky workaround. I would love for NVIDIA devs to comment, as I’m sure there’s a better way:
Write a main loop, as is commonly done with `while not rospy.is_shutdown()`. Just before this while loop, set up your TensorRT engine, allocate all memory, start your stream, etc. Then create your context, for example with a with statement on your TensorRT engine: `with myengine.create_execution_context() as context:`
Create a very simple callback that takes the input data and adds it to a thread-safe queue.
Back in your while loop, where your context has been created, wait for this queue to be populated with data, e.g. `mydata.get(block=True, timeout=x)`. This call blocks, so there is no need to sleep inside your while loop.
With your context, memory allocation, engine, stream, and everything else set up in the same thread before your while loop, once your queue receives a message you simply call the function that runs inference, just like in any regular Python TensorRT program. The secret sauce of this approach is that the multi-threaded ROS callbacks never touch the GPU; they only feed the queue.
So the extremely simplified pseudo-code looks like:
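def ros_callback(data):
    # runs on a ROS callback thread: only hand the message over
    mydata.put(data)

def main_loop():
    # create ros subscribers, publishers, queue, etc...
    # setup engine, allocate memory, create cuda stream, etc...
    with myengine.create_execution_context() as context:
        while not rospy.is_shutdown():
            try:
                data = mydata.get(block=True, timeout=x)
            except queue.Empty:
                continue  # timed out with no message; re-check rospy.is_shutdown()
            do_inference(context, data)
    # no rospy.spin() needed: the loop above keeps the node alive until shutdown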
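The queue hand-off described above can be tried without ROS or a GPU. The following is a minimal sketch using only the standard library. `shutdown` stands in for `rospy.is_shutdown()`, `fake_inference` is a hypothetical placeholder for the real `do_inference(context, data)`, and `demo` simulates a callback thread; none of these names come from the original post.

```python
import queue
import threading

mydata = queue.Queue()        # thread-safe hand-off between threads
shutdown = threading.Event()  # stand-in for rospy.is_shutdown()


def ros_callback(data):
    """Runs on a callback thread: only enqueue, never touch the GPU here."""
    mydata.put(data)


def fake_inference(data):
    """Hypothetical placeholder for do_inference(context, data)."""
    return data * 2


def main_loop(results, timeout=0.05):
    """Runs on the thread that owns the engine, context, buffers and stream."""
    while not shutdown.is_set():
        try:
            data = mydata.get(block=True, timeout=timeout)
        except queue.Empty:
            continue  # nothing arrived in time; re-check the shutdown flag
        results.append(fake_inference(data))
        mydata.task_done()  # lets mydata.join() know this item is finished


def demo():
    """Feed five messages from another thread and process them here."""
    shutdown.clear()
    results = []

    def publisher():
        for i in range(5):
            ros_callback(i)
        mydata.join()   # wait until every queued item has been processed
        shutdown.set()  # then ask the main loop to exit

    t = threading.Thread(target=publisher)
    t.start()
    main_loop(results)  # the "GPU" work stays on the calling thread
    t.join()
    return results
```

Calling `demo()` processes all five messages on the calling thread, in the order they were queued. The timeout on `get` matters: without it the loop could block forever and never notice a shutdown request.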
I was thinking of a similar method: leaving the TensorRT engine in a main loop and having ROS only write to a buffer. This could be a useful workaround, though not so much for timing-critical tasks. I hope there will be more support for this.
I basically dumped TensorRT as it required an explicit batch size.
I think we need two contexts to use a TensorRT engine.
When you initialize the system, you create two contexts:
- context for device: self.cuda_ctx = cuda.Device(0).make_context()
- context for engine: self.context = self.engine.create_execution_context()
In the callback function, you add cuda_ctx.push() and cuda_ctx.pop() at the beginning and end of the function,
as in the following pseudo-code: