The TensorRT Python samples include the following code for performing inference:
```python
# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
```
Other frameworks (such as TensorFlow) have data-loading mechanisms that copy the next batch to the GPU while the current batch is still being processed, in order to better utilize the GPU. I couldn't find TensorRT samples that work this way (only ones like the sample above, which block on the stream every batch). How do I implement such a mechanism using TensorRT's Python interface?
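Conceptually, is something like the following double-buffering sketch the right approach? It uses two CUDA streams (one for inference, one for prefetching the next input) and ping-pong device input buffers, so the H2D copy of batch i+1 overlaps with the compute of batch i. This is just a rough sketch of what I have in mind, not working code: `batches`, `input_nbytes`, `out_shape`, and `out_dtype` are placeholders, and I'm assuming an implicit-batch engine with one input and one output binding.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

def pipelined_inference(context, batches, input_nbytes, out_shape, out_dtype):
    """batches: list of pagelocked numpy arrays, one per batch
    (page-locked host memory is required for truly async copies)."""
    compute = cuda.Stream()  # runs inference + output copies
    copy = cuda.Stream()     # prefetches the next batch's input
    copied = [cuda.Event(), cuda.Event()]  # "H2D into slot i is done"

    # Ping-pong device input buffers: while inference reads one,
    # the next batch is copied into the other.
    d_in = [cuda.mem_alloc(input_nbytes) for _ in range(2)]
    d_out = cuda.mem_alloc(int(np.prod(out_shape)) * np.dtype(out_dtype).itemsize)
    h_out = cuda.pagelocked_empty(out_shape, out_dtype)

    # Prime the pipeline with the first batch.
    cuda.memcpy_htod_async(d_in[0], batches[0], copy)
    copied[0].record(copy)

    results = []
    for i in range(len(batches)):
        slot = i % 2
        # Inference must not start before its input copy has finished.
        compute.wait_for_event(copied[slot])
        context.execute_async(batch_size=1,
                              bindings=[int(d_in[slot]), int(d_out)],
                              stream_handle=compute.handle)
        cuda.memcpy_dtoh_async(h_out, d_out, compute)

        # Enqueue the H2D copy of the *next* batch on the copy stream;
        # it runs concurrently with the inference enqueued above.
        if i + 1 < len(batches):
            nxt = (i + 1) % 2
            cuda.memcpy_htod_async(d_in[nxt], batches[i + 1], copy)
            copied[nxt].record(copy)

        # The synchronize at the end of each iteration also guarantees that
        # the slot we prefetch into next iteration is no longer being read.
        compute.synchronize()
        results.append(h_out.copy())
    return results
```

Is this safe with a single execution context, or is there a more idiomatic way to do this overlap in TensorRT?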
TensorRT Version: 7.0.0