Loading batches with the TensorRT Python interface

Description

The TensorRT python samples include the following code for performing inference:

import pycuda.autoinit  # noqa: F401 -- initializes the CUDA context
import pycuda.driver as cuda

# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

Other frameworks (such as TensorFlow) have data loading mechanisms that load the next batch to the GPU while the current batch is being processed, in order to better utilize the GPU. I couldn't find TensorRT samples that work in this manner (only ones like the sample above). How do I implement such a mechanism using TensorRT's Python interface?

Environment

TensorRT Version: 7.0.0
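The desired overlap can be sketched generically. In the sketch below, `pipelined_inference` and its `copy_in` / `execute` / `copy_out` / `sync` parameters are hypothetical names I introduce for illustration: they stand in for `cuda.memcpy_htod_async`, `context.execute_async`, `cuda.memcpy_dtoh_async`, and `stream.synchronize`, each bound to one of two alternating streams and device buffers. The point is the issue order: the copy of batch i+1 is enqueued before blocking on batch i.

```python
def pipelined_inference(batches, copy_in, execute, copy_out, sync):
    """Double-buffered driver: enqueue the H2D copy of batch i+1
    before synchronizing on batch i, so that copy can overlap the
    current execution (given two streams and two sets of buffers)."""
    results = []
    if not batches:
        return results
    copy_in(0)                      # stage the first batch
    for i in range(len(batches)):
        execute(i)                  # run batch i asynchronously
        copy_out(i)                 # enqueue D2H copy of batch i's outputs
        if i + 1 < len(batches):
            copy_in(i + 1)          # prefetch batch i+1 on the other stream
        results.append(sync(i))     # only now block on batch i's stream
    return results

# Demo with stand-in callables (no GPU required): record the issue order.
log = []
results = pipelined_inference(
    batches=[0, 1, 2],
    copy_in=lambda i: log.append(("copy_in", i)),
    execute=lambda i: log.append(("execute", i)),
    copy_out=lambda i: log.append(("copy_out", i)),
    sync=lambda i: (log.append(("sync", i)), i)[1],
)
```

In the recorded log, ("copy_in", 1) appears before ("sync", 0): the next batch's transfer is issued while batch 0 may still be executing.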

Hi @trillian.2020.09.01,
Please check the link below:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#perform_inference_python

Thanks!

Thanks!
Procedure 2 in that link is similar to the example I posted. However, it seems that the next batch is copied to the GPU (cuda.memcpy_htod_async) only after stream.synchronize() has been called for the current batch. That is, only after the current batch has finished running and been copied back to the CPU, not while it is running. Am I missing something here?
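To quantify why that issue order matters, here is a toy timeline model (my own sketch, not TensorRT code; the function name and the model of separate H2D/D2H copy engines plus one compute engine are assumptions). Each batch needs an H2D copy of duration h, an execution of duration e, and a D2H copy of duration d. In the serial variant the next H2D copy waits for the previous batch's D2H to finish, as in the synchronize-first loop; in the overlapped variant it waits only for the copy engine.

```python
def makespan(n, h, e, d, overlap):
    """Total time to process n batches under a toy model with
    separate H2D/D2H copy engines and one compute engine
    (h = H2D time, e = execute time, d = D2H time per batch;
    assumes enough staging buffers for prefetching)."""
    h2d_end = exec_end = d2h_end = 0
    for i in range(n):
        # Serial: the next H2D copy starts only after the previous
        # batch has fully finished. Overlapped: it starts as soon
        # as the copy engine is free.
        h2d_start = h2d_end if overlap else d2h_end
        h2d_end = h2d_start + h
        exec_end = max(h2d_end, exec_end) + e
        d2h_end = max(exec_end, d2h_end) + d
    return d2h_end

serial = makespan(4, h=2, e=3, d=1, overlap=False)    # 4 * (2+3+1) = 24
pipelined = makespan(4, h=2, e=3, d=1, overlap=True)  # copies mostly hidden: 15
```

Once execution dominates, the transfers are almost fully hidden behind compute, which is exactly the benefit the serial loop gives up.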

Hi @trillian.2020.09.01,

You can run the TensorRT models with separate execution contexts.
You can find some suggestions for using TensorRT with multiple threads here:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#thread-safety
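A minimal sketch of such a prefetch thread, under stated assumptions: `prefetching_loader` and `load_batch` are hypothetical names, the batches here are plain Python objects so the pattern runs without a GPU, and a real consumer would call do_inference on its own execution context per thread, as the thread-safety link describes.

```python
import queue
import threading

def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches prepared by a background thread. load_batch(i)
    stands in for host-side work (decoding, preprocessing, filling a
    pinned buffer); the bounded queue keeps at most `depth` batches
    staged ahead of the consumer."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def worker():
        for i in range(num_batches):
            q.put(load_batch(i))    # blocks once `depth` batches are staged
        q.put(sentinel)             # signal end of data

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# Demo: a real consumer would run do_inference on each batch;
# here it just doubles the values.
outputs = [b * 2 for b in prefetching_loader(lambda i: i + 1, num_batches=3)]
```

The bounded queue is the design choice that matters: it lets the loader run ahead of inference without staging an unbounded number of batches in host memory.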

Thanks!

Thanks! So what you suggest is that I write a second thread that prefetches the data for the next batch?
Is there any built-in mechanism or code sample for that? (I guess this scenario is quite common among TensorRT users who care about low inference latency, since any time spent copying data around reduces the performance gains from optimizing the network with TensorRT.)

Hi @trillian.2020.09.01,
I am afraid we don't have any sample available.
However, you may find some assistance in the link below

Thanks!