TRT 5.0 Python API: how would I go about asynchronously loading batches?

Hi all,

Using TensorRT 5.0, I was able to improve runtime performance. However, there is still something I am not quite sure I understand regarding execution over multiple batches. Basically, the sample Python code provided in the docs goes like this: copy inputs to the device, run inference, copy outputs back to the host:

# Transfer input data to the GPU.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]

# Run inference.
context.execute_async(bindings=bindings, stream_handle=stream.handle)

# Transfer predictions back from the GPU.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]

Several questions about this:

  • Why is 1 the default batch size for execute_async? I would have expected it to be max_batch_size. However, I get very poor runtime performance when increasing the batch size, so I'm not quite sure I understand what is behind it.

  • How would I go about asynchronously preparing the next batch while the GPU is handling the current one? Something like this: https://www.tensorflow.org/performance/datasets_performance#pipelining
    I understand something could be done by monitoring the stream, or by using the input_consumed argument of execute_async, but I have no idea how to do so.

Any help would be hugely appreciated

Thanks

Edit: using TRT 5.0.0.10 with CUDA 9.0 and cuDNN 7.3 on Python 2.7

Hello,

Regarding batch size: generally, performance scales with batch size, but it is network-specific. If we could see a .pb/.uff, we could investigate why that is not the case here.
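
For reference, here is a minimal sketch of running at a larger batch size with the implicit-batch API. It assumes engine, context, inputs, outputs, bindings and stream were set up as in the Python samples (buffers sized for engine.max_batch_size, e.g. via the common.py allocate_buffers helper) and that inputs[0].host already holds a full batch of data; treat it as an outline, not a tested drop-in:

import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context

batch_size = engine.max_batch_size  # buffers must have been sized for this

# Transfer the full batch of input data to the GPU.
for inp in inputs:
    cuda.memcpy_htod_async(inp.device, inp.host, stream)

# batch_size defaults to 1; pass it explicitly to run the whole batch.
context.execute_async(batch_size=batch_size,
                      bindings=bindings,
                      stream_handle=stream.handle)

# Transfer predictions back from the GPU and wait for completion.
for out in outputs:
    cuda.memcpy_dtoh_async(out.host, out.device, stream)
stream.synchronize()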

Regarding asynchronously preparing batches: we don't currently have a way to use PyCUDA events with TRT, so using input_consumed may not work with PyCUDA. You may want to consult the CUDA forum, but you could probably use two separate sets of input and output buffers and overlap context.execute_async for one set with the memcpys for the other.
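
To make the double-buffering idea concrete, here is a rough sketch. It assumes two independent sets of pinned host/device buffers, each with its own stream (e.g. by calling the samples' allocate_buffers(engine) helper twice), plus a hypothetical load_batch(i) that returns batch i as a flat array and a num_batches count; it is an outline of the idea, not tested code:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context

# Two independent buffer sets, each with its own stream.
bufs = [allocate_buffers(engine), allocate_buffers(engine)]
batch_size = engine.max_batch_size
cur = 0

# Stage the first batch on buffer set 0.
inputs, outputs, bindings, stream = bufs[cur]
np.copyto(inputs[0].host, load_batch(0).ravel())
cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)

for i in range(num_batches):
    inputs, outputs, bindings, stream = bufs[cur]
    nxt = 1 - cur

    # Launch inference + device-to-host copy for the current batch.
    context.execute_async(batch_size=batch_size,
                          bindings=bindings,
                          stream_handle=stream.handle)
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)

    # Meanwhile, prepare and upload the next batch into the *other*
    # buffer set on its own stream.
    if i + 1 < num_batches:
        n_inputs, _, _, n_stream = bufs[nxt]
        np.copyto(n_inputs[0].host, load_batch(i + 1).ravel())
        cuda.memcpy_htod_async(n_inputs[0].device, n_inputs[0].host, n_stream)

    # Wait for the current batch before reading its outputs.
    stream.synchronize()
    results = [out.host.copy() for out in outputs]  # consume results here
    cur = nxt

Because the host buffers are pinned and each buffer set has its own stream, the host-to-device copy for batch i+1 can overlap with the execution of batch i; the explicit synchronize before reusing a buffer set keeps the reads and writes safe.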

OK thanks, I’ll give it a try.