TRT 5.0 Python API: how would I go about asynchronously loading batches?

Hi all,

Using TensorRT 5.0, I was able to improve runtime performance. However, there is still something I am not quite sure I understand regarding execution over multiple batches. Basically, the sample Python code provided in the docs goes like this: copy inputs to the device, run inference, copy outputs back to the host:

# Transfer input data to the GPU.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]

# Run inference.
context.execute_async(bindings=bindings, stream_handle=stream.handle)

# Transfer predictions back from the GPU.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]

Several questions about this:

  • Why is 1 the default batch size for execute_async? I would have expected it to be max_batch_size. However, I get very poor runtime performance when increasing the batch size, so I'm not quite sure I understand what is behind it.

  • How would I go about asynchronously preparing the next batch while the GPU is handling the current one? Something like this: https://www.tensorflow.org/performance/datasets_performance#pipelining
    I understand something could be done by monitoring the stream, or by using the input_consumed argument of execute_async, but I have no idea how to do so.

Any help would be hugely appreciated

Thanks

Edit: using TRT 5.0.0.10 with CUDA 9.0 and cuDNN 7.3 on Python 2.7

Hello,

Regarding batch size: generally, performance scales with batch size, but it is network-specific. If we could see a .pb/.uff, we could investigate why that is not the case here.
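
For reference, here is a minimal sketch of running at a larger batch size with the implicit-batch API. It assumes engine, context, inputs, outputs, bindings and stream were set up as in the Python samples (buffers sized for engine.max_batch_size, e.g. via the common.py allocate_buffers helper) and that inputs[0].host already holds a full batch of data; treat it as an outline, not a tested drop-in:

import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context

batch_size = engine.max_batch_size  # buffers must have been sized for this

# Transfer the full batch of input data to the GPU.
for inp in inputs:
    cuda.memcpy_htod_async(inp.device, inp.host, stream)

# batch_size defaults to 1; pass it explicitly to run the whole batch.
context.execute_async(batch_size=batch_size,
                      bindings=bindings,
                      stream_handle=stream.handle)

# Transfer predictions back from the GPU and wait for completion.
for out in outputs:
    cuda.memcpy_dtoh_async(out.host, out.device, stream)
stream.synchronize()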

Regarding asynchronously preparing batches: we don't currently have a way to use PyCUDA events with TRT, so using input_consumed may not work with PyCUDA. You may want to consult the CUDA forum, but you could probably use two separate sets of input and output buffers and overlap context.execute_async for one set with the memcpys for the other.
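
To make the double-buffering idea concrete, here is a rough sketch. It assumes two independent sets of pinned host/device buffers, each with its own stream (e.g. by calling the samples' allocate_buffers(engine) helper twice), plus a hypothetical load_batch(i) that returns batch i as a flat array and a num_batches count; it is an outline of the idea, not tested code:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context

# Two independent buffer sets, each with its own stream.
bufs = [allocate_buffers(engine), allocate_buffers(engine)]
batch_size = engine.max_batch_size
cur = 0

# Stage the first batch on buffer set 0.
inputs, outputs, bindings, stream = bufs[cur]
np.copyto(inputs[0].host, load_batch(0).ravel())
cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)

for i in range(num_batches):
    inputs, outputs, bindings, stream = bufs[cur]
    nxt = 1 - cur

    # Launch inference + device-to-host copy for the current batch.
    context.execute_async(batch_size=batch_size,
                          bindings=bindings,
                          stream_handle=stream.handle)
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)

    # Meanwhile, prepare and upload the next batch into the *other*
    # buffer set on its own stream.
    if i + 1 < num_batches:
        n_inputs, _, _, n_stream = bufs[nxt]
        np.copyto(n_inputs[0].host, load_batch(i + 1).ravel())
        cuda.memcpy_htod_async(n_inputs[0].device, n_inputs[0].host, n_stream)

    # Wait for the current batch before reading its outputs.
    stream.synchronize()
    results = [out.host.copy() for out in outputs]  # consume results here
    cur = nxt

Because the host buffers are pinned and each buffer set has its own stream, the host-to-device copy for batch i+1 can overlap with the execution of batch i; the explicit synchronize before reusing a buffer set keeps the reads and writes safe.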

OK thanks, I’ll give it a try.