Hi all,
Using TensorRT 5.0, I was able to improve run-time performance. However, there is still something I am not sure I understand regarding execution over multiple batches. Basically, the sample Python code provided in the docs goes like this: copy inputs to the device, run inference, copy outputs back to the host:
# Transfer input data to the GPU.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
# Run inference.
context.execute_async(bindings=bindings, stream_handle=stream.handle)
# Transfer predictions back from the GPU.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
Several questions about this:

- Why is 1 the default batch size for execute_async? I would have expected it to be max_batch_size. However, I get very poor run-time performance when increasing this batch size, so I'm not quite sure I understand what is behind it.
- How would I go about asynchronously preparing the next batch while the GPU is handling the current batch? Something like this: https://www.tensorflow.org/performance/datasets_performance#pipelining
  I understand something could be done by monitoring the stream, or by using the input_consumed argument of the execute_async function, but I have no idea how to do so.
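To make the second question concrete, here is a rough CPU-side sketch of the kind of pipelining I have in mind: a background thread prepares batch i+1 while the main loop is still busy with batch i. This is not TensorRT code; load_batch and the bounded-queue depth are placeholders that would be replaced by the real preprocessing plus the memcpy_htod_async/execute_async calls from the snippet above.

```python
import queue
import threading

def prefetching_batches(load_batch, num_batches, depth=2):
    """Yield batches while a background thread prepares the next ones.

    load_batch(i) is a placeholder for real batch preparation;
    depth bounds how many prepared batches can wait in the queue.
    """
    q = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))  # blocks when `depth` batches are waiting
        q.put(None)               # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch

# Toy usage: "processing" a batch here is just summing it; in the real
# pipeline this loop body would be the htod copy + execute_async + dtoh copy.
results = []
for batch in prefetching_batches(lambda i: [i] * 4, num_batches=3):
    results.append(sum(batch))
```

What I don't see is how to get the equivalent overlap on the GPU side, i.e. how to know when it is safe to start the next htod copy (which is presumably what input_consumed signals).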
Any help would be hugely appreciated.
Thanks
Edit: using TRT 5.0.0.10 with CUDA 9.0 and cuDNN 7.3, with Python 2.7