Using TensorRT 5.0, I was able to improve run-time performance. However, there is still something I am not sure I understand about running inference over multiple batches. Basically, the sample Python code provided in the docs goes like this: copy the inputs to the device, run inference, copy the outputs back to the host:
```python
# Transfer input data to the GPU.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
# Run inference.
context.execute_async(bindings=bindings, stream_handle=stream.handle)
# Transfer predictions back from the GPU.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
```
Several questions about this:
Why is 1 the default batch_size for execute_async? I would have expected it to default to max_batch_size. However, I get very poor run-time performance when I increase this batch size, so I am not sure I understand what is going on under the hood.
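To make the question concrete, here is roughly what I am doing when I bump the batch size (a minimal sketch assuming a single input and a single output binding; engine, context and stream come from the sample setup):

```python
import numpy as np
import pycuda.autoinit  # initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

# engine, context and stream are set up as in the sample above.
batch_size = engine.max_batch_size

# Pinned host buffers and device buffers sized for the *full* batch:
# volume of one sample times batch_size, not just one sample.
h_input = cuda.pagelocked_empty(
    trt.volume(engine.get_binding_shape(0)) * batch_size, dtype=np.float32)
h_output = cuda.pagelocked_empty(
    trt.volume(engine.get_binding_shape(1)) * batch_size, dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

cuda.memcpy_htod_async(d_input, h_input, stream)
# batch_size defaults to 1, so it has to be passed explicitly.
context.execute_async(batch_size=batch_size,
                      bindings=[int(d_input), int(d_output)],
                      stream_handle=stream.handle)
cuda.memcpy_dtoh_async(h_output, d_output, stream)
stream.synchronize()
```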
How would I go about asynchronously preparing the next batch on the CPU while the GPU is handling the current one? Something like this: https://www.tensorflow.org/performance/datasets_performance#pipelining
I understand something could be done by monitoring the stream, or by using the input_consumed argument of the execute_async function, but I have no idea how to do so.
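For concreteness, here is the kind of double-buffered loop I am picturing, with a full stream synchronize per iteration instead of input_consumed; load_batch, process_results, num_batches and the h_inputs/h_outputs buffer pairs are placeholders of mine, not from the sample:

```python
import pycuda.driver as cuda

# Two pinned host input buffers (and output buffers) so the CPU can fill
# one batch while the GPU is still consuming the other. Buffers are
# allocated like h_input/h_output above; load_batch(buf) is a placeholder
# that fills a pinned buffer with the next batch of data.
stream = cuda.Stream()

load_batch(h_inputs[0])  # prepare the first batch on the CPU
for i in range(num_batches):
    buf = i % 2
    # Enqueue copy-in, inference and copy-out for the current batch.
    # All three calls return immediately; the work runs on `stream`.
    cuda.memcpy_htod_async(d_input, h_inputs[buf], stream)
    context.execute_async(batch_size=batch_size,
                          bindings=[int(d_input), int(d_output)],
                          stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_outputs[buf], d_output, stream)

    # While the GPU is busy, prepare the next batch into the other buffer.
    if i + 1 < num_batches:
        load_batch(h_inputs[(i + 1) % 2])

    # Block until this batch is fully done before touching h_outputs[buf]
    # (or reusing h_inputs[buf] two iterations later).
    stream.synchronize()
    process_results(h_outputs[buf])  # placeholder
```

If I read the API right, input_consumed could let the loop reuse the input buffer as soon as TensorRT has finished reading it, instead of waiting for the full synchronize, but I cannot figure out how to wire a CUDA event into it from Python.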
Any help would be hugely appreciated.
Edit: using TRT 18.104.22.168 with CUDA 9.0 and cuDNN 7.3, with Python 2.7